Flame Graphs for Python: Visualize CPU Time
Flamegraphs are a visualization of sampled call stacks that makes CPU hotspots obvious at a glance. Instead of reading tables of functions, you see a chart where each box's width represents time spent: wider boxes mean more time. Nesting shows the call stack: if function A calls function B, B appears directly above A. This visual format turns a long read through profiler tables into a quick scan for where optimization matters most. Flamegraphs were invented by Brendan Gregg for systems performance analysis, and they're equally powerful for Python profiling.
Flamegraphs transformed how I approach performance work. Rather than staring at cProfile output trying to mentally rank functions by cumtime, I load a flamegraph and the hotspots leap out visually. In one case, the widest block (a JSON parsing function at 60% of CPU time) sat directly above a data-ingestion loop that turned out to call it 100,000 times. The visual representation made the problem and solution obvious in seconds.
How Flamegraphs Work: Reading the Visualization
Each flamegraph element represents a function in a call stack:
Width: Time spent in that function (and its callees) relative to the total. A frame spanning 50% of the width accounts for 50% of total CPU time. Horizontal position carries no timing information: the X-axis is not a timeline, and sibling frames are typically ordered alphabetically.
Vertical position (Y-axis): Call stack depth. Functions at the bottom are called first; functions at the top are deep in the call tree.
Nesting: If function B sits directly on top of function A, B was called by A. Reading from bottom to top traces the call path.
Color: Usually arbitrary (helps distinguish frames), but sometimes color-coded by module or function type (e.g., red = CPU, blue = I/O).
Example interpretation:
┌──────────────────────────────────────────────────┐ ← Top: deepest calls
│ json.loads │ regex.search │ list.append │
├────────────────────────────────────────────────────┤
│ parse_data │ filter │
├────────────────────────────────────────────────────┤
│ process_records │ other │
├────────────────────────────────────────────────────┤
│ main │ ← Bottom: entry point
└────────────────────────────────────────────────────┘
Legend:
- process_records is 75% of total time (parse_data plus filter)
- parse_data is 50% of total time
- filter is 25% of total time
- other (main's remaining child) is 25% of total time
- json.loads is called by parse_data and consumes 30% of total time, out of parse_data's 50%
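To make the width rule concrete, here is a minimal sketch of how sampled stacks turn into frame widths. The sample counts are hypothetical, chosen only to reproduce the legend above; in particular, the split of filter's time between regex.search and list.append is an assumption. Counting by function name is also a simplification, since a real flamegraph keys each frame by its full stack prefix.

```python
from collections import Counter

# Hypothetical sample counts (100 samples total), root frame first.
# The 15/10 split under "filter" is an assumption for illustration.
samples = {
    ("main", "process_records", "parse_data", "json.loads"): 30,
    ("main", "process_records", "parse_data"): 20,
    ("main", "process_records", "filter", "regex.search"): 15,
    ("main", "process_records", "filter", "list.append"): 10,
    ("main", "other"): 25,
}


def inclusive_widths(samples):
    """Return each frame's width as a percentage of all samples."""
    total = sum(samples.values())
    counts = Counter()
    for stack, n in samples.items():
        # A frame is drawn under every sample whose stack contains it.
        for frame in set(stack):
            counts[frame] += n
    return {frame: 100 * n / total for frame, n in counts.items()}


widths = inclusive_widths(samples)
for frame in ("main", "process_records", "parse_data", "filter", "other", "json.loads"):
    print(f"{frame:16s} {widths[frame]:5.1f}%")
```

A frame's width includes its callees, which is why process_records comes out wider than parse_data even though it does little work itself.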
Generating Flamegraphs with py-spy
The easiest way to generate a flamegraph is with py-spy, covered in the previous article:
py-spy record -o profile.svg --duration 60 -- python your_script.py
This profiles for up to 60 seconds (or until the script exits, whichever comes first) and writes an interactive SVG file. Open profile.svg in any web browser. The flamegraph is fully interactive:
- Click a frame to zoom in on that function and its children.
- Click "Reset zoom" at the top to zoom back out.
- Search (click "Search" in the top-right corner of the SVG, or press Ctrl+F) to highlight a function name across the flamegraph.
Interpreting a Real Flamegraph: An Example
Here's a realistic scenario: a data processing pipeline that reads, parses, and aggregates data.
import json
import re
import time


def fetch_data():
    """Simulate I/O: slow network call returning newline-delimited JSON."""
    time.sleep(0.5)
    return '{"id": 1, "value": 100}\n' * 1000


def parse_json(data):
    """Parse one JSON record per line."""
    return [json.loads(line) for line in data.splitlines() if line]


def validate(records):
    """Validate records with regex."""
    pattern = re.compile(r'^[0-9]+$')
    valid = [r for r in records if r and pattern.match(str(r.get('id', '')))]
    return valid


def aggregate(records):
    """Aggregate by summing values."""
    total = sum(r.get('value', 0) for r in records)
    return total


def main():
    for i in range(5):
        data = fetch_data()
        records = parse_json(data)
        records = validate(records)
        result = aggregate(records)
        print(f"Batch {i}: {result}")


if __name__ == "__main__":
    main()
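Profile this (the --idle flag keeps samples from threads blocked in time.sleep, which py-spy otherwise leaves out):
py-spy record --idle -o pipeline.svg --duration 30 -- python pipeline.py
The flamegraph might look like this (the widths are illustrative, not measured):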
Width Breakdown (approximate):
- fetch_data (sleep): 45% ← I/O bottleneck
- parse_json (json.loads): 30% ← Parsing bottleneck
- validate (regex): 15% ← Validation overhead
- aggregate (sum): 10% ← Negligible
The flamegraph immediately tells you:
- I/O (sleep) dominates, but you can't optimize time.sleep() directly. Solution: parallelize with async or threading.
- JSON parsing is 30%; consider a faster library like ujson.
- Regex validation is 15%; an alternative like string operations might help.
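As a rough sketch of the string-operations idea for validation, the check below swaps the regex for str methods. The function names are hypothetical, and whether this is actually faster in your pipeline is something to confirm by re-profiling rather than assume.

```python
import re

ID_PATTERN = re.compile(r'^[0-9]+$')


def is_valid_id_regex(value):
    """Original approach: regex match on the stringified id."""
    return bool(ID_PATTERN.match(str(value)))


def is_valid_id_str(value):
    """String-method alternative; isascii() rejects non-ASCII digits."""
    text = str(value)
    return text.isascii() and text.isdigit()


# Both checks agree on these representative inputs.
for candidate in (1, "42", "", "4a", "٣", None):
    assert is_valid_id_regex(candidate) == is_valid_id_str(candidate)
```

One behavioral difference to be aware of: with `$`, the regex also accepts a string with a single trailing newline such as "12\n", while the string version rejects it.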
Without a flamegraph, you'd need to read cProfile output, manually rank functions, and mentally estimate their impact. The visual approach gets you to the same conclusions far faster.
Interactive Flamegraph Features
Modern flamegraph tools (including the SVGs generated by py-spy) support:
- Zoom: Click on a frame to drill down.
- Reset: Click "Reset zoom" to return to the full view.
- Search: Ctrl+F to highlight a function. All matching frames are highlighted in color.
- Sort: Some implementations sort stacks left to right by time (widest on left).
Example: Searching for "json" in the flamegraph highlights all JSON-related frames, showing the total impact of JSON processing across the entire program.
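The "total impact" number that a search reports can be approximated offline. The sketch below assumes profile data in the collapsed-stack text format (one `frame;frame;frame count` line per unique stack), which py-spy can write with `--format raw`. The sample lines, including the json.dumps frame, are made up for illustration.

```python
def matched_percent(folded_lines, term):
    """Percentage of samples whose stack contains a frame matching `term`."""
    total = matched = 0
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        # The sample count is the last space-separated field on the line.
        stack, _, count = line.rpartition(" ")
        n = int(count)
        total += n
        # Count each sample once, even if several frames match.
        if any(term in frame for frame in stack.split(";")):
            matched += n
    return 100 * matched / total if total else 0.0


# Hypothetical collapsed stacks, 100 samples in total.
folded = [
    "main;parse_json;json.loads 30",
    "main;fetch_data 45",
    "main;validate 15",
    "main;aggregate;json.dumps 10",
]
print(f"{matched_percent(folded, 'json'):.1f}%")  # 40.0%
```

Because the match is a substring test, a frame like parse_json also counts as a hit for "json", so pick the search term with that in mind.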
Generating Flamegraphs Without py-spy
If py-spy isn't available, you can use cProfile with a flamegraph generator:
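python -m cProfile -o program.prof your_script.py
python -c "import pstats; pstats.Stats('program.prof').sort_stats('cumulative').print_stats(20)" > report.txt
Then convert the saved profile (program.prof) to a flamegraph using tools like flamegraph.py or web services. Keep in mind that cProfile records caller/callee pairs rather than full call stacks, so the resulting graph is a reconstruction, and its tracing overhead inflates the cost of small, frequently called functions. py-spy is simpler and better suited to production use.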
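The same cProfile data can also be collected from inside Python, which is handy when you only want to profile one code path. This is a minimal sketch using the standard library; busy_work is a hypothetical stand-in for your own code.

```python
import cProfile
import io
import pstats


def busy_work(n=50_000):
    """Hypothetical workload standing in for your_script.py."""
    return sum(i * i for i in range(n))


profiler = cProfile.Profile()
profiler.enable()
busy_work()
profiler.disable()

# Render the ten most expensive entries by cumulative time into a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

Calling profiler.dump_stats("program.prof") instead of printing would produce the same kind of file as the command-line invocation.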
Flamegraph Best Practices
1. Profile under realistic load. A flamegraph of idle code is useless. Ensure your script is doing real work.
2. Collect enough samples. Run for 30+ seconds at py-spy's default rate of 100 samples per second. Short runs (2–3 seconds) may miss infrequent bottlenecks.
3. Look for width anomalies. If a small function (few lines) has unexpectedly high width, investigate. It's likely called many times.
4. Trace the call stack toward the root. If a frame is wide, look at its parent (the frame directly below) to understand which caller is responsible for the time.
5. Compare multiple runs. Profile 3–5 times. Consistent hotspots are real; one-time anomalies are noise.
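One way to apply the last practice is to record frame widths from each run and separate stable hotspots from one-off spikes. The sketch below uses made-up widths and an arbitrary heuristic (mean width of at least 5% and a spread under half the mean); both thresholds are assumptions to tune for your own workload.

```python
from statistics import mean, pstdev

# Hypothetical frame widths (percent of samples) from three separate runs.
runs = [
    {"parse_json": 31.0, "validate": 14.0, "gc_collect": 0.0},
    {"parse_json": 29.0, "validate": 16.0, "gc_collect": 9.0},
    {"parse_json": 30.0, "validate": 15.0, "gc_collect": 0.0},
]


def summarize(runs, min_mean=5.0):
    """Split frames into consistent hotspots and one-off anomalies."""
    frames = set().union(*runs)
    consistent, noisy = {}, {}
    for frame in sorted(frames):
        widths = [run.get(frame, 0.0) for run in runs]
        avg, spread = mean(widths), pstdev(widths)
        # Heuristic: a real hotspot is wide on average and stable across runs.
        target = consistent if avg >= min_mean and spread < avg / 2 else noisy
        target[frame] = (round(avg, 1), round(spread, 1))
    return consistent, noisy


consistent, noisy = summarize(runs)
print("consistent:", consistent)
print("noisy:", noisy)
```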
Flamegraph Limitations and Alternatives
Limitations:
- Sampling-based flamegraphs are statistical; very short (microsecond) functions may be underrepresented.
- Async/await code can be confusing (many internal frames).
- Large flamegraphs (thousands of unique functions) become hard to read.
Alternatives:
- Call graph diagrams: Tools like graphviz generate call graphs showing function relationships.
- Sunburst charts: A circular version of flamegraphs; better for some people visually.
- Timeline charts: Show function duration over time; useful for detecting performance spikes.
- Heat maps: Color intensity represents function frequency; good for detecting hotspots across many functions.
For most Python work, flamegraphs are the best choice.
Real-World Example: Optimizing Based on Flamegraph Insights
Suppose your flamegraph shows:
- fetch (40% width)
  - network I/O (sleep): 35% of total, inside fetch
  - deserialization: 5% of total, inside fetch
- process (50% width)
  - json.loads: 25% of total, inside process
  - validation: 15% of total, inside process
  - aggregation: 10% of total, inside process
- other (10% width)
Your optimizations (prioritized by width):
- Parallelize fetches (40% → 10%): Use asyncio or threading to fetch multiple records concurrently. Expected speedup: 3–4×.
- Switch to ujson (25% → 10%): Replace json.loads with ujson.loads. Expected speedup: 2×.
- Optimize validation (15% → 8%): Replace regex with simple string operations. Expected speedup: 2×.
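Total expected speedup: after the changes, the remaining widths add up to 10 (fetch) + 10 (json.loads) + 8 (validation) + 10 (aggregation) + 10 (other) = 48% of the original runtime, so the overall speedup is about 100 / 48 ≈ 2×. The flamegraph guided exactly which optimizations would have the most impact.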
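The arithmetic behind that estimate is easy to check in a few lines. This sketch takes the before and after widths from the lists above at face value; the "after" numbers are targets, not measurements, so treat the result as a projection to verify by re-profiling.

```python
# Frame widths as a percentage of the original runtime, before and after
# the proposed optimizations (the "after" values are the stated targets).
before = {"fetch": 40, "json.loads": 25, "validation": 15, "aggregation": 10, "other": 10}
after = {"fetch": 10, "json.loads": 10, "validation": 8, "aggregation": 10, "other": 10}


def projected_speedup(before, after):
    """Overall speedup if each component shrinks from `before` to `after`."""
    return sum(before.values()) / sum(after.values())


speedup = projected_speedup(before, after)
print(f"Projected speedup: {speedup:.2f}x")  # Projected speedup: 2.08x
```

Note how the untouched 20% (aggregation and other) grows to roughly 42% of the new runtime, which is why the next round of profiling often points somewhere different.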
Key Takeaways
- Flamegraphs visualize call stacks where frame width = time spent, making CPU hotspots obvious at a glance.
- Generate flamegraphs with py-spy record -o output.svg, then open in a web browser.
- Click to zoom, search to filter, and stack nesting shows call relationships.
- A single flamegraph often reveals the top 2–3 optimization opportunities immediately.
- Combine flamegraphs with your optimization workflow: profile, identify the widest frame, optimize that path, re-profile to confirm improvement.
Frequently Asked Questions
Why is a function wide if it doesn't take long per call?
It's probably called many times. Check ncalls in cProfile or look at the parent function directly below it in the flamegraph. High-call-count functions are optimization opportunities (reduce calls via caching, batching, or a better algorithm).
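Caching is often the cheapest of those fixes when the function is pure and its inputs repeat. The following is a minimal sketch with a hypothetical helper named normalize; the call counter exists only to show how many times the body actually runs.

```python
from functools import lru_cache

call_count = 0


@lru_cache(maxsize=None)
def normalize(key):
    """Hypothetical small helper that a hot loop calls repeatedly."""
    global call_count
    call_count += 1
    return key.strip().lower()


keys = ["  Alpha", "beta ", "  Alpha", "beta ", "  Alpha"]
results = [normalize(k) for k in keys]
print(results, call_count)  # five lookups, but the body ran only twice
```

This only helps when arguments are hashable and repeated often; for unique inputs, batching or a different algorithm is the better lever.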
Can I color-code flamegraphs by module?
Yes, many flamegraph generators support color customization. py-spy's default coloring is arbitrary, but some tools let you color by module. Check the flamegraph tool's documentation.
How do I share flamegraphs with teammates?
SVG files are standalone and open in any browser. Export as PNG for presentations. Some teams use online flamegraph viewers like speedscope.app.
What if my flamegraph shows most time in [idle] or [os]?
You're probably profiling code that's blocked on I/O or waiting. Switch focus to async/parallelization rather than CPU optimization. Or increase your load so CPU time is measurable.
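For the parallelization route, a thread pool is usually the smallest change for blocking I/O. This sketch assumes a hypothetical fetch function where time.sleep stands in for network latency; the worker count of 5 is arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(batch_id, delay=0.05):
    """Hypothetical I/O-bound call; time.sleep stands in for network latency."""
    time.sleep(delay)
    return batch_id


def fetch_all(batch_ids, max_workers=5):
    """Run the blocking fetches concurrently and keep results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, batch_ids))


start = time.perf_counter()
results = fetch_all(range(5))
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

Run sequentially, the five calls would take about 0.25 seconds; with five workers the waits overlap, so the total should be close to a single call's delay.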