Flame Graphs for Python: Visualize CPU Time

Flamegraphs are a visualization technique for call stacks that makes CPU hotspots immediately obvious. Instead of reading tables of functions, you see a chart where each function's width represents time spent: wider boxes mean more time. Nesting shows the call stack: if function A calls function B, B appears directly above A. This format turns a long read through profiler tables into a quick visual scan for where optimization matters most. Flamegraphs were invented by Brendan Gregg for systems performance analysis, and they're equally powerful for Python profiling.

Flamegraphs transformed how I approach performance work. Rather than staring at cProfile output and trying to mentally rank functions by cumtime, I load a flamegraph and the hotspots leap out. In one case a JSON parsing function filled 60% of the width, and I could immediately see that it sat on top of a data-ingestion loop that called it 100,000 times. The visual representation made the problem and the solution obvious in seconds.

How Flamegraphs Work: Reading the Visualization

Each flamegraph element represents a function in a call stack:

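Width: The share of samples in which that function (or one of its callees) was on the stack. A frame spanning 50% of the width accounts for roughly 50% of the profiled time. Horizontal position carries no timing information: the x-axis is not a timeline, and sibling frames are typically ordered alphabetically.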

Vertical position (Y-axis): Call stack depth. Functions at the bottom are called first; functions at the top are deep in the call tree.

Nesting: If function B is inside function A, B was called by A. Reading from bottom to top traces the call path.

Color: Usually arbitrary (helps distinguish frames), but sometimes color-coded by module or function type (e.g., red = CPU, blue = I/O).
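To make the width rule concrete, here is a minimal sketch that computes frame widths from sampled stacks. It assumes the semicolon-separated "folded" stack format that flamegraph tools commonly consume; the sample data and the frame_widths helper are hypothetical. Note that it merges frames by name, whereas a real flamegraph draws the same function separately for each distinct call path.

```python
from collections import Counter

# Hypothetical sampled stacks in "folded" form: frames joined by ';' from
# the entry point to the deepest call, followed by a sample count.
FOLDED = """\
main;process_records;parse_data;json.loads 30
main;process_records;parse_data 20
main;process_records;filter 25
main;other 25
"""


def frame_widths(folded: str) -> dict:
    """Return each frame's width as a fraction of all samples.

    A frame is credited with every sample in which it appears anywhere
    on the stack, so its width includes time spent in its callees.
    """
    inclusive = Counter()
    total = 0
    for line in folded.splitlines():
        if not line.strip():
            continue
        stack, count = line.rsplit(" ", 1)
        n = int(count)
        total += n
        # set() so that a recursive frame is counted once per sample.
        for frame in set(stack.split(";")):
            inclusive[frame] += n
    return {frame: n / total for frame, n in inclusive.items()}


if __name__ == "__main__":
    widths = frame_widths(FOLDED)
    for frame, width in sorted(widths.items(), key=lambda kv: -kv[1]):
        print(f"{frame:<16} {width:6.1%}")
```

The entry point always comes out at 100% because it is on every stack, which is why the bottom frame spans the full width of the chart.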

Example interpretation:

┌────────────────────────────────────────────────────┐
│ json.loads       │ regex.search   │ list.append    │  ← Top: deepest calls
├────────────────────────────────────────────────────┤
│ parse_data       │ filter                          │
├────────────────────────────────────────────────────┤
│ process_records                   │ other          │
├────────────────────────────────────────────────────┤
│ main                                               │  ← Bottom: entry point
└────────────────────────────────────────────────────┘

Legend (the sketch is not drawn to scale):
- parse_data is 50% of total time
- filter is 25% of total time
- other (the rest of main, outside process_records) is 25%
- json.loads is called by parse_data and accounts for 30% of total time, which is 30 of parse_data's 50 points

Generating Flamegraphs with py-spy

The easiest way to generate a flamegraph is with py-spy, covered in the previous article:

py-spy record -o profile.svg --duration 60 -- python your_script.py

This samples the script for up to 60 seconds (or until it exits, whichever comes first) and writes an interactive SVG file. The -- separates py-spy's own options from the command being profiled. Open profile.svg in any web browser. The flamegraph is fully interactive:

  • Click a frame to zoom in on that function and its children.
  • Click "Reset zoom" at the top to zoom back out.
  • Search: use the Search control in the SVG (or Ctrl+F) to highlight a function name across the flamegraph.
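If you launch py-spy from a script, for example in a CI job, it can help to build the command in one place. The following is a minimal sketch: py_spy_record_cmd is a hypothetical helper, not part of py-spy, and it only assembles flags that py-spy record documents (-o, --duration, --rate, --idle, --pid). The PID and file names are placeholders.

```python
import shlex


def py_spy_record_cmd(output, *, pid=None, program=None,
                      duration=None, rate=None, idle=False):
    """Build an argv list for `py-spy record` (hypothetical helper).

    Pass either pid (attach to a running process) or program
    (a command list to launch under the profiler), not both.
    """
    if (pid is None) == (program is None):
        raise ValueError("pass exactly one of pid or program")
    cmd = ["py-spy", "record", "-o", output]
    if duration is not None:
        cmd += ["--duration", str(duration)]
    if rate is not None:
        cmd += ["--rate", str(rate)]
    if idle:
        cmd.append("--idle")  # include threads that are sleeping or blocked
    if pid is not None:
        cmd += ["--pid", str(pid)]
    else:
        cmd += ["--", *program]  # everything after -- is the profiled command
    return cmd


if __name__ == "__main__":
    launch = py_spy_record_cmd(
        "profile.svg", program=["python", "your_script.py"], duration=60
    )
    attach = py_spy_record_cmd("live.svg", pid=12345, duration=30, idle=True)
    print(shlex.join(launch))
    print(shlex.join(attach))
    # To actually run one of them: subprocess.run(launch, check=True)
```

Attaching by PID is the variant you would typically use against an already running service, and it may require elevated privileges depending on the operating system.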

Interpreting a Real Flamegraph: An Example

Here's a realistic scenario: a data processing pipeline that reads, parses, and aggregates data.

import json
import re
import time

def fetch_data():
    """Simulate I/O: a slow network call returning one JSON object per line."""
    time.sleep(0.5)
    return '{"id": 1, "value": 100}\n' * 1000

def parse_json(data):
    """Parse newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in data.splitlines() if line.strip()]

def validate(records):
    """Validate records with regex: keep those whose id is all digits."""
    pattern = re.compile(r'^[0-9]+$')
    valid = [r for r in records if r and pattern.match(str(r.get('id', '')))]
    return valid

def aggregate(records):
    """Aggregate by summing values."""
    total = sum(r.get('value', 0) for r in records)
    return total

def main():
    for i in range(5):
        data = fetch_data()
        records = parse_json(data)
        records = validate(records)
        result = aggregate(records)
        print(f"Batch {i}: {result}")

if __name__ == "__main__":
    main()

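Profile it. py-spy skips idle threads by default, so pass --idle if you want the time.sleep() wait to appear in the graph:

py-spy record --idle -o pipeline.svg --duration 30 -- python pipeline.py

The flamegraph might look like this (widths are illustrative):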

Width Breakdown (approximate):
- fetch_data (sleep): 45% ← I/O bottleneck
- parse_json (json.loads): 30% ← Parsing bottleneck
- validate (regex): 15% ← Validation overhead
- aggregate (sum): 10% ← Minor

The flamegraph immediately tells you:

  1. I/O (sleep) dominates, but you can't optimize time.sleep() directly. Solution: overlap the waits with asyncio or threading.
  2. JSON parsing is 30%: consider a faster library like ujson.
  3. Regex validation is 15%; a plain string check such as str.isdigit() might be cheaper.

Without a flamegraph, you'd need to read cProfile output, manually rank functions, and mentally estimate their impact. The visual scan is much faster.
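As an illustration of the first fix, here is a minimal sketch that overlaps the waits with a thread pool. It assumes the fetches are independent of each other; fetch_sequential, fetch_concurrent, and timed are hypothetical helpers, and the delay is shortened so the demo runs quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_data(delay=0.05):
    """Stand-in for a slow network call (delay shortened for the demo)."""
    time.sleep(delay)
    return '{"id": 1, "value": 100}'


def fetch_sequential(n):
    """Fetch n batches one after another: total wait is about n * delay."""
    return [fetch_data() for _ in range(n)]


def fetch_concurrent(n, workers=5):
    """Fetch n batches on a thread pool so the waits overlap.

    Threads help here because sleeping (like waiting on a socket)
    releases the GIL; they would not speed up CPU-bound parsing.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda _: fetch_data(), range(n)))


def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    _, seq = timed(fetch_sequential, 5)
    _, con = timed(fetch_concurrent, 5)
    print(f"sequential: {seq:.2f}s  concurrent: {con:.2f}s")
```

After a change like this, re-profile: the fetch frame should shrink, and parsing will take a larger share of a shorter total.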

Interactive Flamegraph Features

Modern flamegraph tools (including the SVGs generated by py-spy) support:

  1. Zoom: Click on a frame to drill down.
  2. Reset: Click "Reset zoom" to return to the full view.
  3. Search: Ctrl+F to highlight a function. All matching frames are highlighted in color.
  4. Sort: Some implementations sort stacks left to right by time (widest on left).

Example: Searching for "json" in the flamegraph highlights all JSON-related frames, showing the total impact of JSON processing across the entire program.
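The number behind a search highlight is simply the share of samples whose stack contains a matching frame. A minimal sketch of that calculation, assuming the semicolon-separated "folded" stack format and using hypothetical sample data:

```python
import re

# Hypothetical folded stacks: "frame;frame;... <sample count>".
FOLDED = """\
main;ingest;parse_data;json.loads 30
main;ingest;parse_data 10
main;export;json.dumps 15
main;export;write_file 20
main;other 25
"""


def matched_fraction(folded, pattern):
    """Fraction of samples whose stack has a frame matching pattern."""
    regex = re.compile(pattern)
    matched = total = 0
    for line in folded.splitlines():
        if not line.strip():
            continue
        stack, count = line.rsplit(" ", 1)
        n = int(count)
        total += n
        # Count each sample once, even if several frames match.
        if any(regex.search(frame) for frame in stack.split(";")):
            matched += n
    return matched / total if total else 0.0


if __name__ == "__main__":
    print(f"json: {matched_fraction(FOLDED, 'json'):.0%} of samples")
```

Here the two JSON frames sit in different branches of the call tree, so neither looks dominant alone, but together they cover 45 of the 100 samples.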

Generating Flamegraphs Without py-spy

If py-spy isn't available, you can use cProfile with a flamegraph generator:

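python -m cProfile -o program.prof your_script.py
python -c "import pstats; pstats.Stats('program.prof').sort_stats('cumulative').print_stats(20)" > report.txt

The second command prints a non-interactive text report; running python -m pstats program.prof directly opens an interactive browser instead, so redirecting its output would hang. To get a flamegraph, convert program.prof with a converter such as flameprof or a web service. Keep in mind that cProfile records caller/callee pairs rather than full stacks and adds tracing overhead to every call, so the result is an approximation. py-spy is simpler and distorts timings less, which matters for production use.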
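The same cProfile workflow can be driven from Python with the standard library alone. This is a minimal sketch: parse_many is a hypothetical workload and profile_to_report a hypothetical helper, while cProfile.Profile, dump_stats, and pstats.Stats are the standard-library APIs.

```python
import cProfile
import io
import json
import pstats


def parse_many(n=2000):
    """Hypothetical workload: parse the same small JSON document n times."""
    doc = '{"id": 1, "value": 100}'
    return [json.loads(doc) for _ in range(n)]


def profile_to_report(func, *args, top=10, prof_path=None):
    """Run func under cProfile and return a cumulative-time report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args)
    finally:
        profiler.disable()
    if prof_path is not None:
        # Same binary format that `python -m cProfile -o` writes.
        profiler.dump_stats(prof_path)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats(pstats.SortKey.CUMULATIVE).print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_to_report(parse_many, 2000))
```

Passing prof_path saves the raw stats so that a separate converter can turn them into a flamegraph later.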

Flamegraph Best Practices

1. Profile under realistic load. A flamegraph of idle code is useless. Ensure your script is doing real work.

2. Collect enough samples. Run for 30+ seconds; at py-spy's default rate of 100 samples per second, that yields about 3,000 samples. Short runs (2–3 seconds) may miss infrequent bottlenecks.

3. Look for width anomalies. If a small function (few lines) has unexpectedly high width, investigate. It's likely called many times.

4. Trace the call stack toward the callers. If a frame is wide, look at its parent (the frame directly below) to understand why it's called so often.

5. Compare multiple runs. Profile 3–5 times. Consistent hotspots are real; one-time anomalies are noise.
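One way to apply the last practice is to note the widths of the top frames from each run and keep only those that are both wide and consistent. The sketch below assumes you have read the percentages off each profile by hand; the run data, the stable_hotspots helper, and its thresholds are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical widths (percent of samples) read off three separate profiles.
RUNS = [
    {"fetch_data": 45.0, "parse_json": 30.0, "validate": 15.0, "gc_pause": 6.0},
    {"fetch_data": 44.0, "parse_json": 31.0, "validate": 14.0},
    {"fetch_data": 46.0, "parse_json": 29.0, "validate": 16.0},
]


def stable_hotspots(runs, min_mean=5.0, max_spread=3.0):
    """Return frames that are wide on average and consistent across runs.

    Maps frame name to (mean width, standard deviation), widest first.
    """
    frames = set().union(*runs)
    result = {}
    for frame in frames:
        # A frame missing from a run is treated as 0% in that run.
        widths = [run.get(frame, 0.0) for run in runs]
        avg, spread = mean(widths), pstdev(widths)
        if avg >= min_mean and spread <= max_spread:
            result[frame] = (round(avg, 1), round(spread, 1))
    return dict(sorted(result.items(), key=lambda kv: -kv[1][0]))


if __name__ == "__main__":
    for frame, (avg, spread) in stable_hotspots(RUNS).items():
        print(f"{frame:<12} mean {avg:5.1f}%  spread {spread:4.1f}")
```

The gc_pause frame shows up in only one run, so its average falls below the threshold and it is dropped as noise.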

Flamegraph Limitations and Alternatives

Limitations:

  • Sampling-based flamegraphs are statistical; very short (microsecond) functions may be underrepresented.
  • Async/await code can be confusing (many internal frames).
  • Large flamegraphs (thousands of unique functions) become hard to read.

Alternatives:

  • Call graph diagrams: Tools like graphviz generate call graphs showing function relationships.
  • Sunburst charts: A circular version of flamegraphs; better for some people visually.
  • Timeline charts: Show function duration over time; useful for detecting performance spikes.
  • Heat maps: Color intensity represents function frequency; good for detecting hotspots across many functions.

For most Python work, flamegraphs are the best choice.

Real-World Example: Optimizing Based on Flamegraph Insights

Suppose your flamegraph shows the following (all percentages are shares of the whole profile, so child frames add up to their parent's width):

- fetch (40% width)
    - network I/O (sleep): 35%
    - deserialization: 5%

- process (50% width)
    - json.loads: 25%
    - validation: 15%
    - aggregation: 10%

- other (10% width)

Your optimizations (prioritized by width):

  1. Parallelize fetches (40% → 10%): Use asyncio or threading to fetch multiple records concurrently. Expected speedup: 3–4×.
  2. Switch to ujson (25% → 10%): Replace json.loads with ujson.loads. Expected speedup: 2×.
  3. Optimize validation (15% → 8%): Replace regex with simple string operations. Expected speedup: 2×.

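Total expected speedup: the new runtime is 10 (fetch) + 10 (json.loads) + 8 (validation) + 10 (aggregation, unchanged) + 10 (other, unchanged) = 48% of the original, so the overall speedup is about 100 / 48 ≈ 2.1×. The flamegraph guided exactly which optimizations would have the most impact.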
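The combined estimate can be checked with a few lines of arithmetic. This sketch takes the target widths from the list above, treats aggregation and other as unchanged, and assumes the three optimizations do not interact; the dictionaries and the overall_speedup helper are hypothetical.

```python
# Widths in percent of the ORIGINAL runtime, before and after each change.
BEFORE = {"fetch": 40, "json.loads": 25, "validation": 15,
          "aggregation": 10, "other": 10}
AFTER = {"fetch": 10, "json.loads": 10, "validation": 8,
         "aggregation": 10, "other": 10}


def overall_speedup(before, after):
    """Old total runtime divided by new total runtime."""
    return sum(before.values()) / sum(after.values())


if __name__ == "__main__":
    remaining = sum(AFTER.values())
    print(f"remaining runtime: {remaining}% of original")
    print(f"overall speedup:   {overall_speedup(BEFORE, AFTER):.2f}x")
```

The untouched 20% (aggregation plus other) caps the achievable gain: even if the three optimized parts dropped to zero, the speedup could not exceed 5×.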

Key Takeaways

  • Flamegraphs visualize call stacks where frame width = time spent, making CPU hotspots obvious at a glance.
  • Generate flamegraphs with py-spy record -o output.svg -- python your_script.py, then open the SVG in a web browser.
  • Click to zoom, search to filter, and stack nesting shows call relationships.
  • A single flamegraph often reveals the top 2–3 optimization opportunities immediately.
  • Combine flamegraphs with your optimization workflow: profile, identify the widest frame, optimize that path, re-profile to confirm improvement.

Frequently Asked Questions

Why is a function wide if it doesn't take long per call?

It's probably called many times. Check ncalls in cProfile, or look at the parent function directly below it in the flamegraph to see which caller drives the calls. High-call-count functions are optimization opportunities (reduce calls via caching, batching, or a better algorithm).
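Caching is often the cheapest of those fixes when the same inputs recur. A minimal sketch using functools.lru_cache from the standard library; normalize, process, and the CALLS counter are hypothetical and exist only to show how many times the function body actually runs.

```python
import re
from functools import lru_cache

CALLS = {"count": 0}  # counts real executions of the function body


@lru_cache(maxsize=None)
def normalize(code: str) -> str:
    """Hypothetical helper: cheap per call, but called very often."""
    CALLS["count"] += 1
    return re.sub(r"\s+", "", code).upper()


def process(codes):
    """Normalize every code; repeated inputs are served from the cache."""
    return [normalize(code) for code in codes]


if __name__ == "__main__":
    codes = ["ab 12", "cd 34"] * 50_000  # 100,000 calls, 2 distinct inputs
    result = process(codes)
    print(f"{len(result)} results from {CALLS['count']} real executions")
    print(normalize.cache_info())
```

Caching only helps when the function is pure and its arguments are hashable; if every input is unique, batching or a better algorithm is the more promising route.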

Can I color-code flamegraphs by module?

Yes, many flamegraph generators support color customization. py-spy's default coloring is arbitrary, but some tools let you color by module. Check the flamegraph tool's documentation.

How do I share flamegraphs with teammates?

SVG files are standalone and open in any browser. Export as PNG for presentations. Some teams use online flamegraph viewers like speedscope.app; py-spy can write that format directly with --format speedscope.

What if my flamegraph shows most time in [idle] or [os]?

You're probably profiling code that's blocked on I/O or waiting. Switch focus to async/parallelization rather than CPU optimization. Or increase your load so CPU time is measurable.

Further Reading