Sampling Profilers: Statistical Performance Analysis

Sampling profilers interrupt your code at fixed intervals (typically every 10 milliseconds) and record the current call stack, building a statistical picture of where time is spent. Unlike deterministic profilers like cProfile, which count every call (high overhead), sampling profilers capture snapshots (low overhead, under 5%) and infer hotspots from frequency. They're ideal for profiling production code, long-running processes, and applications where 10–50% overhead is unacceptable.
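To make the mechanism concrete, here is a minimal sketch of a sampling profiler written in pure Python. It is an illustration only and not how py-spy is implemented (py-spy reads the target's memory from a separate process). The names `sample_stacks`, `profile`, and `busy_loop` are hypothetical, and `sys._current_frames()` is specific to CPython.

```python
import collections
import sys
import threading
import time


def sample_stacks(thread_id, interval, stop_event, counts):
    """Every `interval` seconds, record the call stack of one thread."""
    while not stop_event.is_set():
        frame = sys._current_frames().get(thread_id)
        stack = []
        while frame is not None:
            stack.append(frame.f_code.co_name)
            frame = frame.f_back
        if stack:
            # Store outermost-first so each stack reads like a flamegraph path.
            counts[tuple(reversed(stack))] += 1
        time.sleep(interval)


def profile(func, *args, interval=0.01):
    """Run func(*args) while a background thread samples the caller's stack."""
    counts = collections.Counter()
    stop_event = threading.Event()
    sampler = threading.Thread(
        target=sample_stacks,
        args=(threading.get_ident(), interval, stop_event, counts),
    )
    sampler.start()
    try:
        func(*args)
    finally:
        stop_event.set()
        sampler.join()
    return counts


def busy_loop(seconds):
    """Hypothetical hot function: spin until the deadline passes."""
    deadline = time.perf_counter() + seconds
    spins = 0
    while time.perf_counter() < deadline:
        spins += 1
    return spins


if __name__ == "__main__":
    samples = profile(busy_loop, 0.5)
    total = sum(samples.values())
    for stack, hits in samples.most_common(3):
        print(f"{100 * hits / total:5.1f}%  {';'.join(stack)}")
```

The hot function is never instrumented; it simply shows up in most of the sampled stacks, which is the whole idea behind inferring hotspots from frequency.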

I switched to sampling profilers after a production database migration took 18 hours instead of the expected 2 hours. Deterministic profiling added so much overhead that the code ran at half speed, giving useless measurements. Sampling with py-spy revealed the actual bottleneck in 5 minutes with near-zero production impact: a missing index on a lookup query. Sampling profilers are your weapon for real-world performance debugging.

Sampling vs. Deterministic Profiling: When to Use Each

| Aspect | Deterministic (cProfile) | Sampling (py-spy) |
|---|---|---|
| Overhead | 10–50% (significant) | <5% (minimal) |
| Accuracy | Exact: counts every call | Statistical: infers from samples |
| Output size | Large (millions of calls) | Compact (call stacks) |
| Best for | Development, identifying bottlenecks | Production, long-running processes |
| Short fast functions | Measured accurately | May be missed (sampling too sparse) |
| Async code | Problematic (counts many internal calls) | Works well (sees actual user code) |

Use cProfile in development to identify the slow function. Use py-spy to confirm the bottleneck in production or when overhead matters.
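For the deterministic half of that workflow, a minimal sketch using the standard library's `cProfile` and `pstats` might look like this. The helper `profile_call` and the workload `slow_sum` are hypothetical names chosen for illustration.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Hypothetical workload: sum of squares computed in pure Python."""
    return sum(i * i for i in range(n))


def profile_call(func, *args):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and keep the five most expensive entries.
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


if __name__ == "__main__":
    _, report = profile_call(slow_sum, 1_000_000)
    print(report)
```

Every call is counted here, including each step of the generator expression, which is where the extra overhead of deterministic profiling comes from.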

Installing py-spy

pip install py-spy

Or download a pre-built binary from the GitHub releases page.

Running py-spy: Record and View

Option 1: Record a script to a flamegraph (covered in the next article):

py-spy record -o profile.svg --duration 60 -- python your_script.py

This profiles your script for up to 60 seconds (or until it exits) and saves a flamegraph as an SVG file, viewable in a web browser. The -- separates py-spy's own options from the command being profiled.

Option 2: Attach to a running process:

py-spy record -o profile.svg -p 12345

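Replace 12345 with the process ID (PID). This is production magic: monitor a running server without stopping it or modifying code. Depending on the operating system and its security settings, attaching to another process may require elevated privileges (for example, sudo).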
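If you trigger recordings from a script (say, from an on-call tool), a thin wrapper around the command line is enough. The sketch below is an assumption about how you might structure it: `build_record_command` and `record` are hypothetical helpers, and only the `record` subcommand with its `--pid`, `--output`, `--duration`, and `--rate` options is used.

```python
import shutil
import subprocess


def build_record_command(pid, output="profile.svg", duration=30, rate=100):
    """Build the argument list for attaching py-spy to a running process."""
    return [
        "py-spy", "record",
        "--pid", str(pid),
        "--output", output,
        "--duration", str(duration),
        "--rate", str(rate),
    ]


def record(pid, **options):
    """Run py-spy against `pid`; raises if py-spy is missing or exits non-zero."""
    if shutil.which("py-spy") is None:
        raise RuntimeError("py-spy is not on PATH")
    return subprocess.run(build_record_command(pid, **options), check=True)


if __name__ == "__main__":
    # Print the command rather than running it, since 12345 is a placeholder PID.
    print(" ".join(build_record_command(12345, duration=10)))
```

Passing the arguments as a list avoids shell quoting problems when the output path contains spaces.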

Option 3: Watch a live function summary in the terminal:

py-spy top -p 12345

This opens a real-time, top-like interface showing which functions are consuming CPU right now.

Example: Profiling a Data Processing Script

Create a script with intentional bottlenecks:

import time
import json

def fetch_data():
    """Simulate I/O: reading from disk or network."""
    time.sleep(0.1)
    return [{"id": i, "value": i * 2} for i in range(1000)]

def parse_json(data):
    """Simulate JSON work by serializing each record."""
    return [json.dumps(d) for d in data]

def aggregate(data):
    """Aggregate data by computing a sum."""
    total = sum(int(d["id"]) for d in data)
    return total

def main():
    """Main workflow."""
    for iteration in range(10):
        print(f"Iteration {iteration}...")
        data = fetch_data()
        json_data = parse_json(data)
        result = aggregate(data)
        print(f"  Result: {result}")

if __name__ == "__main__":
    main()

Run with py-spy:

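py-spy record -o profile.svg --duration 15 -- python script.py

This creates profile.svg, a flamegraph showing which functions consumed the most CPU. The --duration value is an upper bound; recording also stops when the script exits. By default py-spy leaves out idle threads, so the time.sleep call in fetch_data appears only if you add --idle. Open the file in a web browser. The width of each block represents time spent in that function relative to the total.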

Real-Time Monitoring with py-spy top

For interactive, real-time profiling of a running process:

py-spy top -p $(pgrep -f 'python your_script.py')

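Illustrative, simplified output (the real display refreshes about once a second and reports own and total time per function; py-spy leaves out idle threads by default, so a sleeping frame appears only when idle stacks are included):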

GIL: 0.00%, Active Threads: 1
%CPU Function Name
40.2 fetch_data.sleep
30.1 parse_json.json.dumps
20.3 aggregate.sum
9.4 main

You immediately see that fetch_data (sleep) consumes 40% and parse_json consumes 30%. The fix: parallelize I/O (fetch_data) and consider a faster JSON library (parse_json).
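One way to parallelize the I/O side is a thread pool. The sketch below assumes each fetch is independent; `fetch_all` and the `batch` parameter are hypothetical additions to the example script, and the sleep still stands in for real disk or network latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_data(batch):
    """Simulate I/O for one batch: reading from disk or network."""
    time.sleep(0.1)
    return [{"id": i, "value": i * 2, "batch": batch} for i in range(1000)]


def fetch_all(batches, max_workers=10):
    """Fetch all batches concurrently; threads overlap while blocked on I/O."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_data, range(batches)))


if __name__ == "__main__":
    start = time.perf_counter()
    results = fetch_all(10)
    elapsed = time.perf_counter() - start
    print(f"Fetched {len(results)} batches in {elapsed:.2f}s")
```

Ten sequential fetches would wait about one second in total; with ten workers the waits overlap, so the wall-clock time should be close to a single fetch. Re-profile afterwards to confirm the hotspot actually moved.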

Interpreting Flamegraph Output

A flamegraph is a visualization where:

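  • X-axis represents the share of samples (wider = more time). Left-to-right order is not chronological.
  • Y-axis represents call stack depth. py-spy typically draws the root frame at the top with deeper calls below it, as in the sketch that follows.
  • Color is arbitrary (helps distinguish frames).
  • Each box is a function; clicking it zooms in.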

Example flamegraph structure:

┌─────────────────────────────────────────────────────┐
│                        main                         │
├────────────────────┬─────────────────────────┬──────┤
│     fetch_data     │       parse_json        │ agg  │
├─────────┬──────────┼────────────┬────────────┼──────┤
│  sleep  │ requests │ json.dumps │ json.loads │ sum  │
└─────────┴──────────┴────────────┴────────────┴──────┘

Wide blocks = time sinks. Your optimization priorities are the widest blocks.
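Block widths can also be computed directly from sampled stacks. The sketch below assumes the common "folded" text format (frames joined by semicolons, followed by a sample count); the `SAMPLE` data is made up for illustration, and `inclusive_shares` is a hypothetical helper.

```python
from collections import Counter

# Made-up folded stacks: "outer;inner;innermost <sample count>"
SAMPLE = [
    "main;fetch_data;sleep 40",
    "main;parse_json;dumps 30",
    "main;aggregate;sum 20",
    "main 10",
]


def inclusive_shares(folded_lines):
    """Return {function: percent of samples in which it was on the stack}."""
    totals = Counter()
    all_samples = 0
    for line in folded_lines:
        stack_text, _, count_text = line.rpartition(" ")
        count = int(count_text)
        all_samples += count
        # Use a set so a recursive function is counted once per sample.
        for name in set(stack_text.split(";")):
            totals[name] += count
    return {name: 100.0 * n / all_samples for name, n in totals.items()}


if __name__ == "__main__":
    shares = inclusive_shares(SAMPLE)
    for name, percent in sorted(shares.items(), key=lambda item: -item[1]):
        print(f"{percent:5.1f}%  {name}")
```

Each percentage corresponds to the width a function's block would have in the flamegraph, so sorting by it gives the same priority list as reading the picture.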

Key Advantage: Profiling Blocking I/O Accurately

A deterministic profiler reports blocking I/O as call counts and cumulative totals per function. A sampling profiler shows the same time as the share of samples in which each frame was waiting, so a blocked call stands out at a glance:

import time

def fetch_external_api():
    """Blocking HTTP request."""
    # This sleep simulates network latency
    time.sleep(2)
    return {"data": "result"}

def process():
    for i in range(10):
        result = fetch_external_api()
    return result

process()

With cProfile, the summary is roughly:

10 calls to fetch_external_api, 20 seconds cumulative

With py-spy (run with --idle so the blocked thread is sampled), the flamegraph shows roughly:

20 seconds in fetch_external_api
20 seconds in sleep (waiting on I/O)
0 seconds in processing

The breakdown reveals that all time is I/O; optimizing would require parallelization (async/await) or caching, not code optimization.
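As a sketch of the async/await route, the version below runs the ten simulated requests concurrently. It assumes the real HTTP client would be an async one; here `asyncio.sleep` stands in for the network wait, the delay is shortened to 0.2 seconds, and the `calls` parameter is a hypothetical addition.

```python
import asyncio
import time


async def fetch_external_api(delay=0.2):
    """Simulated non-blocking request; the sleep stands in for network latency."""
    await asyncio.sleep(delay)
    return {"data": "result"}


async def process(calls=10):
    """Start all requests at once and wait for them together."""
    return await asyncio.gather(*(fetch_external_api() for _ in range(calls)))


if __name__ == "__main__":
    start = time.perf_counter()
    results = asyncio.run(process())
    elapsed = time.perf_counter() - start
    print(f"{len(results)} responses in {elapsed:.2f}s")
```

Run back to back, ten calls at 0.2 seconds each would take about 2 seconds; gathered, they finish in roughly the time of one call.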

Programmatic Sampling with statprof

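To sample from inside your own code instead of attaching an external tool, statprof offers a simple start/stop API. It is an older, signal-based package, so check that the release you install supports your Python version and platform: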

pip install statprof

import statprof

def slow_operation():
    total = 0
    for i in range(10_000_000):
        total += i
    return total

statprof.start()
slow_operation()
statprof.stop()
statprof.display()

The output is a per-function table of time percentages, produced from inside your own process rather than by an external tool such as py-spy.

Best Practices for Production Profiling

1. Profile during load. Profiling an idle service is useless. Run profiling during peak traffic or a realistic load test.

2. Use short durations. Even low-overhead sampling adds CPU. Record for 10–60 seconds, not hours.

3. Save results immediately. Use py-spy record -o file.svg to save rather than streaming output, which adds overhead.

4. Combine with metrics. Cross-reference profiling results with performance metrics (response time, throughput, error rate) to confirm findings.

5. Profile multiple times. A single profile run can be noise. Run 3–5 times and compare. If a function is in the top 5 every run, it's a real bottleneck.
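The last practice is easy to automate once each run is reduced to per-function sample counts. The sketch below uses made-up numbers in `RUNS`, and `stable_hotspots` is a hypothetical helper; it keeps only the functions that rank in the top N of every run.

```python
# Made-up sample counts per function from three profiling runs.
RUNS = [
    {"fetch_data": 41, "parse_json": 30, "aggregate": 19, "log": 10},
    {"fetch_data": 38, "parse_json": 33, "log": 20, "aggregate": 9},
    {"fetch_data": 45, "aggregate": 25, "parse_json": 24, "log": 6},
]


def stable_hotspots(runs, top_n=5):
    """Return the functions that rank in the top_n of every run."""
    top_sets = []
    for run in runs:
        ranked = sorted(run, key=run.get, reverse=True)
        top_sets.append(set(ranked[:top_n]))
    return set.intersection(*top_sets) if top_sets else set()


if __name__ == "__main__":
    # top_n=2 keeps the toy example small; the text suggests 5 in practice.
    print(stable_hotspots(RUNS, top_n=2))
```

With these numbers only fetch_data survives all three runs at top_n=2, while parse_json and aggregate trade places, which is exactly the kind of run-to-run noise the comparison is meant to filter out.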

Example: Profiling a Flask Web Server

from flask import Flask

app = Flask(__name__)

def expensive_computation(n):
    total = 0
    for i in range(n):
        total += i ** 2
    return total

@app.route("/compute/<int:n>")
def compute(n):
    result = expensive_computation(n)
    return {"result": result}

if __name__ == "__main__":
    app.run(debug=False)

Run the server:

python flask_app.py

In another terminal, attach py-spy:

py-spy record -o flask_profile.svg -p $(pgrep -f 'flask_app.py') --duration 30

Then make requests in a third terminal:

for i in {1..100}; do curl "http://localhost:5000/compute/100000"; done

The flamegraph will show where the sampled time went while those 100 requests were served. If expensive_computation dominates, you know where to optimize.

Key Takeaways

  • Sampling profilers like py-spy add minimal overhead (<5%), making them safe for production and long-running processes.
  • They work by recording call stacks at fixed intervals, building a statistical picture of where time is spent.
  • Use py-spy record to generate flamegraphs, or py-spy top for real-time monitoring.
  • Flamegraph width represents time; wide blocks are optimization targets.
  • Combine deterministic profilers (development bottleneck identification) with sampling profilers (production confirmation).

Frequently Asked Questions

Does py-spy work with async/await code?

Yes, better than deterministic profilers. Sampling sees actual user code without being confused by asyncio's many internal function calls.

Can I profile Docker containers with py-spy?

Yes. From the host, find the container's process ID (for example with docker inspect) and pass it to py-spy. Alternatively, run py-spy inside the container, which generally requires granting the container the SYS_PTRACE capability.

What sampling interval should I use?

The default of 100 samples per second (one every 10 ms) is fine for most cases. For higher resolution, use --rate 200 (200 samples per second, one every 5 ms). For lower overhead, use --rate 50 (one every 20 ms).
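The trade-off can be worked out with a little arithmetic. The sketch below treats samples as independent draws (a simplifying binomial assumption, not something the profiler guarantees); the function names are hypothetical.

```python
import math


def interval_ms(rate_hz):
    """Time between samples, in milliseconds, for a rate in samples/second."""
    return 1000.0 / rate_hz


def expected_samples(rate_hz, duration_s, share):
    """Expected number of samples landing in a function using `share` of time."""
    return rate_hz * duration_s * share


def relative_error(rate_hz, duration_s, share):
    """Approximate relative standard error of the measured share."""
    total = rate_hz * duration_s
    return math.sqrt(share * (1 - share) / total) / share


if __name__ == "__main__":
    for rate in (50, 100, 200):
        hits = expected_samples(rate, 30, 0.01)
        error = relative_error(rate, 30, 0.01)
        print(f"{rate:3d}/s  every {interval_ms(rate):4.1f} ms  "
              f"~{hits:.0f} samples for a 1% function  (+/-{error:.0%})")
```

A function that uses 1% of the time collects only a few dozen samples in a 30-second recording, so small functions are measured noisily at any of these rates, while wide hotspots are stable long before that.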

How do I interpret a flamegraph with many thin frames?

Thin frames are functions that run briefly or are called few times. Ignore them; focus on wide frames. If you want to zoom into a thin frame, click it to expand.

Further Reading