Sampling Profilers: Statistical Performance Analysis
Sampling profilers inspect your program at fixed intervals (py-spy defaults to 100 samples per second, one every 10 milliseconds) and record the current call stack, building a statistical picture of where time is spent. Unlike deterministic profilers such as cProfile, which hook every function call and return (high overhead), sampling profilers capture snapshots (low overhead, typically under 5%) and infer hotspots from how often each function appears. They're ideal for profiling production code, long-running processes, and applications where 10–50% overhead is unacceptable.
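To make the sampling idea concrete, here is a minimal in-process sketch. It is not how py-spy is implemented (py-spy reads the target's memory from a separate process); it only illustrates the principle of periodically recording what a thread is running. The helper names `sample_stacks` and `busy_loop` are made up for this illustration.

```python
import collections
import sys
import threading
import time


def sample_stacks(target_thread_id, interval_s, stop_event, counts):
    """Record the innermost function of the target thread every interval_s seconds."""
    while not stop_event.is_set():
        frame = sys._current_frames().get(target_thread_id)
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(interval_s)


def busy_loop(duration_s):
    """Burn CPU for roughly duration_s seconds so the sampler has something to see."""
    deadline = time.perf_counter() + duration_s
    total = 0
    while time.perf_counter() < deadline:
        total += 1
    return total


counts = collections.Counter()
stop_event = threading.Event()
sampler = threading.Thread(
    target=sample_stacks,
    args=(threading.get_ident(), 0.01, stop_event, counts),  # 10 ms interval
    daemon=True,
)
sampler.start()
busy_loop(0.5)
stop_event.set()
sampler.join()

# Each function's share of samples approximates its share of wall-clock time.
total_samples = sum(counts.values())
for name, hits in counts.most_common():
    print(f"{name}: {hits / total_samples:.0%} of {total_samples} samples")
```

The sampler never counts calls; it only asks "what is running right now?" a few dozen times, which is why overhead stays low and why very short functions can be missed.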
I switched to sampling profilers after a production database migration took 18 hours instead of the expected 2 hours. Deterministic profiling added so much overhead that the code ran at half speed, giving useless measurements. Sampling with py-spy revealed the actual bottleneck in 5 minutes with near-zero production impact: a missing index on a lookup query. Sampling profilers are your weapon for real-world performance debugging.
Sampling vs. Deterministic Profiling: When to Use Each
| Aspect | Deterministic (cProfile) | Sampling (py-spy) |
|---|---|---|
| Overhead | 10–50% (significant) | <5% (minimal) |
| Accuracy | Exact: counts every call | Statistical: infers from samples |
| Output size | Large (millions of calls) | Compact (call stacks) |
| Best for | Development, identifying bottlenecks | Production, long-running processes |
| Short fast functions | Measured accurately | May be missed (sampling too sparse) |
| Async code | Problematic (counts many internal calls) | Works well (sees actual user code) |
Use cProfile in development to identify the slow function. Use py-spy to confirm the bottleneck in production or when overhead matters.
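For the development half of that workflow, a minimal cProfile sketch might look like the following. The functions `slow_function`, `fast_function`, and `workload` are placeholders invented for the example; only the `cProfile` and `pstats` calls come from the standard library.

```python
import cProfile
import io
import pstats


def slow_function():
    """Deliberately slow: sums squares in a Python-level generator."""
    return sum(i * i for i in range(200_000))


def fast_function():
    return 42


def workload():
    slow_function()
    fast_function()


profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Print the five entries with the highest cumulative time into a string buffer.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

The report lists exact call counts for every function, which is the precision you pay for with the higher overhead.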
Installing py-spy
pip install py-spy
Or download a pre-built binary from the GitHub releases page.
Running py-spy: Record and View
Option 1: Record a script to a flamegraph (covered in the next article):
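py-spy record -o profile.svg --duration 60 -- python your_script.py
This records your script for up to 60 seconds (recording also ends when the script exits) and saves a flamegraph as an SVG file (viewable in a web browser). The -- separates py-spy's own options from the command being profiled.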
Option 2: Attach to a running process:
py-spy record -o profile.svg -p 12345
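Replace 12345 with the process ID (PID). This is production magic: monitor a running server without stopping it or modifying code. Attaching to another process often requires elevated privileges (sudo on many Linux setups, and always root on macOS).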
Option 3: View call stacks in the terminal:
py-spy top -p 12345
This opens a real-time, top-like interface showing which functions are consuming CPU right now.
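If you trigger recordings from automation rather than typing the command by hand, the attach command from Option 2 can be assembled in Python. This is only a sketch: `build_record_command` and `record_profile` are hypothetical helper names, and the example prints the command instead of running it, because attaching needs py-spy on the PATH and possibly elevated privileges.

```python
import os
import shutil
import subprocess


def build_record_command(pid, output_path, duration_s):
    """Return the argv list for `py-spy record` against an existing process."""
    return [
        "py-spy", "record",
        "--pid", str(pid),
        "--output", output_path,
        "--duration", str(duration_s),
    ]


def record_profile(pid, output_path, duration_s):
    """Run py-spy and raise if it is missing or exits with an error."""
    if shutil.which("py-spy") is None:
        raise RuntimeError("py-spy not found on PATH")
    subprocess.run(build_record_command(pid, output_path, duration_s), check=True)


# Build (but do not run) a 30-second recording command for this process.
command = build_record_command(os.getpid(), "profile.svg", 30)
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems when the output path contains spaces.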
Example: Profiling a Data Processing Script
Create a script with intentional bottlenecks:
import time
import json


def fetch_data():
    """Simulate I/O: reading from disk or network."""
    time.sleep(0.1)
    return [{"id": i, "value": i * 2} for i in range(1000)]


def parse_json(data):
    """Simulate JSON work by serializing each record."""
    return [json.dumps(d) for d in data]


def aggregate(data):
    """Aggregate data by computing a sum."""
    total = sum(int(d["id"]) for d in data)
    return total


def main():
    """Main workflow."""
    for iteration in range(10):
        print(f"Iteration {iteration}...")
        data = fetch_data()
        json_data = parse_json(data)
        result = aggregate(data)
        print(f"  Result: {result}")


if __name__ == "__main__":
    main()
Run with py-spy:
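py-spy record -o profile.svg --duration 15 --idle -- python script.py
This creates profile.svg, a flamegraph showing where the script spent its time. The script finishes in about a second, so recording stops well before the 15-second limit. By default py-spy samples only threads that are actively running; the --idle flag includes sleeping threads, so the time.sleep call in fetch_data shows up too. Open the SVG in a web browser. The width of each block represents the share of samples in that function relative to the total.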
Real-Time Monitoring with py-spy top
For interactive, real-time profiling of a running process:
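py-spy top --idle -p $(pgrep -f 'python your_script.py')
Simplified, illustrative output (the real display refreshes continuously and reports %Own, %Total, OwnTime, and TotalTime for each function; --idle is needed for the sleep to appear):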
GIL: 0.00%, Active Threads: 1
%CPU Function Name
40.2 fetch_data.sleep
30.1 parse_json.json.dumps
20.3 aggregate.sum
9.4 main
You immediately see that fetch_data (waiting in sleep) accounts for about 40% of the samples and parse_json (serializing with json.dumps) for about 30%. The fix: overlap the I/O in fetch_data and consider a faster JSON library for parse_json.
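One way to overlap the I/O is a thread pool. The sketch below is an assumption about how the script could be restructured, not the only fix: `fetch_data` gains a hypothetical `batch` argument so that `map()` has something to pass, and `max_workers=10` is an arbitrary choice.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_data(batch):
    """Stand-in for the I/O-bound fetch: sleeps, then returns records."""
    time.sleep(0.1)
    return [{"id": i, "value": i * 2, "batch": batch} for i in range(1000)]


def aggregate(data):
    """Sum the ids, as in the original script."""
    return sum(d["id"] for d in data)


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    # map() preserves input order, so results line up with batch numbers.
    batches = list(pool.map(fetch_data, range(10)))
elapsed = time.perf_counter() - start

results = [aggregate(batch) for batch in batches]
print(f"Fetched {len(batches)} batches in {elapsed:.2f}s")
```

Threads help here because the time is spent waiting, not computing; the sleeping threads release the GIL, so the ten waits overlap instead of adding up.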
Interpreting Flamegraph Output
A flamegraph is a visualization where:
- X-axis represents the share of samples (wider = more time). The left-to-right order is not chronological.
- Y-axis represents call stack depth. py-spy draws the root at the top, so deeper calls appear lower.
- Color is arbitrary (helps distinguish frames).
- Each box is a function; clicking it zooms in.
Example flamegraph structure:
┌─────────────────────────────────────────────────┐
│ main [main] │
├─────────────────────────────────────────────────┤
│ fetch_data ││ parse_json ││ agg │
├─────────────────────────────────────────────────┤
│ sleep requests │json.dumps │json.loads │ sum │
└─────────────────────────────────────────────────┘
Wide blocks = time sinks. Your optimization priorities are the widest blocks.
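Block widths come directly from sample counts. The sketch below uses ten made-up stack samples to show how identical stacks are merged (the "folded" form that flamegraph tools consume) and how a function's width is its share of all samples. The sample data and the `inclusive_share` helper are illustrative assumptions.

```python
from collections import Counter

# Hypothetical samples: each tuple is one captured call stack, outermost first.
samples = (
    [("main", "fetch_data", "sleep")] * 4
    + [("main", "parse_json", "dumps")] * 3
    + [("main", "aggregate")] * 2
    + [("main",)] * 1
)

# Identical stacks collapse into one entry; the count determines the block width.
folded = Counter(";".join(stack) for stack in samples)


def inclusive_share(function_name):
    """Fraction of samples in which function_name appears anywhere on the stack."""
    hits = sum(1 for stack in samples if function_name in stack)
    return hits / len(samples)


for stack, count in folded.most_common():
    print(f"{stack} {count}")

print(f"fetch_data width: {inclusive_share('fetch_data'):.0%}")
print(f"main width: {inclusive_share('main'):.0%}")
```

A parent block is always at least as wide as its children combined, because every sample that contains the child also contains the parent.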
Key Advantage: Profiling Blocking I/O Accurately
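Deterministic profilers report I/O waits as flat per-function totals. A sampling profiler keeps the whole call stack for every sample, so you can see which call path the waiting belongs to. With py-spy, pass --idle so that blocked threads are included in the samples: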
import time


def fetch_external_api():
    """Blocking HTTP request."""
    # This sleep simulates network latency
    time.sleep(2)
    return {"data": "result"}


def process():
    for i in range(10):
        result = fetch_external_api()
    return result


process()
With cProfile, you'd see:
10 calls to fetch_external_api, 20 seconds cumulative
10 calls to time.sleep, 20 seconds total, listed as a separate flat row
With py-spy (run with --idle), you'd see:
20 seconds in fetch_external_api
20 seconds in sleep, nested under fetch_external_api (waiting on I/O)
0 seconds in processing
The breakdown reveals that all the time is spent waiting on I/O; the fix is concurrency (threads or async/await) or caching, not faster Python code.
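As a sketch of the async/await route, the version below replaces the blocking sleep with `asyncio.sleep` and issues all ten requests concurrently. The `request_id` parameter and the shortened 0.2-second delay are assumptions made to keep the example quick; a real client would use an async HTTP library instead of a sleep.

```python
import asyncio
import time


async def fetch_external_api(request_id):
    """Hypothetical non-blocking variant: asyncio.sleep stands in for network latency."""
    await asyncio.sleep(0.2)
    return {"id": request_id, "data": "result"}


async def process():
    # gather() schedules all ten requests before waiting, so the waits overlap.
    return await asyncio.gather(*(fetch_external_api(i) for i in range(10)))


start = time.perf_counter()
results = asyncio.run(process())
elapsed = time.perf_counter() - start
print(f"{len(results)} responses in {elapsed:.2f}s")
```

Run sequentially, ten 0.2-second waits would take about two seconds; run concurrently, the total is close to a single wait.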
Programmatic Sampling with statprof
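For low-overhead profiling during development, statprof samples the call stack from inside your own process, so you choose exactly which region of code is measured. It is an older package, so check that the release you install supports your Python version: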
pip install statprof
import statprof


def slow_operation():
    total = 0
    for i in range(10_000_000):
        total += i
    return total


statprof.start()
slow_operation()
statprof.stop()
statprof.display()
Output shows call stacks with time percentages, similar to py-spy but integrated into your code.
Best Practices for Production Profiling
1. Profile during load. Profiling an idle service is useless. Run profiling during peak traffic or a realistic load test.
2. Use short durations. Even low-overhead sampling adds CPU. Record for 10–60 seconds, not hours.
3. Save results immediately. Use py-spy record -o file.svg to save rather than streaming output, which adds overhead.
4. Combine with metrics. Cross-reference profiling results with performance metrics (response time, throughput, error rate) to confirm findings.
5. Profile multiple times. A single run can be dominated by noise. Run 3–5 times and compare. If a function is in the top 5 every run, it's a real bottleneck.
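The "top 5 in every run" rule from the last practice can be checked mechanically once you have per-function sample counts for each run. The sketch below assumes you have already extracted those counts into dictionaries; the function names and numbers are invented for illustration.

```python
def top_functions(sample_counts, n=5):
    """Return the names of the n functions with the most samples in one run."""
    ranked = sorted(sample_counts.items(), key=lambda item: item[1], reverse=True)
    return {name for name, _ in ranked[:n]}


def consistent_hotspots(runs, n=5):
    """Functions that rank in the top n of every run."""
    top_sets = [top_functions(run, n) for run in runs]
    return set.intersection(*top_sets) if top_sets else set()


# Hypothetical sample counts from three separate profiling runs.
runs = [
    {"parse_json": 310, "fetch_data": 120, "aggregate": 60, "log_event": 40},
    {"parse_json": 295, "aggregate": 70, "fetch_data": 30, "gc_collect": 90},
    {"parse_json": 330, "fetch_data": 110, "aggregate": 55, "render": 80},
]

print(sorted(consistent_hotspots(runs, n=2)))
```

Functions that appear in only one run's top list (such as the made-up `gc_collect` here) are more likely noise or a one-off event than a stable bottleneck.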
Example: Profiling a Flask Web Server
from flask import Flask

app = Flask(__name__)


def expensive_computation(n):
    total = 0
    for i in range(n):
        total += i ** 2
    return total


@app.route("/compute/<int:n>")
def compute(n):
    result = expensive_computation(n)
    return {"result": result}


if __name__ == "__main__":
    app.run(debug=False)
Run the server:
python flask_app.py
In another terminal, attach py-spy:
py-spy record -o flask_profile.svg -p $(pgrep -f 'flask_app.py') --duration 30
Then make requests in a third terminal:
for i in {1..100}; do curl "http://localhost:5000/compute/100000"; done
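The flamegraph will show where the server spent its running time while handling those requests during the 30-second recording. Start the requests promptly after attaching so they fall inside the window. If expensive_computation dominates, you know where to optimize.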
Key Takeaways
- Sampling profilers like py-spy add minimal overhead (<5%), making them safe for production and long-running processes.
- They work by recording call stacks at fixed intervals, building a statistical picture of where time is spent.
- Use py-spy record to generate flamegraphs, or py-spy top for real-time monitoring.
- Flamegraph width represents time; wide blocks are optimization targets.
- Combine deterministic profilers (development bottleneck identification) with sampling profilers (production confirmation).
Frequently Asked Questions
Does py-spy work with async/await code?
Yes, better than deterministic profilers. Sampling sees actual user code without being confused by asyncio's many internal function calls.
Can I profile Docker containers with py-spy?
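Yes, if you pass the container's process ID from the host (use docker inspect to find the PID). Alternatively, run py-spy inside the container, which typically requires starting the container with the SYS_PTRACE capability (--cap-add SYS_PTRACE).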
What sampling interval should I use?
The default (100 samples per second, one every 10 ms) is fine for most cases. For higher resolution, sample every 5 ms with --rate 200 (200 samples/second). For lower overhead, sample every 20 ms with --rate 50.
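py-spy's --rate option takes samples per second rather than an interval, so the conversion is a one-liner. The helper names below are made up for this sketch, and `expected_samples` is only an upper bound, since idle threads are skipped by default.

```python
def interval_ms_to_rate(interval_ms):
    """Convert a sampling interval in milliseconds to samples per second."""
    return round(1000 / interval_ms)


def expected_samples(rate_hz, duration_s):
    """Upper bound on samples collected per thread over a recording."""
    return rate_hz * duration_s


for interval_ms in (5, 10, 20):
    rate = interval_ms_to_rate(interval_ms)
    print(f"{interval_ms} ms -> --rate {rate} ({expected_samples(rate, 30)} samples in 30 s)")
```

More samples give finer resolution for short functions at the cost of more work per second for the profiler.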
How do I interpret a flamegraph with many thin frames?
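Thin frames are functions that appeared in few samples, either because they run briefly or are called rarely. Focus on wide frames first. Keep in mind that the same function can appear as many thin frames under different callers, and those can add up. If you want to inspect a thin frame, click it to zoom in.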