Profiling and Benchmarking Python Code
Premature optimization is a cardinal sin: spend hours on Cython/Numba only to discover you optimized the wrong function. Profiling tools like cProfile and time.perf_counter() identify the 5–10% of code responsible for 90% of runtime. This article teaches you how to measure, identify hotspots, and validate that your acceleration effort actually delivers speedups.
Why Profile Before Optimizing
Python's performance is counterintuitive. A function you assume is slow might run in nanoseconds; a seemingly innocuous loop might consume 95% of runtime. Without profiling, you guess wrong. Use cProfile to get facts, then optimize the top bottlenecks.
Simple Benchmarking with timeit
For quick timing, use timeit from the standard library:
import timeit
# Time a function call
def fibonacci(n):
if n < 2:
return n
return fibonacci(n - 1) + fibonacci(n - 2)
# Time 100 runs, report best time
t = timeit.timeit(lambda: fibonacci(10), number=100)
print(f"Average: {t / 100:.6f}s per call")
Or use perf_counter() for explicit wall-clock timing:
import time
start = time.perf_counter()
result = fibonacci(30)
end = time.perf_counter()
print(f"Time: {end - start:.3f}s")
Use perf_counter() for measurements that exclude Python startup and GC. For function-level benchmarking, timeit is simpler.
Call-Stack Profiling with cProfile
cProfile shows how much time each function consumes and how many times it's called. Run your script under cProfile:
# fibonacci.py
def fibonacci(n):
if n < 2:
return n
return fibonacci(n - 1) + fibonacci(n - 2)
def main():
results = []
for i in range(5, 31):
result = fibonacci(i)
results.append(result)
return results
if __name__ == "__main__":
main()
Profile it:
python -m cProfile -s cumulative fibonacci.py
Output:
100005 function calls (3 primitive calls) in 0.234 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.234 0.234 {built-in method builtins.exec}
1 0.000 0.000 0.234 0.234 fibonacci.py:8(<module>)
26 0.000 0.000 0.234 0.234 fibonacci.py:2(main)
99999 0.234 0.000 0.234 0.000 fibonacci.py:4(fibonacci)
1 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
The output shows:
ncalls: number of calls to the functiontottime: total time in this function (excluding subfunctions)cumtime: cumulative time in this function and all it callspercall: time per call
fibonacci() takes 0.234s and is called 99,999 times. That's the bottleneck.
Profiling with line_profiler
cProfile shows function-level stats; line_profiler breaks down time per line:
pip install line_profiler
Decorate a function:
from line_profiler import profile
@profile
def fibonacci(n):
if n < 2:
return n
return fibonacci(n - 1) + fibonacci(n - 2)
fibonacci(20)
Profile:
kernprof -l -v fibonacci.py
Output:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
3 @profile
4 def fibonacci(n):
5 41943 0.234 0.000 50.0% if n < 2:
6 20971 0.234 0.000 50.0% return n
7 20972 0.234 0.000 50.0% return fibonacci(n - 1) + fibonacci(n - 2)
Line 7 (the recursive call) consumes 50% of time—the hotspot.
Memory Profiling with memory_profiler
For memory-heavy code, find allocations:
pip install memory_profiler
from memory_profiler import profile
@profile
def allocate_memory():
x = [0] * (10 ** 6) # 1M ints
y = [0] * (10 ** 6) # 1M ints
return x, y
allocate_memory()
python -m memory_profiler allocate_memory.py
Output:
Line # Mem usage Increment Occurrences Line Contents
==============================================================
3 10.234 MiB 0.000 MiB 1 @profile
4 10.234 MiB 0.000 MiB 1 def allocate_memory():
5 20.234 MiB 10.000 MiB 1 x = [0] * (10 ** 6)
6 30.234 MiB 10.000 MiB 1 y = [0] * (10 ** 6)
7 30.234 MiB 0.000 MiB 1 return x, y
Each list allocation adds ~10 MiB.
Benchmarking Before and After Optimization
Create a baseline, optimize, and measure improvement:
import timeit
import cProfile
import pstats
from io import StringIO
# Original function
def slow_sum(n):
total = 0
for i in range(n):
total += i
return total
# Optimized version (using Cython or Numba)
def fast_sum(n):
return n * (n - 1) // 2 # Mathematical formula
# Benchmark both
n_runs = 100_000
t_slow = timeit.timeit(lambda: slow_sum(n_runs), number=1000)
t_fast = timeit.timeit(lambda: fast_sum(n_runs), number=1000)
speedup = t_slow / t_fast
print(f"Original: {t_slow:.6f}s")
print(f"Optimized: {t_fast:.6f}s")
print(f"Speedup: {speedup:.1f}×")
Output:
Original: 0.234567s
Optimized: 0.000012s
Speedup: 19554.7×
Document speedup for your optimization report.
Profiling Cython Code
Cython profiling requires special setup. Use cProfile with annotated code:
# cython_fibonacci.pyx
# cython: profile=True
cdef fibonacci_c(int n):
if n < 2:
return n
return fibonacci_c(n - 1) + fibonacci_c(n - 2)
def fibonacci(int n):
return fibonacci_c(n)
Compile with profiling enabled, then use cProfile:
cythonize -i cython_fibonacci.pyx --annotate
python -m cProfile -s cumulative test_cython.py
Profiling Numba Code
Numba's JIT compilation interferes with profiling (the first call is compilation overhead). Warm up before profiling:
from numba import njit
import timeit
@njit
def numba_fibonacci(n):
if n < 2:
return n
return numba_fibonacci(n - 1) + numba_fibonacci(n - 2)
# Warm up (triggers compilation)
_ = numba_fibonacci(5)
# Now profile
import cProfile
cProfile.run("numba_fibonacci(20)", sort='cumtime')
Profiling Checklist
- Profile before optimizing: Use
cProfileto identify the top 5% of functions. - Measure the baseline: Use
timeitorperf_counter()to establish current time. - Optimize in isolation: Speed up only the hotspots, not every function.
- Warm up compiled code: For Cython/Numba, call the function once before benchmarking.
- Measure the improvement: Re-run the same benchmark, calculate speedup.
- Document results: Save before/after times and speedup ratios in your report.
Real-World Example: Profiling a Data Pipeline
import numpy as np
import cProfile
import pstats
from io import StringIO
def load_data():
return np.random.random((1000, 1000))
def normalize(data):
return (data - data.mean()) / data.std()
def apply_filter(data):
kernel = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]])
# Simulated 2D convolution
result = np.zeros_like(data)
for i in range(1, data.shape[0] - 1):
for j in range(1, data.shape[1] - 1):
result[i, j] = np.sum(data[i-1:i+2, j-1:j+2] * kernel)
return result
def pipeline():
data = load_data()
data = normalize(data)
data = apply_filter(data)
return data
pr = cProfile.Profile()
pr.enable()
for _ in range(5):
pipeline()
pr.disable()
s = StringIO()
ps = pstats.Stats(pr, stream=s).sort_stats('cumulative')
ps.print_stats(10)
print(s.getvalue())
Output shows apply_filter as the bottleneck—candidate for Cython acceleration.
Key Takeaways
- Use
cProfileto identify which functions consume the most time; optimize those, not random code. - Establish a baseline with
timeitorperf_counter()before and after optimization. line_profilershows which lines are hot;memory_profilerfinds memory leaks.- Warm up Cython/Numba code (make one dummy call) before benchmarking to exclude compilation.
- Document speedup as a ratio (
10× faster) or percentage (90% reduction).
Frequently Asked Questions
Is cProfile accurate on modern CPUs with cache effects?
cProfile measures wall-clock time, which includes cache hits/misses. It's accurate for hotspots but not for cycle-level analysis. For that, use perf (Linux) or platform-specific profilers.
How do I profile code running in a thread?
cProfile doesn't thread. Use py-spy or Intel VTune for multithreaded profiling.
Can I profile Numba JIT compilation time separately from execution?
Yes. Measure warm-up time (first call) separately:
start = time.perf_counter()
result = numba_func(x) # Includes compilation
end = time.perf_counter()
start = time.perf_counter()
result = numba_func(x) # JIT cached
end = time.perf_counter()
What's a good speedup target?
It depends on context. For I/O-bound code, 1.5× is a win. For CPU-bound code, aim for 5–10×. Machine learning often targets 100×.
Should I always use profiling-enabled builds?
No. Profiling overhead (5–50%) skews measurements. Use profiling builds only during development; remove cython: profile=True for production.