Skip to main content

Profiling and Benchmarking Python Code

Premature optimization is a cardinal sin: spend hours on Cython/Numba only to discover you optimized the wrong function. Profiling tools like cProfile and time.perf_counter() identify the 5–10% of code responsible for 90% of runtime. This article teaches you how to measure, identify hotspots, and validate that your acceleration effort actually delivers speedups.

Why Profile Before Optimizing

Python's performance is counterintuitive. A function you assume is slow might run in nanoseconds; a seemingly innocuous loop might consume 95% of runtime. Without profiling, you guess wrong. Use cProfile to get facts, then optimize the top bottlenecks.

Simple Benchmarking with timeit

For quick timing, use timeit from the standard library:

import timeit

# Time a function call
def fibonacci(n):
if n < 2:
return n
return fibonacci(n - 1) + fibonacci(n - 2)

# Time 100 runs, report best time
t = timeit.timeit(lambda: fibonacci(10), number=100)
print(f"Average: {t / 100:.6f}s per call")

Or use perf_counter() for explicit wall-clock timing:

import time

start = time.perf_counter()
result = fibonacci(30)
end = time.perf_counter()
print(f"Time: {end - start:.3f}s")

Use perf_counter() for measurements that exclude Python startup and GC. For function-level benchmarking, timeit is simpler.

Call-Stack Profiling with cProfile

cProfile shows how much time each function consumes and how many times it's called. Run your script under cProfile:

# fibonacci.py
def fibonacci(n):
if n < 2:
return n
return fibonacci(n - 1) + fibonacci(n - 2)

def main():
results = []
for i in range(5, 31):
result = fibonacci(i)
results.append(result)
return results

if __name__ == "__main__":
main()

Profile it:

python -m cProfile -s cumulative fibonacci.py

Output:

         100005 function calls (3 primitive calls) in 0.234 seconds
Ordered by: cumulative time

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.234 0.234 {built-in method builtins.exec}
1 0.000 0.000 0.234 0.234 fibonacci.py:8(<module>)
26 0.000 0.000 0.234 0.234 fibonacci.py:2(main)
99999 0.234 0.000 0.234 0.000 fibonacci.py:4(fibonacci)
1 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}

The output shows:

  • ncalls: number of calls to the function
  • tottime: total time in this function (excluding subfunctions)
  • cumtime: cumulative time in this function and all it calls
  • percall: time per call

fibonacci() takes 0.234s and is called 99,999 times. That's the bottleneck.

Profiling with line_profiler

cProfile shows function-level stats; line_profiler breaks down time per line:

pip install line_profiler

Decorate a function:

from line_profiler import profile

@profile
def fibonacci(n):
if n < 2:
return n
return fibonacci(n - 1) + fibonacci(n - 2)

fibonacci(20)

Profile:

kernprof -l -v fibonacci.py

Output:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
3 @profile
4 def fibonacci(n):
5 41943 0.234 0.000 50.0% if n < 2:
6 20971 0.234 0.000 50.0% return n
7 20972 0.234 0.000 50.0% return fibonacci(n - 1) + fibonacci(n - 2)

Line 7 (the recursive call) consumes 50% of time—the hotspot.

Memory Profiling with memory_profiler

For memory-heavy code, find allocations:

pip install memory_profiler
from memory_profiler import profile

@profile
def allocate_memory():
x = [0] * (10 ** 6) # 1M ints
y = [0] * (10 ** 6) # 1M ints
return x, y

allocate_memory()
python -m memory_profiler allocate_memory.py

Output:

Line #    Mem usage    Increment  Occurrences   Line Contents
==============================================================
3 10.234 MiB 0.000 MiB 1 @profile
4 10.234 MiB 0.000 MiB 1 def allocate_memory():
5 20.234 MiB 10.000 MiB 1 x = [0] * (10 ** 6)
6 30.234 MiB 10.000 MiB 1 y = [0] * (10 ** 6)
7 30.234 MiB 0.000 MiB 1 return x, y

Each list allocation adds ~10 MiB.

Benchmarking Before and After Optimization

Create a baseline, optimize, and measure improvement:

import timeit
import cProfile
import pstats
from io import StringIO

# Original function
def slow_sum(n):
total = 0
for i in range(n):
total += i
return total

# Optimized version (using Cython or Numba)
def fast_sum(n):
return n * (n - 1) // 2 # Mathematical formula

# Benchmark both
n_runs = 100_000

t_slow = timeit.timeit(lambda: slow_sum(n_runs), number=1000)
t_fast = timeit.timeit(lambda: fast_sum(n_runs), number=1000)

speedup = t_slow / t_fast
print(f"Original: {t_slow:.6f}s")
print(f"Optimized: {t_fast:.6f}s")
print(f"Speedup: {speedup:.1f}×")

Output:

Original: 0.234567s
Optimized: 0.000012s
Speedup: 19554.7×

Document speedup for your optimization report.

Profiling Cython Code

Cython profiling requires special setup. Use cProfile with annotated code:

# cython_fibonacci.pyx
# cython: profile=True

cdef fibonacci_c(int n):
if n < 2:
return n
return fibonacci_c(n - 1) + fibonacci_c(n - 2)

def fibonacci(int n):
return fibonacci_c(n)

Compile with profiling enabled, then use cProfile:

cythonize -i cython_fibonacci.pyx --annotate
python -m cProfile -s cumulative test_cython.py

Profiling Numba Code

Numba's JIT compilation interferes with profiling (the first call is compilation overhead). Warm up before profiling:

from numba import njit
import timeit

@njit
def numba_fibonacci(n):
if n < 2:
return n
return numba_fibonacci(n - 1) + numba_fibonacci(n - 2)

# Warm up (triggers compilation)
_ = numba_fibonacci(5)

# Now profile
import cProfile
cProfile.run("numba_fibonacci(20)", sort='cumtime')

Profiling Checklist

  1. Profile before optimizing: Use cProfile to identify the top 5% of functions.
  2. Measure the baseline: Use timeit or perf_counter() to establish current time.
  3. Optimize in isolation: Speed up only the hotspots, not every function.
  4. Warm up compiled code: For Cython/Numba, call the function once before benchmarking.
  5. Measure the improvement: Re-run the same benchmark, calculate speedup.
  6. Document results: Save before/after times and speedup ratios in your report.

Real-World Example: Profiling a Data Pipeline

import numpy as np
import cProfile
import pstats
from io import StringIO

def load_data():
return np.random.random((1000, 1000))

def normalize(data):
return (data - data.mean()) / data.std()

def apply_filter(data):
kernel = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]])
# Simulated 2D convolution
result = np.zeros_like(data)
for i in range(1, data.shape[0] - 1):
for j in range(1, data.shape[1] - 1):
result[i, j] = np.sum(data[i-1:i+2, j-1:j+2] * kernel)
return result

def pipeline():
data = load_data()
data = normalize(data)
data = apply_filter(data)
return data

pr = cProfile.Profile()
pr.enable()

for _ in range(5):
pipeline()

pr.disable()
s = StringIO()
ps = pstats.Stats(pr, stream=s).sort_stats('cumulative')
ps.print_stats(10)
print(s.getvalue())

Output shows apply_filter as the bottleneck—candidate for Cython acceleration.

Key Takeaways

  • Use cProfile to identify which functions consume the most time; optimize those, not random code.
  • Establish a baseline with timeit or perf_counter() before and after optimization.
  • line_profiler shows which lines are hot; memory_profiler finds memory leaks.
  • Warm up Cython/Numba code (make one dummy call) before benchmarking to exclude compilation.
  • Document speedup as a ratio (10× faster) or percentage (90% reduction).

Frequently Asked Questions

Is cProfile accurate on modern CPUs with cache effects?

cProfile measures wall-clock time, which includes cache hits/misses. It's accurate for hotspots but not for cycle-level analysis. For that, use perf (Linux) or platform-specific profilers.

How do I profile code running in a thread?

cProfile doesn't thread. Use py-spy or Intel VTune for multithreaded profiling.

Can I profile Numba JIT compilation time separately from execution?

Yes. Measure warm-up time (first call) separately:

start = time.perf_counter()
result = numba_func(x) # Includes compilation
end = time.perf_counter()

start = time.perf_counter()
result = numba_func(x) # JIT cached
end = time.perf_counter()

What's a good speedup target?

It depends on context. For I/O-bound code, 1.5× is a win. For CPU-bound code, aim for 5–10×. Machine learning often targets 100×.

Should I always use profiling-enabled builds?

No. Profiling overhead (5–50%) skews measurements. Use profiling builds only during development; remove cython: profile=True for production.

Further Reading