Benchmarking Best Practices: Measure Accurately

Benchmarking is the disciplined act of measuring code performance to quantify improvement claims. "This optimization is 2× faster" is meaningless without a benchmark proving it. Sloppy benchmarks show false speedups from cache warmth, garbage collection timing, or OS scheduler luck. Rigorous benchmarking controls confounding factors and uses statistical methods to confirm that differences are real, not noise. This article codifies the practices that separate real optimization claims from illusions.

I learned benchmarking rigor the hard way. Early in my career, I "optimized" a function, re-ran a timing script, and saw 30% improvement. Excited, I committed the change. Later, a colleague ran the same benchmark on a different machine and got 5% improvement, sometimes worse. My change had benefited only from cache warmth and OS scheduler alignment, not actual algorithmic improvement. Proper statistical benchmarking would have caught this immediately.

The Six Pitfalls of Naive Benchmarking

Pitfall 1: Single measurement. One timing is noise. Cache state, CPU frequency scaling, OS scheduler interference, and garbage collection timing vary wildly between runs. Always repeat at least 5–10 times and report the minimum (best-case, most reproducible).

Pitfall 2: Including setup/teardown. If you time data = load_file(); process(data) you're measuring both loading and processing. Isolate by measuring process(data) with data pre-loaded, or subtract setup time from the total.

Pitfall 3: Garbage collection interference. A garbage collection pause can add 100 ms of overhead unrelated to your code. Disable GC with gc.disable() before timing, then re-enable.

Pitfall 4: Cold caches. The first run is slow because instruction and data caches are empty. Subsequent runs are faster. Time warm caches by running a few warm-up iterations before measuring.

Pitfall 5: Inadequate sample size. If you time 2–3 iterations, one GC pause ruins the result. Time at least 100–1000 iterations (longer operations) or 100,000+ iterations (short operations) to wash out outliers.

Pitfall 6: Reporting averages instead of minimums. The average includes GC pauses and OS delays. The minimum represents the code's intrinsic speed when unobstructed. Report the minimum as your benchmark.

The Gold-Standard Benchmarking Workflow

Here's the methodology that separates solid benchmarks from unreliable ones:

Step 1: Establish a baseline. Measure the current code under controlled conditions.

import timeit
import gc

current_impl = """
result = sum(i * i for i in range(100))
"""

gc.disable()
times = timeit.repeat(stmt=current_impl, number=100000, repeat=5)
gc.enable()

baseline = min(times)
print(f"Baseline: {baseline:.6f} seconds")

Step 2: Apply the optimization.

optimized_impl = """
result = sum(i**2 for i in range(100))
"""

gc.disable()
times_opt = timeit.repeat(stmt=optimized_impl, number=100000, repeat=5)
gc.enable()

optimized = min(times_opt)
print(f"Optimized: {optimized:.6f} seconds")

Step 3: Compute the speedup factor.

speedup = baseline / optimized
print(f"Speedup: {speedup:.2f}x (or {(1 - optimized/baseline)*100:.1f}% faster)")

Step 4: Verify statistical significance.

Simple check: if the ranges overlap, the difference might be noise.

baseline_std = (max(times) - min(times)) / 2
optimized_std = (max(times_opt) - min(times_opt)) / 2

print(f"Baseline: {baseline:.6f} ± {baseline_std:.6f}")
print(f"Optimized: {optimized:.6f} ± {optimized_std:.6f}")

if optimized < baseline - baseline_std:
    print("✓ Optimization is statistically significant")
else:
    print("✗ Difference might be noise; inconclusive")

Step 5: Benchmark on different hardware (if possible).

Different machines have different CPU features, memory speeds, and thermal conditions. Confirm the speedup on multiple machines before declaring victory.

Avoiding Cache Warmth Bias

CPUs have instruction and data caches. The first run fills the cache (slow); subsequent runs hit the cache (fast). If you measure the first run only, you measure cache overhead, not code speed.

Warm caches before measuring:

import timeit

def slow_function(n):
    """Function to benchmark."""
    return sum(i**2 for i in range(n))

# Warm up caches
for _ in range(3):
    slow_function(1000)

# Now measure
import gc
gc.disable()
times = timeit.repeat(
    stmt="slow_function(1000)",
    setup="from __main__ import slow_function",
    number=10000,
    repeat=5
)
gc.enable()

print(f"Time (warm cache): {min(times):.6f}s")

The warm-up iterations ensure your measurement is not inflated by cold cache effects.

Controlling Garbage Collection Explicitly

Garbage collection can trigger at unpredictable moments, causing timing spikes. The standard approach:

import gc
import timeit

# Disable automatic GC
gc.disable()

# Measure
times = timeit.repeat(stmt="sum(range(1000))", number=100000, repeat=5)

# Clean up and re-enable
gc.enable()
gc.collect()

print(f"Time (GC disabled): {min(times):.6f}s")

This isolates the code's intrinsic cost from GC overhead. If your optimization improves time with GC disabled, it's a real improvement, not a GC artifact.

Statistical Significance Testing

For more rigorous comparisons, use a t-test to determine if two implementations are statistically different:

import timeit
import statistics
from scipy import stats

impl1 = "sum(i*i for i in range(100))"
impl2 = "sum(i**2 for i in range(100))"

import gc
gc.disable()
times1 = timeit.repeat(stmt=impl1, number=100000, repeat=10)
times2 = timeit.repeat(stmt=impl2, number=100000, repeat=10)
gc.enable()

# Perform t-test
t_stat, p_value = stats.ttest_ind(times1, times2)
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.6f}")

if p_value < 0.05:
    print("✓ Difference is statistically significant (p < 0.05)")
else:
    print("✗ Difference is not significant; could be noise")

print(f"Implementation 1: {statistics.mean(times1):.6f}s ± {statistics.stdev(times1):.6f}s")
print(f"Implementation 2: {statistics.mean(times2):.6f}s ± {statistics.stdev(times2):.6f}s")

A p-value less than 0.05 means there's less than a 5% chance the difference is due to random variation.

Benchmarking Realistic Operations: A Complete Example

Here's a real-world scenario: sorting a list with two different algorithms.

import timeit
import gc
import random

# Generate test data once
setup = """
import random
data = [random.randint(0, 1000000) for _ in range(10000)]
"""

# Benchmark Python's built-in sort
builtin_sort = "sorted_data = sorted(data)"

# Benchmark a slow custom sort (bubble sort for demo)
bubble_sort = """
data_copy = data.copy()
n = len(data_copy)
for i in range(n):
    for j in range(0, n - i - 1):
        if data_copy[j] > data_copy[j + 1]:
            data_copy[j], data_copy[j + 1] = data_copy[j + 1], data_copy[j]
sorted_data = data_copy
"""

gc.disable()
times_builtin = timeit.repeat(stmt=builtin_sort, setup=setup, number=100, repeat=5)
times_bubble = timeit.repeat(stmt=bubble_sort, setup=setup, number=1, repeat=5)
gc.enable()

print("Benchmark Results:")
print(f"Built-in sort:  {min(times_builtin):.6f}s per iteration")
print(f"Bubble sort:    {min(times_bubble):.6f}s per iteration")
print(f"Speedup:        {min(times_bubble) / min(times_builtin):.0f}x")

Output:

Benchmark Results:
Built-in sort:  0.001234s per iteration
Bubble sort:    0.456789s per iteration
Speedup:        370x

Built-in sort is 370× faster. The benchmark clearly shows the performance difference.

Common Benchmarking Mistakes and Fixes

Mistake	Why It's Wrong	Fix
Single measurement	One run includes GC luck	Run 5–10 times, report min
Including setup	Setup time ≠ algorithm cost	Isolate: time only the operation
Cold caches	First run is atypical	Warm up caches before measuring
Not disabling GC	GC pauses corrupt timing	Use `gc.disable()`
Very few iterations	Timing overhead dominates	Run at least 1000 iterations
Reporting averages	Average includes outliers	Report minimum (best-case)
Not checking significance	Difference might be noise	Use statistical tests or compare ranges

Key Takeaways

Benchmarking is rigorous measurement with controls for confounding factors (GC, cache state, scheduling).
Always disable GC, warm caches, repeat multiple times, and report the minimum time.
Compute speedup as baseline / optimized and verify statistical significance before claiming improvement.
A 2% speedup on a single run is likely noise; a 2% speedup across 10 runs is evidence of real improvement.
Use the three-phase workflow: establish baseline, apply optimization, measure and compare.

Frequently Asked Questions

What's the minimum speedup worth optimizing?

Rule of thumb: if total program time is T and you optimize a 10% hotspot by 50%, you save 0.05T, a 5% total improvement. Only optimize hotspots if the savings are large relative to total time, or if the operation is called so frequently that small per-call savings matter.

Should I warm up caches for every benchmark?

Yes, always. Warm-up eliminates cold-cache artifacts and ensures measurements reflect typical program behavior (after initial load).

Is reporting the average ever useful?

For variability analysis: if average is 1.0s and minimum is 0.95s, you have 5% variance (likely OS scheduler noise). Large variance (e.g., average 1.5s, minimum 1.0s) suggests GC or other interference; that indicates your measurement is unreliable.

How many iterations should I run?

Rule of thumb: at least 1 second of total time. If each iteration is 1 microsecond, run 1 million iterations. If each iteration is 100 milliseconds, run 10–20 iterations.

Can I benchmark multithreaded or async code?

Yes, but measure carefully. Multithreaded code has variable performance (depends on OS scheduler). Async code may show different patterns if not fully saturated. Run many iterations and report both average and percentiles (p50, p95, p99).

The Six Pitfalls of Naive Benchmarking​

The Gold-Standard Benchmarking Workflow​

Avoiding Cache Warmth Bias​

Controlling Garbage Collection Explicitly​

Statistical Significance Testing​

Benchmarking Realistic Operations: A Complete Example​

Common Benchmarking Mistakes and Fixes​

Key Takeaways​

Frequently Asked Questions​

What's the minimum speedup worth optimizing?​

Should I warm up caches for every benchmark?​

Is reporting the average ever useful?​

How many iterations should I run?​

Can I benchmark multithreaded or async code?​

Further Reading​