PyO3 Performance Benchmark: Rust vs Pure Python
Performance is the raison d'être for PyO3. But how much faster is Rust, and when does the overhead of calling into native code outweigh the gain? This article presents benchmarks in four categories: tight CPU loops, array processing, string operations, and call overhead on trivial operations. You will learn to design benchmarks, use Python's timeit module, and interpret results. By the end, you will know when to reach for PyO3 and when to stick with pure Python.
The benchmark philosophy is simple: measure end-to-end wall-clock time, not micro-optimizations. The overhead of crossing the Python–Rust boundary (typically 100–1000 nanoseconds) is negligible for heavy computation but significant for trivial operations. Your job is to find the crossover point for your use case.
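One way to reason about the crossover point is a simple linear cost model. The sketch below assumes a fixed per-call overhead and a constant cost per operation on each side; the function name and the input numbers are illustrative assumptions, not measurements.

```python
def crossover_ops(call_overhead_ns: float, py_ns_per_op: float, rust_ns_per_op: float) -> float:
    """Operations per call at which the native version breaks even.

    Assumed model (not a measurement):
        python_time(n) = n * py_ns_per_op
        native_time(n) = call_overhead_ns + n * rust_ns_per_op
    """
    saved_per_op = py_ns_per_op - rust_ns_per_op
    if saved_per_op <= 0:
        return float("inf")  # native code never wins under this model
    return call_overhead_ns / saved_per_op


# Hypothetical inputs: 500 ns call overhead, 5 ns/op in Python, 0.5 ns/op in Rust.
n_star = crossover_ops(500.0, 5.0, 0.5)
print(f"break-even at about {n_star:.0f} operations per call")
```

Below that number of operations per call the overhead dominates; above it the native version wins. Replace the inputs with values measured on your own workload.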
Benchmark 1: Tight CPU Loop—Fibonacci Sequence
The Fibonacci function is a classic CPU-bound benchmark. Here is the Rust version:
use pyo3::prelude::*;

#[pyfunction]
fn fibonacci_rust(n: u32) -> u64 {
    match n {
        0 => 0,
        1 => 1,
        _ => {
            // Iterative version; the result overflows u64 for n > 93.
            let mut a = 0u64;
            let mut b = 1u64;
            for _ in 2..=n {
                let c = a + b;
                a = b;
                b = c;
            }
            b
        }
    }
}

#[pymodule]
fn fib_ext(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(fibonacci_rust, m)?)?;
    Ok(())
}
And the pure-Python version:
def fibonacci_python(n):
    if n <= 1:
        return n
    a, b = 0, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b
Benchmark code:
import timeit
from fib_ext import fibonacci_rust
# Warm-up runs (early calls can be slower while caches and the interpreter warm up)
_ = [fibonacci_python(30) for _ in range(10)]
_ = [fibonacci_rust(30) for _ in range(10)]
# Measure Python
py_time = timeit.timeit(lambda: fibonacci_python(30), number=100_000)
print(f"Python: {py_time:.3f}s for 100,000 runs")
# Measure Rust
rust_time = timeit.timeit(lambda: fibonacci_rust(30), number=100_000)
print(f"Rust: {rust_time:.3f}s for 100,000 runs")
print(f"Speedup: {py_time / rust_time:.1f}×")
Typical results (on a 2026-era 4-core CPU):
- Python: 2.45s
- Rust: 0.18s
- Speedup: 13.6×
The Rust implementation is faster because it compiles to native machine code with optimizations (inlining, loop unrolling). Python's interpreter has overhead on every operation.
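The totals above are easier to compare as per-call times. The helper below is a small illustrative sketch (its name is an assumption) that converts a timeit total into microseconds per call, using the figures reported above.

```python
def per_call_us(total_seconds: float, number: int) -> float:
    """Convert a timeit total into microseconds per call."""
    return total_seconds / number * 1e6


# Totals reported above for 100,000 runs each.
py_us = per_call_us(2.45, 100_000)
rust_us = per_call_us(0.18, 100_000)
speedup = py_us / rust_us

print(f"Python: {py_us:.1f} µs/call, Rust: {rust_us:.1f} µs/call, speedup {speedup:.1f}×")
# prints: Python: 24.5 µs/call, Rust: 1.8 µs/call, speedup 13.6×
```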
Benchmark 2: Array Processing—Dot Product
Computing the dot product of two large arrays is limited mostly by memory bandwidth rather than arithmetic:
use pyo3::prelude::*;
use numpy::PyReadonlyArray1;

#[pyfunction]
fn dot_product_rust(a: PyReadonlyArray1<f64>, b: PyReadonlyArray1<f64>) -> f64 {
    let a_arr = a.as_array();
    let b_arr = b.as_array();
    a_arr.iter().zip(b_arr.iter()).map(|(x, y)| x * y).sum()
}

#[pymodule]
fn dot_ext(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(dot_product_rust, m)?)?;
    Ok(())
}
Python with NumPy:
import numpy as np

def dot_product_numpy(a, b):
    return np.dot(a, b)

def dot_product_python(a, b):
    return sum(x * y for x, y in zip(a, b))
Benchmark (10 million elements):
import timeit
import numpy as np
from dot_ext import dot_product_rust
arr_a = np.random.randn(10_000_000)
arr_b = np.random.randn(10_000_000)
arr_a_list = arr_a.tolist()
arr_b_list = arr_b.tolist()
# Pure Python
py_time = timeit.timeit(
    lambda: dot_product_python(arr_a_list, arr_b_list), number=10
)
print(f"Python list: {py_time:.3f}s")
# NumPy (already optimized)
np_time = timeit.timeit(lambda: dot_product_numpy(arr_a, arr_b), number=10)
print(f"NumPy: {np_time:.3f}s")
# PyO3 + Rust
rust_time = timeit.timeit(lambda: dot_product_rust(arr_a, arr_b), number=10)
print(f"PyO3 Rust: {rust_time:.3f}s")
Typical results:
- Pure Python: 8.2s
- NumPy: 0.05s
- PyO3 Rust: 0.04s
For large arrays, NumPy and PyO3 are comparable because both run compiled code. For small arrays (fewer than 1,000 elements), the function-call overhead can dominate.
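To see where call overhead starts to dominate, it helps to sweep input sizes and report time per element. The sketch below uses only the standard library; dot_map is a pure-Python stand-in for a faster implementation (an assumption for illustration), and you would swap in the NumPy or PyO3 function in a real run.

```python
import operator
import random
import timeit


def dot_generator(a, b):
    """Baseline: generator expression over paired elements."""
    return sum(x * y for x, y in zip(a, b))


def dot_map(a, b):
    """Stand-in for a faster implementation (hypothetical; swap in the native call)."""
    return sum(map(operator.mul, a, b))


def sweep(funcs, sizes, number=20):
    """Return {(function name, size): nanoseconds per element}."""
    results = {}
    for size in sizes:
        a = [random.random() for _ in range(size)]
        b = [random.random() for _ in range(size)]
        for func in funcs:
            total = timeit.timeit(lambda: func(a, b), number=number)
            results[(func.__name__, size)] = total / number / size * 1e9
    return results


if __name__ == "__main__":
    table = sweep([dot_generator, dot_map], [10, 1_000, 100_000])
    for (name, size), ns in table.items():
        print(f"{name:>14} n={size:>7}: {ns:8.1f} ns/element")
```

If the per-element time rises sharply at small sizes, fixed per-call costs are dominating at that size.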
Benchmark 3: String Operations—Regex Matching
PyO3 can pay off on text processing: Rust's regex crate guarantees linear-time matching, while Python's re module uses a backtracking engine:
use pyo3::prelude::*;
use regex::Regex;

#[pyfunction]
fn count_emails_rust(text: String) -> usize {
    // The pattern is compiled on every call, so that cost is part of the timing.
    let re = Regex::new(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}").unwrap();
    re.find_iter(&text).count()
}

#[pymodule]
fn email_ext(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(count_emails_rust, m)?)?;
    Ok(())
}
Python version (using re):
import re

def count_emails_python(text):
    pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
    return len(re.findall(pattern, text))
Benchmark on a 1 MB text file:
import timeit
import re
from email_ext import count_emails_rust
with open("sample.txt") as f:
    text = f.read()
py_time = timeit.timeit(lambda: count_emails_python(text), number=100)
rust_time = timeit.timeit(lambda: count_emails_rust(text), number=100)
print(f"Python re: {py_time:.3f}s")
print(f"Rust regex: {rust_time:.3f}s")
print(f"Speedup: {py_time / rust_time:.1f}×")
Typical results:
- Python: 3.2s
- Rust: 0.4s
- Speedup: 8×
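Before attributing the whole gap to the language, it is worth checking that the Python baseline is reasonably tuned. The sketch below is one possible variant, not the article's benchmark: it compiles the same pattern once and counts matches with finditer, so no intermediate list is built. The function name and the sample string are illustrative assumptions.

```python
import re

# Same pattern as above, compiled once at import time.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def count_emails_compiled(text: str) -> int:
    """Count matches without building the list that findall returns."""
    return sum(1 for _ in EMAIL_RE.finditer(text))


sample = "Contact alice@example.com or bob.smith@mail.example.org; not-an-email@ is ignored."
print(count_emails_compiled(sample))  # prints 2
```

Timing this variant alongside count_emails_python shows how much of the difference comes from the baseline itself.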
Benchmark 4: Function Call Overhead—Trivial Operations
Not all operations benefit. This benchmark reveals the break-even point:
import timeit
from fib_ext import fibonacci_rust
# Time a single addition (CPython folds 1 + 1 at compile time, so this mostly measures the lambda call)
py_time = timeit.timeit(lambda: 1 + 1, number=10_000_000)
rust_time = timeit.timeit(lambda: fibonacci_rust(0), number=10_000_000)
print(f"Python 1+1: {py_time:.3f}s")
print(f"Rust fib(0): {rust_time:.3f}s (includes call overhead)")
Typical results:
- Python: 0.05s
- Rust: 0.5s
Rust loses because the per-call overhead (argument parsing, type conversion, error checking) exceeds the computation. Trivial operations should stay in Python.
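Both timings above also include the cost of calling the lambda wrapper. A rough way to isolate the per-call cost is to time the call as a statement and subtract the time of an empty loop. In the sketch below, the C-implemented builtin abs stands in for the extension function; this is an assumption for illustration, since fib_ext may not be installed, and the helper name is hypothetical.

```python
import timeit


def call_cost_ns(func, arg, number=200_000, repeat=5):
    """Estimate the cost of one call to func(arg) in nanoseconds.

    Takes the best of `repeat` runs for an empty loop and for the call,
    then attributes the difference to the call itself.
    """
    loop_only = min(timeit.repeat("pass", number=number, repeat=repeat))
    with_call = min(
        timeit.repeat("f(x)", globals={"f": func, "x": arg}, number=number, repeat=repeat)
    )
    return max(with_call - loop_only, 0.0) / number * 1e9


# abs is a stand-in for a native function; replace it with fibonacci_rust if available.
print(f"abs(0): about {call_cost_ns(abs, 0):.1f} ns per call")
```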
Benchmarking Best Practices
| Practice | Reason |
|---|---|
| Warm up before measuring | Cold caches and first-call setup skew initial runs. Run 10–100 iterations before starting the timer. |
| Use timeit, not time.time() | timeit runs the loop for you, uses a high-resolution timer, and disables garbage collection during timing for repeatability. |
| Measure end-to-end | Time the full function call, including argument conversion and return value marshaling. |
| Test multiple input sizes | A 10× speedup at 1 million elements may be 1× at 1,000 elements. |
| Run on your target hardware | Benchmarks on a laptop differ from production servers (CPU, memory, thermal throttling). |
| Repeat and summarize | Take the median (or minimum) of 3–5 runs; a single run can be affected by background processes. |
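Several of these practices can be combined in one small helper. The sketch below is a minimal illustration, assuming a function that is cheap enough to call thousands of times; the name bench and its defaults are assumptions, and fibonacci_python is repeated so the block runs on its own.

```python
import statistics
import timeit


def bench(func, *args, warmup=10, number=1_000, repeat=5):
    """Return (median, minimum) seconds per call over `repeat` timed batches."""
    for _ in range(warmup):  # warm up before the timer starts
        func(*args)
    totals = timeit.repeat(lambda: func(*args), number=number, repeat=repeat)
    per_call = [total / number for total in totals]
    return statistics.median(per_call), min(per_call)


def fibonacci_python(n):
    if n <= 1:
        return n
    a, b = 0, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b


median_s, best_s = bench(fibonacci_python, 30)
print(f"median {median_s * 1e6:.2f} µs/call, best {best_s * 1e6:.2f} µs/call")
```

Passing the extension function in place of fibonacci_python gives a like-for-like comparison, since both go through the same wrapper.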
Speedup Guidelines
| Task | Typical Speedup | When to Use PyO3 |
|---|---|---|
| Tight loops, arithmetic | 10–100× | Always |
| Array processing (with NumPy) | 1–10× | When iterating millions of elements |
| String operations | 5–20× | Regex, parsing, text processing |
| I/O-bound work | 1–2× | Rarely; I/O dominates; use asyncio instead |
| Function calls, trivial operations | 0.5–1× (slower) | Never; stay in Python |
Key Takeaways
- CPU-bound tight loops see 10–100× speedup; Rust compiles to native code.
- Array processing with PyO3 and ndarray matches NumPy performance (both compiled).
- String and text operations see 5–20× speedup due to Rust's regex engine.
- Function-call overhead (100–1000 ns) makes trivial operations slower in PyO3; avoid them.
- Measure on representative data and hardware; micro-benchmarks mislead.
- The break-even point for PyO3 is typically at 100–1000 operations per call.
Frequently Asked Questions
Why is PyO3 slower than pure Python for trivial operations?
Calling into native code incurs overhead: argument parsing, type conversion, exception checking, and returning control. For simple addition or a single dictionary lookup, this overhead exceeds the work, making Rust slower overall.
Can I speed up PyO3 by batching multiple operations?
Yes. Instead of calling Rust 1 million times with trivial work, call it once with a batch of 1 million items. This amortizes the call overhead across more computation.
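The two call patterns look like this. The functions below are pure-Python stand-ins for a per-item and a batched native function (hypothetical names), so the sketch shows the shape of the API rather than real extension timings.

```python
import timeit


def scale_one(x: float) -> float:
    """Stand-in for a native function called once per item (hypothetical)."""
    return x * 2.0


def scale_batch(xs: list) -> list:
    """Stand-in for a native function that receives the whole batch in one call."""
    return [x * 2.0 for x in xs]


data = [float(i) for i in range(100_000)]

# One boundary crossing per item versus one crossing for the whole batch.
per_item = timeit.timeit(lambda: [scale_one(x) for x in data], number=10)
batched = timeit.timeit(lambda: scale_batch(data), number=10)

print(f"per-item calls: {per_item:.3f}s, batched call: {batched:.3f}s")
```

With a real extension, the batched form pays the call overhead once instead of 100,000 times.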
How do I profile a PyO3 extension to find bottlenecks?
Use a native profiler on the Rust side: perf record on Linux, Instruments on macOS, or cargo flamegraph for visualizations. For end-to-end timing, use Python's cProfile.
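On the Python side, a cProfile run can be captured as text and sorted by cumulative time. The sketch below profiles the pure-Python Fibonacci function as a stand-in workload; the helper name profile_to_text is an assumption. Native functions show up in such a report as single built-in entries, so time spent inside them still needs a native profiler.

```python
import cProfile
import io
import pstats


def fibonacci_python(n):
    if n <= 1:
        return n
    a, b = 0, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b


def workload():
    """Stand-in workload; replace with code that calls the extension."""
    return [fibonacci_python(30) for _ in range(2_000)]


def profile_to_text(func, top=5):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


print(profile_to_text(workload))
```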
Is Rust's optimization level important for benchmarks?
Yes. Always build with --release, which enables optimizations (opt-level 3 by default, comparable to -O3 in C). Debug builds can be 10–100× slower because optimizations are disabled. maturin develop produces a debug build unless you pass --release (-r); if benchmarking during development, use maturin develop --release or install a wheel built with maturin build --release.