Subinterpreters vs Threads: When to Use Each
Choosing between threads, processes, and subinterpreters shapes your architecture. The GIL made this simple: threads for I/O, processes for CPU work. Free-threaded Python and subinterpreters add a third dimension. I've architected systems using all three; this article distills when each is optimal.
A thread is a lightweight unit of execution that shares its process's address space; on GIL-bound Python all threads in an interpreter also share one GIL, while free-threaded builds remove the GIL entirely. A process is a fully isolated OS-level instance with its own memory and kernel resources. A subinterpreter is a Python-level isolated environment within one process: it shares the OS address space but has its own namespace, its own modules, and (since Python 3.12) its own GIL. The trade-offs differ dramatically.
Quick Comparison Table
| Aspect | Threads | Processes | Subinterpreters |
|---|---|---|---|
| Overhead | <1 MB per thread | ~10-50 MB per process | ~1-5 MB per interpreter |
| Startup latency | <1 ms | 50-500 ms | <1 ms (after Python init) |
| Data sharing | Direct (shared memory) | Serialization (IPC) | Channels (serialization) |
| Isolation | No (shared heap) | Yes (separate OS process) | Yes (separate Python namespace) |
| GIL contention | On GIL-bound Python (bad) | None (separate GILs) | None (per-interpreter GILs) |
| Parallelism (CPU-bound, GIL) | Single-threaded (1x core) | N processes (Nx cores) | N interpreters (Nx cores, per-interpreter GIL) |
| Parallelism (I/O-bound) | N threads (N concurrent I/O) | N processes (overhead) | N interpreters (overhead) |
| Context switches | Fast (shared heap) | Slow (kernel, TLB flush) | Medium (Python level) |
| Debugging | Simple (shared debugger) | Complex (per-process gdb) | Medium (inspect per interpreter) |
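The "Data sharing" row is easiest to see in code. Here is a minimal sketch, assuming a hypothetical shared `counts` dictionary and `record` helper (neither is part of any library): threads mutate the same object directly, which is cheap, but read-modify-write updates need a lock.

```python
import threading

# Hypothetical shared state for illustration; every thread sees the same dict.
counts = {"hits": 0}
lock = threading.Lock()

def record(n):
    """Increment the shared counter n times under a lock."""
    for _ in range(n):
        with lock:  # the increment is a read-modify-write, so guard it
            counts["hits"] += 1

workers = [threading.Thread(target=record, args=(1000,)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(counts["hits"])
```

Processes and subinterpreters cannot share `counts` this way; they would have to pass messages instead, which is the serialization cost the table refers to.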
I/O-Bound Work: Use Threads
Threads shine for I/O-bound tasks: network requests, file reads, database queries. CPython releases the GIL around blocking I/O calls, so hundreds of threads can wait on I/O concurrently without blocking one another.
Example: web scraper fetching 100 URLs in parallel.
import requests
from concurrent.futures import ThreadPoolExecutor

urls = [
    f"https://jsonplaceholder.typicode.com/posts/{i}"
    for i in range(1, 101)
]

def fetch_url(url):
    """Fetch a URL and return its size in bytes (0 on failure)."""
    try:
        resp = requests.get(url, timeout=5)
        return len(resp.content)
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return 0

# ThreadPoolExecutor is the pythonic way
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(fetch_url, urls))

total_size = sum(results)
print(f"Fetched {len(urls)} URLs, total size: {total_size} bytes")
This works on any Python version, GIL or free-threaded. The GIL doesn't bottleneck because threads release it during network I/O. Ten threads waiting on network calls don't compete for the GIL.
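To see the overlap without touching the network, here is a rough sketch that uses `time.sleep` as a stand-in for a blocking call (`fake_io` is a hypothetical helper, not part of the scraper above). Like real socket waits, `time.sleep` releases the GIL.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(task_id):
    """Stand-in for a network call; sleeping releases the GIL while waiting."""
    time.sleep(0.2)
    return task_id

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(fake_io, range(10)))
elapsed = time.perf_counter() - start

# The ten 0.2 s waits overlap, so the total should be near 0.2 s, not 2 s.
print(f"{len(results)} tasks in {elapsed:.2f}s")
```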
Use threads for:
- Network I/O (HTTP, WebSocket, gRPC).
- File I/O (disk reads/writes).
- Database queries (psycopg2, PyMySQL—these release the GIL during blocking calls).
- Any async-like pattern built on concurrent.futures.ThreadPoolExecutor.
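For the last item, one common shape is submit-then-collect: hand work to the pool, then handle each result as soon as it finishes. This is a sketch only; `slow_lookup` and the `jobs` mapping are hypothetical stand-ins for whatever blocking call you have.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def slow_lookup(key, delay):
    """Hypothetical blocking call (e.g., a DB query); sleeps to simulate latency."""
    time.sleep(delay)
    return key.upper()

jobs = {"a": 0.3, "b": 0.1, "c": 0.2}
completed = {}

with ThreadPoolExecutor(max_workers=3) as executor:
    # Map each future back to its key so results can be matched up later.
    futures = {executor.submit(slow_lookup, k, d): k for k, d in jobs.items()}
    for future in as_completed(futures):
        completed[futures[future]] = future.result()

print(completed)
```

`as_completed` yields futures in finish order rather than submission order, which is handy when some calls are much slower than others.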
CPU-Bound Work: GIL-Bound Python = Processes; Free-Threaded = Threads or Subinterpreters
On GIL-bound Python, threads serialize on CPU work. You have two options: use multiprocessing, or accept single-threaded execution.
import multiprocessing
import time

def fibonacci(n):
    """Compute Fibonacci(n) recursively (CPU-bound)."""
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

if __name__ == "__main__":
    # Multiprocessing (one process per core)
    with multiprocessing.Pool(processes=4) as pool:
        start = time.time()
        results = pool.map(fibonacci, [35] * 4)
        elapsed = time.time() - start
    print(f"Multiprocessing (4 cores): {elapsed:.2f}s")
On free-threaded Python, threads parallelize CPU work. Choose based on overhead:
import threading
import time

def fibonacci(n):
    """CPU-bound computation."""
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

# Threads on free-threaded Python (no GIL)
start = time.time()
threads = []
for _ in range(4):
    t = threading.Thread(target=fibonacci, args=(35,))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
print(f"Threads (free-threaded, 4 cores): {elapsed:.2f}s")
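Whether those threads actually run in parallel depends on the build you are running. A small check, sketched under the assumption that `sys._is_gil_enabled()` is present only on Python 3.13+ (`gil_status` is a hypothetical helper name):

```python
import sys
import sysconfig

def gil_status():
    """Return (built_free_threaded, gil_enabled_now) for the running interpreter."""
    # Py_GIL_DISABLED is 1 on free-threaded builds; None or 0 elsewhere.
    built_free_threaded = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on 3.13+; older versions always have a GIL.
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = check() if check is not None else True
    return built_free_threaded, gil_enabled

built, enabled = gil_status()
print(f"free-threaded build: {built}, GIL currently enabled: {enabled}")
```

The two values can differ: a free-threaded build may re-enable the GIL at runtime, for example when an extension module that does not declare free-threading support is imported.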
Free-threaded threads offer:
- Lower overhead (~0.5 MB vs ~20 MB per process).
- Faster startup (~1 ms vs ~100 ms per process).
- Shared address space (useful for read-only data: code, pre-loaded models).
Subinterpreters offer the same benefits plus isolation:
# Assumes Python 3.14+, where subinterpreters are exposed as concurrent.interpreters
from concurrent import interpreters
import threading
import time

code_template = """
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

result = fibonacci(35)
"""

# Create 4 subinterpreters
interps = [interpreters.create() for _ in range(4)]

start = time.time()
threads = []
for interp in interps:
    # Each interpreter executes the code in its own OS thread, under its own GIL
    t = threading.Thread(target=interp.exec, args=(code_template,))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
print(f"Subinterpreters (4 cores): {elapsed:.2f}s")

# Clean up
for interp in interps:
    interp.close()
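The isolation itself can be checked directly. The sketch below assumes the `concurrent.interpreters` module from Python 3.14+ and degrades to "unavailable" elsewhere; `probe_isolation` and `secret` are hypothetical names for illustration.

```python
def probe_isolation():
    """Return 'unavailable', 'isolated', or 'leaked' (hypothetical helper)."""
    try:
        # Assumed to be importable only on Python 3.14+.
        from concurrent import interpreters
    except ImportError:
        return "unavailable"

    interp = interpreters.create()
    try:
        # `secret` lives only in the main interpreter's namespace, so this
        # lookup should raise NameError inside the subinterpreter.
        interp.exec("secret")
    except interpreters.ExecutionFailed:
        return "isolated"
    finally:
        interp.close()
    return "leaked"

secret = "main-interpreter-only"
print(probe_isolation())
```

An uncaught exception inside the subinterpreter surfaces in the caller as `ExecutionFailed`, which is why the failed name lookup reads as proof that the namespaces are separate.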
When to Use Subinterpreters
Subinterpreters are best when you need isolation and low overhead together. Scenarios:
- Multi-tenant systems: Each tenant's code runs in its own interpreter, so one tenant's Python-level bug or infinite loop doesn't stall the others. A crash in native code still takes down the whole process.
- Batch job pooling: Pre-create a pool of workers, pre-load heavy libraries (provided their C extensions support multiple interpreters), and dispatch tasks to avoid startup overhead.
- Sandboxed plugins: Run user-supplied code in an interpreter that you can tear down independently of the rest of the application.
- Mixed I/O and CPU work: A subinterpreter doing both network requests (I/O) and image processing (CPU) holds its own GIL, so it doesn't contend with other workers the way threads sharing one GIL would.
Example: plugin system.
# Assumes Python 3.14+, where subinterpreters are exposed as concurrent.interpreters
from concurrent import interpreters

def create_plugin_env(name):
    """Create an isolated environment for a plugin (name is only a label here)."""
    interp = interpreters.create()
    # Pre-load common utilities
    interp.exec("""
import json
import time
loaded_plugins = []
""")
    return interp

def load_plugin(interp, plugin_code):
    """Load and run user plugin code in isolation."""
    try:
        interp.exec(plugin_code)
        return True
    except interpreters.ExecutionFailed as e:
        # Raised in the caller when the plugin hits an uncaught exception
        print(f"Plugin error: {e}")
        return False

# User-supplied plugin (potentially buggy)
plugin = """
def process_data(data):
    return json.dumps({"input": data, "timestamp": time.time()})

result = process_data({"x": 1, "y": 2})
print(f"Plugin result: {result}")
"""

# Run in isolated interpreter
interp = create_plugin_env("user_plugin")
load_plugin(interp, plugin)
interp.close()
Decision Tree: Which to Use?
- Is the work I/O-bound (network, disk, DB)? → Use threads (simplest, lowest latency).
- Is the work CPU-bound?
  - Running on GIL-bound Python? → Use processes (multiprocessing.Pool).
  - Running on free-threaded Python?
    - Need isolation (plugins, multi-tenant)? → Use subinterpreters.
    - Just need parallelism? → Use threads (simpler, lower overhead).
- Do I need a worker pool (preload models, avoid startup)? → Use subinterpreters created with interpreters.create() at startup.
- Do I need to limit resource usage or kill long-running tasks? → Use processes (OS-level limits) or subinterpreters (Python-level limits, which are weaker).
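The first two branches of the tree are mechanical enough to write down as a function. This is only a sketch: `choose_concurrency` and its parameters are hypothetical names, and it deliberately ignores the worker-pool and resource-limit questions, which need human judgment.

```python
def choose_concurrency(io_bound, free_threaded=False, needs_isolation=False):
    """Hypothetical helper mirroring the I/O-bound and CPU-bound branches."""
    if io_bound:
        return "threads"
    if not free_threaded:
        return "processes"
    return "subinterpreters" if needs_isolation else "threads"

print(choose_concurrency(io_bound=True))
print(choose_concurrency(io_bound=False))
print(choose_concurrency(io_bound=False, free_threaded=True, needs_isolation=True))
```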
Memory and Startup Comparison
Indicative figures from a typical machine (actual numbers vary with hardware, OS, and Python version):
| Approach | Startup Time | Memory per Worker |
|---|---|---|
| 10 threads | <1 ms | 0.5 MB |
| 10 processes | ~500 ms | 25 MB |
| 10 subinterpreters | ~50 ms (if pre-created) | 3 MB |
Startup time dominates for short-lived tasks. Memory is a concern for large pools.
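These figures are worth reproducing on your own hardware. Below is a rough sketch for the thread and process columns; the helper names are hypothetical. It launches a fresh interpreter with `subprocess`, so it approximates a cold process start rather than what `multiprocessing` does with fork.

```python
import subprocess
import sys
import threading
import time

def time_thread_start():
    """Seconds to start and join one no-op thread."""
    start = time.perf_counter()
    t = threading.Thread(target=lambda: None)
    t.start()
    t.join()
    return time.perf_counter() - start

def time_process_start():
    """Seconds to launch a fresh Python process that exits immediately."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", "pass"], check=True)
    return time.perf_counter() - start

thread_s = time_thread_start()
process_s = time_process_start()
print(f"thread: {thread_s * 1000:.2f} ms, process: {process_s * 1000:.2f} ms")
```

A single sample is noisy; averaging several runs gives a steadier picture.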
Recommendation Summary
- Web servers (HTTP requests): Use asyncio or threads with ThreadPoolExecutor. Threads are simpler; asyncio is more scalable.
- Data processing (CPU-heavy): Use multiprocessing.Pool (GIL-bound) or threads (free-threaded).
- Real-time services (latency-critical): Prefer free-threaded threads over processes (500 ms startup overhead is too much).
- Batch jobs (throughput-critical): Multiprocessing or subinterpreter pools; startup cost is amortized.
- Sandboxing (security): Use subinterpreters if you control the code, separate processes if you don't trust it.
Key Takeaways
- Threads: low overhead, best for I/O; GIL limits CPU parallelism on GIL-bound Python.
- Processes: high overhead, true isolation, mandatory for CPU work on GIL-bound Python.
- Subinterpreters: medium overhead, isolation + shared address space, best for worker pools and for sandboxing code you control; each interpreter has its own GIL, so they run in parallel even without a free-threaded build.
- Choose based on workload type (I/O vs CPU), isolation needs, and startup latency requirements.
Frequently Asked Questions
Can I mix threads and processes in one app?
Yes. Use multiprocessing to spawn worker processes, and within each process, use threading for I/O-bound tasks. This is common in request handlers that both make network calls and spawn subprocess utilities.
Why would I use subinterpreters instead of processes?
Subinterpreters offer lower overhead (startup, memory) and shared address space (useful for caching, pre-loaded models). Use processes if you need OS-level isolation (security, resource limits) or if running untrusted code.
Does asyncio make threads obsolete?
No. asyncio is excellent for I/O-bound work in a single thread. Threads are better for blocking I/O that doesn't cooperate with asyncio (e.g., synchronous database drivers). Use whichever fits your libraries and code style.
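The two also combine well. Here is a minimal sketch of calling a blocking function from asyncio code with `asyncio.to_thread` (Python 3.9+); `blocking_query` is a hypothetical stand-in for a synchronous driver call.

```python
import asyncio
import time

def blocking_query(n):
    """Hypothetical synchronous driver call; sleeps to simulate a round trip."""
    time.sleep(0.1)
    return n * 2

async def main():
    # asyncio.to_thread runs each blocking call in a worker thread,
    # keeping the event loop free while the calls wait.
    tasks = [asyncio.to_thread(blocking_query, i) for i in range(5)]
    return await asyncio.gather(*tasks)

rows = asyncio.run(main())
print(rows)
```

`asyncio.gather` returns results in the order the awaitables were passed in, regardless of which thread finishes first.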
What if my code uses both CPU-bound and I/O-bound work?
Use threads for I/O; on free-threaded Python, add a thread pool for the CPU work. On GIL-bound Python, use a multiprocessing pool for the CPU work instead. Frameworks such as Ray can manage the worker processes for you.
Is free-threaded Python production-ready in 2026?
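Largely yes. Free-threaded builds were experimental in Python 3.13 and became officially supported, though still optional, in 3.14. Many major libraries (NumPy, pandas, PyTorch) publish free-threaded wheels for recent releases. Test in staging first; some C extensions may not have free-threaded support yet.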