Subinterpreters vs Threads: When to Use Each

Choosing between threads, processes, and subinterpreters shapes your architecture. The GIL made this simple: threads for I/O, processes for CPU work. Free-threaded Python and subinterpreters add a third dimension. I've architected systems using all three; this article distills when each is optimal.

A thread is a lightweight unit of execution that shares its process's address space. On GIL-bound Python, all threads in an interpreter contend for that interpreter's single GIL; free-threaded builds have no GIL at all. A process is a fully isolated OS-level instance with its own memory and kernel resources. A subinterpreter is a Python-level isolated environment within one process, sharing the OS address space but with its own GIL (since Python 3.12) and its own modules and namespaces. The trade-offs differ dramatically.

Quick Comparison Table

| Aspect | Threads | Processes | Subinterpreters |
|---|---|---|---|
| Overhead | <1 MB per thread | ~10-50 MB per process | ~1-5 MB per interpreter |
| Startup latency | <1 ms | 50-500 ms | <1 ms (after Python init) |
| Data sharing | Direct (shared memory) | Serialization (IPC) | Channels (serialization) |
| Isolation | No (shared heap) | Yes (separate OS process) | Yes (separate Python namespace, same process) |
| GIL contention | Yes on GIL-bound Python (bad); none on free-threaded | None (separate GILs) | None (per-interpreter GILs) |
| Parallelism (CPU-bound, GIL) | Single-threaded (1x core) | N processes (Nx cores) | N interpreters (Nx cores, per-interpreter GILs) |
| Parallelism (I/O-bound) | N threads (N concurrent I/O) | N processes (overhead) | N interpreters (overhead) |
| Context switches | Fast (shared heap) | Slow (kernel, TLB flush) | Medium (Python level) |
| Debugging | Simple (shared debugger) | Complex (per-process gdb) | Medium (inspect per interpreter) |

I/O-Bound Work: Use Threads

Threads shine for I/O-bound tasks: network requests, file reads, database queries. CPython releases the GIL while a thread is blocked in an I/O call, so hundreds of threads can wait on I/O concurrently without blocking each other.

Example: web scraper fetching 100 URLs in parallel.

import requests
from concurrent.futures import ThreadPoolExecutor

urls = [
    f"https://jsonplaceholder.typicode.com/posts/{i}"
    for i in range(1, 101)
]

def fetch_url(url):
    """Fetch a URL and return its size."""
    try:
        resp = requests.get(url, timeout=5)
        return len(resp.content)
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return 0

# ThreadPoolExecutor is the pythonic way
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(fetch_url, urls))

total_size = sum(results)
print(f"Fetched {len(urls)} URLs, total size: {total_size} bytes")

This works on any Python version, GIL or free-threaded. The GIL doesn't bottleneck because threads release it during network I/O. Ten threads waiting on network calls don't compete for the GIL.
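
To see the effect without touching the network, here is a minimal sketch that uses time.sleep as a stand-in for a blocking call (sleep also releases the GIL). The helper names fake_io and timed are illustrative, not part of any library.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(delay):
    """Stand-in for a blocking network call; time.sleep releases the GIL."""
    time.sleep(delay)
    return delay

def timed(fn):
    """Return the wall-clock seconds taken by fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def run_sequential():
    for _ in range(10):
        fake_io(0.05)

def run_threaded():
    with ThreadPoolExecutor(max_workers=10) as pool:
        list(pool.map(fake_io, [0.05] * 10))

sequential_s = timed(run_sequential)
threaded_s = timed(run_threaded)
print(f"sequential: {sequential_s:.2f}s, threaded: {threaded_s:.2f}s")
```

The sequential run pays for every wait in turn, while the threaded run overlaps them, so it should finish in roughly the time of a single wait.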

Use threads for:

  • Network I/O (HTTP, WebSocket, gRPC).
  • File I/O (disk reads/writes).
  • Database queries (psycopg2, PyMySQL—these release the GIL during blocking calls).
  • Any async-like pattern with concurrent.futures.ThreadPoolExecutor.

CPU-Bound Work: GIL-Bound Python = Processes; Free-Threaded = Threads or Subinterpreters

On GIL-bound Python, threads serialize on CPU work because only one thread can execute Python bytecode at a time. You have two options: use multiprocessing, or accept single-threaded execution.

import multiprocessing
import time

def fibonacci(n):
    """Compute Fibonacci(n) recursively (CPU-bound)."""
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

if __name__ == "__main__":
    # Multiprocessing (one process per core)
    with multiprocessing.Pool(processes=4) as pool:
        start = time.time()
        results = pool.map(fibonacci, [35] * 4)
        elapsed = time.time() - start
        print(f"Multiprocessing (4 cores): {elapsed:.2f}s")

On free-threaded Python, threads parallelize CPU work. Choose based on overhead:

import threading
import time

def fibonacci(n):
    """CPU-bound computation."""
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

# Threads on free-threaded Python (no GIL)
start = time.time()
threads = []
for _ in range(4):
    t = threading.Thread(target=fibonacci, args=(35,))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
print(f"Threads (free-threaded, 4 cores): {elapsed:.2f}s")

Free-threaded threads offer:

  • Lower overhead (~0.5 MB vs ~20 MB per process).
  • Faster startup (~1 ms vs ~100 ms per process).
  • Shared address space (useful for read-only data: code, pre-loaded models).
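
As a rough illustration of the shared address space, the sketch below has several threads read one pre-loaded structure without copying it. LOOKUP is a hypothetical stand-in for a model or lookup table, and the workers only read it.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a pre-loaded model or lookup table.
LOOKUP = {i: i * i for i in range(100_000)}

def worker(keys):
    """Read from the shared dict; return its id to show no copy was made."""
    return id(LOOKUP), sum(LOOKUP[k] for k in keys)

chunks = [range(0, 10), range(10, 20), range(20, 30), range(30, 40)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(worker, chunks))

# Every worker saw the very same object that the main thread created.
same_object = all(obj_id == id(LOOKUP) for obj_id, _ in results)
total = sum(partial for _, partial in results)
print(f"same object in every thread: {same_object}, total: {total}")
```

With processes, the same table would have to be pickled to each worker or re-loaded there. Once threads start writing to shared data, locking becomes your responsibility.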

Subinterpreters keep most of the startup and memory advantage and add isolation, at the cost of no longer sharing Python objects directly between workers:

import threading
import time
from concurrent import interpreters  # stdlib since Python 3.14 (PEP 734)

code_template = """
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

result = fibonacci(35)
"""

# Create 4 subinterpreters
interps = [interpreters.create() for _ in range(4)]

start = time.time()
threads = []
for interp in interps:
    # exec() runs the code in whichever OS thread calls it
    t = threading.Thread(target=interp.exec, args=(code_template,))
    threads.append(t)
    t.start()

for t in threads:
    t.join()
elapsed = time.time() - start
print(f"Subinterpreters (4 cores): {elapsed:.2f}s")

# Clean up
for interp in interps:
    interp.close()
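
The example above leaves each result inside its own interpreter. One way to get values back out is a cross-interpreter queue; the following is a minimal sketch, assuming Python 3.14+ where concurrent.interpreters is available. The names WORKER_CODE and fib_in_subinterpreters are illustrative, and the import guard only keeps the snippet from failing on older versions.

```python
import threading

try:
    from concurrent import interpreters  # stdlib in Python 3.14+ (PEP 734)
except ImportError:
    interpreters = None  # older Python: the sketch below is skipped

WORKER_CODE = """
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

# `results` and `n` are injected by prepare_main() before this code runs.
results.put(fibonacci(n))
"""

def fib_in_subinterpreters(values):
    """Compute fibonacci(v) for each v in its own subinterpreter."""
    results = interpreters.create_queue()
    interps, threads = [], []
    for n in values:
        interp = interpreters.create()
        # Bind the queue and the argument into the interpreter's __main__.
        interp.prepare_main(results=results, n=n)
        t = threading.Thread(target=interp.exec, args=(WORKER_CODE,))
        interps.append(interp)
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    # Collect before closing the interpreters that produced the items.
    collected = sorted(results.get() for _ in values)
    for interp in interps:
        interp.close()
    return collected

if interpreters is not None:
    print(fib_in_subinterpreters([10, 15, 20]))
```

Results arrive in completion order, which is why the sketch sorts them. Only shareable or picklable objects can cross the queue, which is the serialization cost the comparison table refers to.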

When to Use Subinterpreters

Subinterpreters are best when you need isolation + low overhead. Scenarios:

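  1. Multi-tenant systems: Each tenant's code runs in its own interpreter, so tenants cannot see each other's modules or globals, and an uncaught exception in one stays contained. The isolation is at the Python level only: a crash in native code still takes down the whole process, and there is no OS-level way to kill a single runaway interpreter.

  2. Batch job pooling: Pre-create a pool of workers, pre-load heavy libraries (NumPy, TensorFlow) where they support running in subinterpreters, and dispatch tasks to avoid startup overhead. Not every C extension supports this yet.

  3. Sandboxed plugins: Run user-supplied code in an interpreter that you can tear down afterwards. This isolates state, not privileges, so it is not a security boundary for untrusted code.

  4. Mixed I/O and CPU work: A thread running in its own subinterpreter can do both network requests (I/O) and image processing (CPU) without contending for a GIL shared with other workers, as ordinary threads on GIL-bound Python would.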

Example: plugin system.

from concurrent import interpreters  # stdlib since Python 3.14 (PEP 734)

def create_plugin_env(name):
    """Create an isolated environment for a plugin."""
    interp = interpreters.create()

    # Pre-load common utilities
    interp.exec("""
import json
import time

loaded_plugins = []
""")

    return interp

def load_plugin(interp, plugin_code):
    """Load and run user plugin code in isolation."""
    try:
        interp.exec(plugin_code)
        return True
    except interpreters.ExecutionFailed as e:
        # Raised when the plugin code itself raises an uncaught exception
        print(f"Plugin error: {e}")
        return False

# User-supplied plugin (potentially buggy)
plugin = """
def process_data(data):
    return json.dumps({"input": data, "timestamp": time.time()})

result = process_data({"x": 1, "y": 2})
print(f"Plugin result: {result}")
"""

# Run in isolated interpreter
interp = create_plugin_env("user_plugin")
load_plugin(interp, plugin)
interp.close()

Decision Tree: Which to Use?

  1. Is the work I/O-bound (network, disk, DB)? → Use threads (simplest, lowest latency).
  2. Is the work CPU-bound?
    • Running on GIL-bound Python? → Use processes (multiprocessing.Pool).
    • Running on free-threaded Python?
      • Need isolation (plugins, multi-tenant)? → Use subinterpreters.
      • Just need parallelism? → Use threads (simpler, lower overhead).
  3. Do I need a worker pool (preload models, avoid startup)? → Use subinterpreters with interpreters.create() at startup.
  4. Do I need to limit resource usage or kill long-running tasks? → Use processes (OS-level limits) or subinterpreters (Python-level limit, but not as strong).
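
Steps 1 and 2 of the tree can be written down as a small function. This is only a sketch: choose_concurrency and its parameters are hypothetical names, and the worker-pool and resource-limit questions (steps 3 and 4) are left out because they depend on deployment details.

```python
def choose_concurrency(io_bound, gil_enabled=True, needs_isolation=False):
    """Return the suggested model for steps 1 and 2 of the decision tree."""
    if io_bound:
        return "threads"          # simplest, lowest latency
    if gil_enabled:
        return "processes"        # e.g. multiprocessing.Pool
    if needs_isolation:
        return "subinterpreters"  # plugins, multi-tenant code
    return "threads"              # free-threaded, just need parallelism

print(choose_concurrency(io_bound=True))
print(choose_concurrency(io_bound=False, gil_enabled=True))
print(choose_concurrency(io_bound=False, gil_enabled=False, needs_isolation=True))
```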

Memory and Startup Comparison

Benchmark on a typical machine:

| Approach | Startup Time | Memory per Worker |
|---|---|---|
| 10 threads | <1 ms | 0.5 MB |
| 10 processes | ~500 ms | 25 MB |
| 10 subinterpreters | ~50 ms (if pre-created) | 3 MB |

Startup time dominates for short-lived tasks. Memory is a concern for large pools.
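
These numbers vary a lot between machines, so it is worth measuring your own. The sketch below times thread startup against launching a fresh Python process. The function names are illustrative, and launching a full interpreter with subprocess is an assumption that gives an upper bound: fork-based pools usually start faster.

```python
import subprocess
import sys
import threading
import time

def thread_startup_ms(n=10):
    """Average milliseconds to start and join n do-nothing threads."""
    start = time.perf_counter()
    threads = [threading.Thread(target=lambda: None) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return (time.perf_counter() - start) * 1000 / n

def process_startup_ms(n=3):
    """Average milliseconds to launch a fresh interpreter that exits at once."""
    start = time.perf_counter()
    for _ in range(n):
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return (time.perf_counter() - start) * 1000 / n

thread_ms = thread_startup_ms()
process_ms = process_startup_ms()
print(f"thread: {thread_ms:.3f} ms each, process: {process_ms:.1f} ms each")
```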

Recommendation Summary

  • Web servers (HTTP requests): Use asyncio or threads with ThreadPoolExecutor. Threads are simpler; asyncio is more scalable.
  • Data processing (CPU-heavy): Use multiprocessing.Pool (GIL-bound) or threads (free-threaded).
  • Real-time services (latency-critical): Prefer free-threaded threads over processes (500 ms startup overhead is too much).
  • Batch jobs (throughput-critical): Multiprocessing or subinterpreter pools; startup cost is amortized.
  • Sandboxing (security): Use subinterpreters if you control the code, separate processes if you don't trust it.

Key Takeaways

  • Threads: low overhead, best for I/O; GIL limits CPU parallelism on GIL-bound Python.
  • Processes: high overhead, true isolation, mandatory for CPU work on GIL-bound Python.
  • Subinterpreters: medium overhead, Python-level isolation within a shared address space, each with its own GIL; best for worker pools and for isolating code you control.
  • Choose based on workload type (I/O vs CPU), isolation needs, and startup latency requirements.

Frequently Asked Questions

Can I mix threads and processes in one app?

Yes. Use multiprocessing to spawn worker processes, and within each process, use threading for I/O-bound tasks. This is common in request handlers that both make network calls and spawn subprocess utilities.

Why would I use subinterpreters instead of processes?

Subinterpreters offer lower overhead (startup, memory) and shared address space (useful for caching, pre-loaded models). Use processes if you need OS-level isolation (security, resource limits) or if running untrusted code.

Does asyncio make threads obsolete?

No. asyncio is excellent for I/O-bound work in a single thread. Threads are better for blocking I/O that cannot be awaited (e.g., blocking database drivers that don't cooperate with asyncio). Use whichever fits your libraries and code style.
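
The two also combine. Below is a minimal sketch that pushes a blocking call onto worker threads with asyncio.to_thread (Python 3.9+), so the event loop stays responsive. blocking_query is a hypothetical stand-in for a driver call that cannot be awaited.

```python
import asyncio
import time

def blocking_query(n):
    """Hypothetical stand-in for a blocking database driver call."""
    time.sleep(0.05)
    return n * 2

async def main():
    # Each call runs in a worker thread; gather preserves argument order.
    return await asyncio.gather(
        *(asyncio.to_thread(blocking_query, n) for n in range(5))
    )

results = asyncio.run(main())
print(results)
```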

What if my code uses both CPU-bound and I/O-bound work?

Use threads for I/O. For the CPU-bound part, use a thread pool on free-threaded Python or a multiprocessing pool on GIL-bound Python. Frameworks like ray can manage the worker pool for you.
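
One way to make that choice at runtime is to check whether the GIL is active and pick the executor class accordingly. This is a sketch under the assumption that sys._is_gil_enabled() is available (it was added in Python 3.13); on older versions the helper assumes a GIL. gil_enabled and cpu_executor_class are illustrative names.

```python
import sys
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def gil_enabled():
    """True on GIL-bound builds; sys._is_gil_enabled() exists on Python 3.13+."""
    check = getattr(sys, "_is_gil_enabled", None)
    return True if check is None else check()

def cpu_executor_class():
    """Processes when the GIL is active, threads when it is not."""
    return ProcessPoolExecutor if gil_enabled() else ThreadPoolExecutor

print(f"GIL enabled: {gil_enabled()}, CPU pool: {cpu_executor_class().__name__}")
```

I/O-bound work would go to a ThreadPoolExecutor in either case. Because both executors share the same submit and map interface, the calling code does not need to change.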

Is free-threaded Python production-ready in 2026?

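Mostly. Free-threaded builds were experimental in Python 3.13 and became officially supported, though still optional, in Python 3.14 (PEP 779). Many major libraries (NumPy, Pandas, Torch) ship or are adding free-threaded wheels. Test in staging first; some C extensions may not have free-threaded support yet.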

Further Reading