Python Multiprocessing: Why True Parallelism Matters

Python's multiprocessing module works around the Global Interpreter Lock (GIL) by running independent OS processes, each with its own Python interpreter, so multiple cores can execute Python bytecode in parallel. Threading in CPython cannot achieve this: the GIL serializes bytecode execution across all threads in a single process, so threads help mainly with I/O-bound work (network, disk). Multiprocessing is the standard tool for CPU-intensive pure-Python work such as matrix math, image processing, and data transformation because each process has its own GIL, which lets you saturate multi-core hardware.

What Is the Global Interpreter Lock (GIL)?

The GIL is a mutex that protects Python's internal memory management in CPython. Python's reference counting system is not thread-safe, so CPython enforces a single lock that ensures only one thread can execute Python bytecode at any moment, even on multi-core systems. This lock is released during I/O operations (network calls, file reads) so threads block naturally while waiting, but for CPU-bound operations—pure Python loops, list processing, numeric computation—the GIL forces threads to take turns, eliminating parallelism.

Example: A CPU-bound task split across 4 threads on a 4-core system typically runs no faster than on 1 thread, and often slightly slower, because the threads compete for the GIL and the contention adds overhead. A single-threaded version does the same work without lock contention.
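The I/O half of that behavior can be seen in a minimal sketch, with time.sleep standing in for a blocking network or disk call (an assumption made for illustration; the helper names fake_io and timed_threads are hypothetical). Because the GIL is released while each thread waits, the four waits should overlap rather than add up.

```python
import threading
import time


def fake_io(delay):
    """Stand-in for a blocking I/O call; time.sleep releases the GIL."""
    time.sleep(delay)


def timed_threads(n_threads, delay):
    """Run n_threads concurrent waits and return the wall-clock time taken."""
    threads = [
        threading.Thread(target=fake_io, args=(delay,))
        for _ in range(n_threads)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = timed_threads(4, 0.2)
    # Sequential waits would take about 4 * 0.2 s; overlapping waits take
    # roughly as long as a single one.
    print(f"4 threads, 0.2 s wait each: {elapsed:.2f}s")
```

Replacing fake_io with a pure-Python loop removes the overlap, which is the CPU-bound case described above.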

Multiprocessing Bypasses the GIL

When you start a process using multiprocessing.Process(), Python launches a separate OS process with its own Python interpreter: a freshly started one under the spawn start method, or a copy of the parent under fork. Each process has its own GIL, memory heap, and bytecode execution state. Because they are separate OS processes, the operating system's scheduler can run them on different cores in true parallel, and their GILs do not interfere.

Practical impact: On a 4-core CPU, you can now achieve near-4x speedup for pure CPU-bound work (minus inter-process overhead). Scaling is nearly linear up to the number of physical cores.
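The isolation described above can be checked with a small sketch. The names state, child_work, and run_isolation_demo are hypothetical, chosen only for this illustration: a child process mutates a module-level dictionary, and the parent then inspects its own copy.

```python
import multiprocessing
import os

# Module-level object: every process ends up with its own copy of it.
state = {"value": 0}


def child_work(queue):
    """Mutate this process's copy of `state` and report what it sees."""
    state["value"] += 100
    queue.put((os.getpid(), state["value"]))


def run_isolation_demo():
    """Start one child, collect its report, and return the parent's view too."""
    queue = multiprocessing.Queue()
    child = multiprocessing.Process(target=child_work, args=(queue,))
    child.start()
    child_pid, child_value = queue.get()  # read the report before joining
    child.join()
    return child_pid, child_value, state["value"]


if __name__ == "__main__":
    child_pid, child_value, parent_value = run_isolation_demo()
    print(f"parent pid={os.getpid()} sees value={parent_value}")
    print(f"child  pid={child_pid} saw value={child_value}")
```

The child reports a different PID and its own updated value, while the parent's copy is unchanged. Any result the parent needs has to be sent back explicitly, here through a Queue.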

Process vs. Thread: Side-by-Side Comparison

Aspect	Thread	Process
Memory model	Shared heap; threads see same objects	Isolated heap; each process has its own memory
GIL behavior	Single GIL serializes all threads	Each process has own GIL; no contention
CPU parallelism	No: the GIL lets only one thread execute Python bytecode at a time	Yes: the OS scheduler runs processes on multiple cores simultaneously
Startup cost	Fast: ~1 ms	Slower: ~50–200 ms when a new interpreter is started (spawn); typically much less with fork
Memory per worker	~1–5 MB thread overhead	~10–50 MB per process
Best for	I/O-bound (network, disk, database queries)	CPU-bound (math, image processing, data transform)
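Synchronization	threading.Lock, RLock, Event, Condition, Semaphore	multiprocessing.Lock, RLock, Event, Condition, Semaphore
IPC mechanism	Shared objects in memory (no IPC needed)	Queue, Pipe, shared memory (multiprocessing.Value, multiprocessing.Array, multiprocessing.shared_memory)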

When Multiprocessing Scales and When It Doesn't

Multiprocessing scales well for:

Embarrassingly parallel tasks (each chunk is independent): batch image resizing, Monte Carlo simulations, data aggregation over large datasets.
CPU-bound loops: numerical algorithms, matrix operations, cryptography.
Tasks where you need to burst—spawn many workers, let them finish, then exit (short-lived processes).

Multiprocessing does NOT scale well for:

Fine-grained parallelism (millions of tiny tasks): process startup (~50–200 ms per worker) and the per-task IPC overhead exceed the per-task compute.
Frequent data exchange: if workers must constantly share large objects, the serialization (pickling) and copy overhead kills performance.
Latency-sensitive real-time systems: process spawning is slower than thread spawning.
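One common way to soften the fine-grained case is to batch tiny tasks so that each inter-process round trip carries more work. The sketch below assumes a trivial per-item task (squaring an integer). The helper names chunked, sum_squares_chunk, and parallel_sum_squares are hypothetical, and the chunk size is an arbitrary starting point rather than a tuned value.

```python
import multiprocessing


def chunked(items, size):
    """Split a list into consecutive sublists of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def sum_squares_chunk(chunk):
    """Process a whole batch in one worker call instead of one item per call."""
    return sum(x * x for x in chunk)


def parallel_sum_squares(items, workers=4, chunk_size=25_000):
    """Send batches, not single items, across the process boundary."""
    batches = chunked(items, chunk_size)
    with multiprocessing.Pool(workers) as pool:
        partial_sums = pool.map(sum_squares_chunk, batches)
    return sum(partial_sums)


if __name__ == "__main__":
    numbers = list(range(100_000))
    print(parallel_sum_squares(numbers))
```

Pool.map also accepts a chunksize argument that batches work internally. Explicit batching as shown here additionally lets each worker return one combined result per batch instead of one result per item.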

Code Comparison: Threading vs. Multiprocessing

Here's a CPU-bound task (computing sum of squares) on a 4-core system:

import threading
import time
import multiprocessing

def cpu_bound_task(n):
    """Compute sum of squares—pure CPU work."""
    return sum(i ** 2 for i in range(n))

# THREADING: GIL limits parallelism
def threading_approach():
    threads = []
    results = []
    
    def worker():
        results.append(cpu_bound_task(10_000_000))
    
    start = time.perf_counter()
    for _ in range(4):
        t = threading.Thread(target=worker)
        threads.append(t)
        t.start()
    
    for t in threads:
        t.join()
    
    elapsed = time.perf_counter() - start
    print(f"Threading (4 threads): {elapsed:.2f}s")

# MULTIPROCESSING: True parallelism
def multiprocessing_approach():
    start = time.perf_counter()
    
    # Pass the module-level function directly: Pool pickles its target,
    # and a function nested inside this one cannot be pickled.
    with multiprocessing.Pool(4) as pool:
        results = pool.map(cpu_bound_task, [10_000_000] * 4)
    
    elapsed = time.perf_counter() - start
    print(f"Multiprocessing (4 processes): {elapsed:.2f}s")

# BASELINE: Single-threaded
def single_threaded():
    start = time.perf_counter()
    for _ in range(4):
        cpu_bound_task(10_000_000)
    elapsed = time.perf_counter() - start
    print(f"Single-threaded (4 iterations): {elapsed:.2f}s")

if __name__ == "__main__":
    # Timings below are illustrative; actual numbers depend on hardware.
    single_threaded()           # ~4.0s (baseline)
    threading_approach()        # ~4.2s (threads contend for the GIL)
    multiprocessing_approach()  # ~1.1s (close to 4x faster)

On a 4-core CPU, you'll see multiprocessing run approximately 4 times faster than the single-threaded baseline, while threading offers almost no speedup because the GIL serializes the work.

Memory Overhead: The Trade-off

Each process costs memory. A minimal Python process occupies ~10–15 MB; with imports (NumPy, pandas) it grows to 30–100 MB. This means:

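On a system with 8 GB RAM and 8 cores, memory allows roughly 50–80 worker processes at the heavier end of that range, but for CPU-bound work there is no speed benefit beyond about one worker per core.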
If you need thousands of workers (e.g., handling thousands of concurrent requests), multiprocessing becomes wasteful; use asyncio instead.
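The arithmetic behind that kind of estimate can be sketched as follows. The function names are hypothetical, and the 25% of RAM reserved for the OS and the parent process is an assumption chosen for illustration, not a measured figure.

```python
import os


def max_workers_by_memory(total_ram_mb, per_worker_mb, reserve_fraction=0.25):
    """Workers that fit in RAM after reserving a fraction for everything else."""
    usable_mb = total_ram_mb * (1 - reserve_fraction)
    return int(usable_mb // per_worker_mb)


def suggested_cpu_workers(total_ram_mb, per_worker_mb, cores=None):
    """For CPU-bound work, cap the memory-based limit at the core count."""
    cores = cores or os.cpu_count() or 1
    by_memory = max_workers_by_memory(total_ram_mb, per_worker_mb)
    return max(1, min(cores, by_memory))


if __name__ == "__main__":
    # 8 GB of RAM, 100 MB per worker (a process with heavy imports)
    print(max_workers_by_memory(8192, 100))           # 61
    print(suggested_cpu_workers(8192, 100, cores=8))  # 8
```

With these inputs the memory ceiling is 61 workers, but the CPU-bound recommendation stays at 8, one per core.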

Track memory per process with:

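import multiprocessing
import time

import psutil  # third-party: pip install psutil

def idle_worker():
    """Stay alive long enough for the parent to inspect this process."""
    time.sleep(5)

if __name__ == "__main__":
    process = multiprocessing.Process(target=idle_worker)
    process.start()
    time.sleep(0.5)  # give the child time to finish starting up

    # Resident set size (RSS) of the child process, in MB
    p = psutil.Process(process.pid)
    print(f"Memory: {p.memory_info().rss / 1024 / 1024:.1f} MB")

    process.terminate()
    process.join()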

Key Takeaways

The GIL prevents Python threads from executing bytecode in parallel, making threads unsuitable for CPU-bound work.
Multiprocessing spawns independent processes, each with its own GIL, allowing true parallelism on multi-core systems.
Use multiprocessing for CPU-bound tasks (math, image processing, data transformation); achieve 2x–8x speedup on 4–8 core systems.
Use threading for I/O-bound tasks (network, disk, database); the GIL is released during I/O, so threads scale well.
Multiprocessing has higher memory overhead (~10–50 MB per process) and slower startup (~50–200 ms), so it's best for coarse-grained parallelism, not millions of tiny tasks.
Measure and profile: CPU-bound and I/O-bound tasks have different scaling characteristics.

Frequently Asked Questions

Can I run thousands of processes on a single machine?

No. Each process costs 10–50 MB of memory and takes 50–200 ms to spawn. On an 8 GB machine, memory allows roughly 100–200 lightweight processes (fewer once each one imports NumPy or pandas); beyond that, memory and startup overhead dominate. For thousands of concurrent I/O-bound operations, use asyncio (lightweight coroutines) instead.

Does multiprocessing work on single-core systems?

Yes, but multiprocessing on a single core has context-switching overhead and no parallelism benefit—you'll actually be slower than single-threaded code. Multiprocessing is only worthwhile on multi-core systems.

Is multiprocessing compatible with NumPy and pandas?

Yes. NumPy releases the GIL during many vectorized operations, and pandas does so for some, so threading can already work well for array-heavy code. However, multiprocessing is still useful when you combine NumPy with pure Python loops or when you need to farm work across many independent processes.

What's the difference between multiprocessing.Pool and ProcessPoolExecutor?

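Pool is older and returns its own AsyncResult objects (from apply_async and map_async) rather than Future objects; ProcessPoolExecutor (from concurrent.futures) provides a cleaner Executor interface and integrates better with async code, for example through asyncio's loop.run_in_executor. For new code, prefer ProcessPoolExecutor.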
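A minimal sketch of the ProcessPoolExecutor interface, assuming a toy CPU-bound function: sum_of_squares and run_jobs are hypothetical names, and the job sizes are arbitrary. Each submit call returns a Future, and as_completed yields the futures as they finish.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def sum_of_squares(n):
    """CPU-bound work; defined at module level so workers can import it."""
    return sum(i * i for i in range(n))


def run_jobs(sizes, max_workers=4):
    """Submit one job per size and collect results as each Future completes."""
    results = {}
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # Map each Future back to the input that produced it.
        futures = {executor.submit(sum_of_squares, n): n for n in sizes}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    print(run_jobs([10, 100, 1000]))
```

Calling future.result() re-raises any exception from the worker in the parent, which keeps error handling close to the submission site.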

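Can I share mutable objects across processes?

Not directly: each process has isolated memory. To share data, use multiprocessing.Queue (process-safe, pickles each item), multiprocessing.Pipe (a two-endpoint connection, bidirectional by default, with less overhead than a Queue), or shared memory via multiprocessing.Value and multiprocessing.Array (low-level, ctypes-backed). See Article 5 and 6 in this series.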
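As a minimal sketch of the shared-memory option, the example below uses multiprocessing.Value to hold one integer that several processes increment. The names add_many and run_shared_counter are hypothetical, and the worker and iteration counts are arbitrary.

```python
import multiprocessing


def add_many(counter, times):
    """Increment a shared integer; the lock guards each read-modify-write."""
    for _ in range(times):
        with counter.get_lock():
            counter.value += 1


def run_shared_counter(workers=4, times=1_000):
    """Start several processes that all update the same shared integer."""
    counter = multiprocessing.Value("i", 0)  # "i" = C signed int in shared memory
    procs = [
        multiprocessing.Process(target=add_many, args=(counter, times))
        for _ in range(workers)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return counter.value


if __name__ == "__main__":
    print(run_shared_counter())  # workers * times = 4000
```

Without the get_lock() block, counter.value += 1 is a separate read and write, so concurrent increments can be lost and the total may come out lower.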
