
Building Thread-Safe Code for Free-Threaded Python

Free-threaded Python executes multiple threads truly in parallel. True parallelism exposes race conditions in code that was "safe" on GIL-bound Python only by accident (because the GIL serialized execution). I've debugged production outages caused by race conditions that never manifested under the GIL; this article teaches the patterns that keep your code safe as it scales.

Thread-safety is about preventing data corruption when multiple threads access shared state concurrently. On free-threaded Python, any shared mutable object is vulnerable. Protect it with locks, atomic operations, or by avoiding sharing altogether.

The Problem: Race Conditions

The scheduler can interleave threads at arbitrary points. Consider a shared counter:

import threading

counter = 0

def increment():
    """Increment the counter 100,000 times."""
    global counter
    for _ in range(100_000):
        counter += 1

# Run two threads
t1 = threading.Thread(target=increment)
t2 = threading.Thread(target=increment)

t1.start()
t2.start()
t1.join()
t2.join()

print(f"Counter: {counter}")
# Expected: 200,000
# Actual: ~130,000 (varies due to race condition)

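Why? The operation counter += 1 compiles to four bytecode instructions (shown for CPython 3.11+; exact opcode names vary between versions, e.g. INPLACE_ADD instead of BINARY_OP on 3.10 and earlier):

LOAD_GLOBAL counter
LOAD_CONST 1
BINARY_OP +=
STORE_GLOBAL counter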

If Thread A reads counter (50), then Thread B reads counter (50), both add 1, and both write 51, the increment is lost. The counter should be 52; instead, it's 51.
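
To see the separate load, add, and store steps on your own interpreter, you can disassemble a small function with the standard dis module. This is a minimal sketch; increment_once is a hypothetical helper introduced only for illustration, and the printed opcode names depend on the CPython version.

```python
import dis

counter = 0

def increment_once():
    """Hypothetical helper: a single, unprotected increment."""
    global counter
    counter += 1

# Collect the opcode names for the function body.
# Names differ between CPython versions, so no fixed output is shown here.
opnames = [ins.opname for ins in dis.get_instructions(increment_once)]
print(opnames)
```

Whatever the exact names, the load and the store are distinct instructions, which is the window in which another thread can slip in.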

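On GIL-bound Python, the GIL lets only one thread execute bytecode at a time and switches threads only at certain points, so this race rarely shows up in practice, even though the language never guaranteed that counter += 1 is atomic. On free-threaded Python, nothing hides it. Test it: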

# Run the above script on free-threaded Python 3.13
# Result: counter is NOT 200,000; it's a lower value
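
Before comparing results, it helps to confirm which kind of interpreter you are actually running. The sketch below assumes CPython: sysconfig reports whether the build supports free threading, and sys._is_gil_enabled() (added in Python 3.13) reports whether the GIL is active at runtime. The gil_status helper is a hypothetical name used for illustration.

```python
import sys
import sysconfig

def gil_status():
    """Return (free_threaded_build, gil_enabled) for the running interpreter."""
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on Python 3.13+; older versions always use the GIL.
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = bool(check()) if check is not None else True
    return free_threaded_build, gil_enabled

free_threaded_build, gil_enabled = gil_status()
print(f"free-threaded build: {free_threaded_build}, GIL enabled: {gil_enabled}")
```

A free-threaded build can still run with the GIL enabled (for example, when an extension module requests it), so checking both values avoids drawing conclusions from the wrong configuration.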

Solution 1: Locks (Most Common)

A Lock is a mutual exclusion primitive. Only one thread can hold it at a time. Use threading.Lock() and a context manager (with lock:) to protect shared state:

import threading

counter = 0
lock = threading.Lock()

def increment():
    """Increment the counter 100,000 times, safely."""
    global counter
    for _ in range(100_000):
        with lock:  # Acquire lock before modifying
            counter += 1

# Run two threads
t1 = threading.Thread(target=increment)
t2 = threading.Thread(target=increment)

t1.start()
t2.start()
t1.join()
t2.join()

print(f"Counter: {counter}")
# Now: 200,000 (correct)

The lock ensures only one thread modifies counter at a time. The cost is lock acquisition (roughly 100-500 nanoseconds when uncontended, depending on hardware and interpreter build) plus potential blocking while one thread waits for another. Minimize critical sections:

import threading

shared_data = {"count": 0, "total": 0}
lock = threading.Lock()

def process():
    """Do some work, then update shared data."""
    # Heavy computation (no lock needed)
    result = sum(range(1_000_000))

    # Update both fields together under the lock
    with lock:
        shared_data["count"] += 1
        shared_data["total"] += result

threads = [threading.Thread(target=process) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared_data)

This scales better: threads only contend on the lock during the update (with lock: block), not during computation.
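
The same idea applies to the earlier counter: instead of taking the lock on every increment, each thread can accumulate into a local variable and merge once at the end. This is a minimal sketch; count_locally and total_lock are hypothetical names introduced for illustration.

```python
import threading

total = 0
total_lock = threading.Lock()

def count_locally(n):
    """Count in a thread-private variable, then merge once under the lock."""
    global total
    local = 0
    for _ in range(n):
        local += 1          # no sharing, so no lock needed
    with total_lock:        # one lock acquisition per thread
        total += local

threads = [threading.Thread(target=count_locally, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(total)  # 400000
```

Each thread acquires the lock once rather than 100,000 times, so contention is negligible while the final total stays exact.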

Solution 2: Thread-Safe Data Structures

The queue.Queue class is designed for thread-safe producer-consumer patterns. Multiple threads can safely put and get items without explicit locks:

import queue
import threading

# Thread-safe queue
q = queue.Queue(maxsize=10)

def producer():
    """Put items in the queue."""
    for i in range(100):
        q.put(f"Item {i}")
        print(f"Produced: Item {i}")

def consumer():
    """Get items from the queue."""
    while True:
        item = q.get()  # Blocks until item available
        if item is None:
            break  # Sentinel value; stop
        print(f"Consumed: {item}")
        q.task_done()

# Start threads
t_prod = threading.Thread(target=producer)
t_cons = [threading.Thread(target=consumer) for _ in range(2)]

t_prod.start()
for t in t_cons:
    t.start()

# Wait for the producer first: q.join() could otherwise return early,
# before any item has been put on the queue
t_prod.join()

# Wait for queue to drain
q.join()

# Send sentinel values to stop consumers
for _ in t_cons:
    q.put(None)

for t in t_cons:
    t.join()

print("All done")

The Queue handles synchronization internally. No explicit locks needed; the API prevents you from shooting yourself in the foot.
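
Queues also work in the other direction: workers can hand results back through a second queue instead of appending to a shared list under a lock. The sketch below is illustrative only; the tasks and results queues, the STOP sentinel, and the worker function are hypothetical names.

```python
import queue
import threading

tasks = queue.Queue()
results = queue.Queue()
STOP = object()  # unique sentinel that cannot collide with real work items

def worker():
    """Square each number taken from tasks and publish it on results."""
    while True:
        n = tasks.get()
        if n is STOP:
            break
        results.put(n * n)

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()

for n in range(10):
    tasks.put(n)
for _ in workers:
    tasks.put(STOP)  # one sentinel per worker
for w in workers:
    w.join()

# Completion order is nondeterministic, so sort before comparing.
squares = sorted(results.get() for _ in range(10))
print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

No thread touches another thread's data directly; all hand-offs go through the two queues.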

Solution 3: Atomic Operations and Immutability

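Some operations are atomic (indivisible) by nature. In CPython, assigning to an existing dict key or list index is effectively atomic: other threads see either the old value or the new one, never a half-written state. Treat this as an implementation detail rather than a language guarantee: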

import collections
import threading

# Atomic: assigning to a dictionary value (not the dict structure)
config = {"x": 1, "y": 2}

def update_config(key, value):
    """Update a single existing key."""
    # A single assignment is atomic in CPython; no lock needed
    config[key] = value

# Compound updates (read-modify-write such as config["x"] += 1, or
# check-then-insert) are not atomic. Use a lock for those, or whenever
# you don't want to rely on CPython implementation details.

# Better: immutability
ConfigTuple = collections.namedtuple("ConfigTuple", ["x", "y"])

config = ConfigTuple(1, 2)
lock = threading.Lock()

def update_config_immutable(key, value):
    """Update config safely (immutable)."""
    global config
    with lock:
        # Create a new tuple with the update
        new_config = config._replace(**{key: value})
        config = new_config

t1 = threading.Thread(target=update_config_immutable, args=("x", 10))
t2 = threading.Thread(target=update_config_immutable, args=("y", 20))

t1.start()
t2.start()
t1.join()
t2.join()

print(config)  # ConfigTuple(x=10, y=20)

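Immutability eliminates races: readers can grab a reference to the current tuple without a lock because no thread ever mutates it, and writers build a new tuple and swap the reference under the lock (one writer at a time).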
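
The reader side of this pattern is worth seeing explicitly: a reader takes a snapshot once and keeps using it, unaffected by later updates. The sketch below uses a frozen dataclass instead of a namedtuple; Settings, get_settings, and update_settings are hypothetical names chosen for illustration.

```python
import threading
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Settings:
    timeout: float = 1.0
    retries: int = 3

_current = Settings()
_write_lock = threading.Lock()

def get_settings():
    """Readers take a reference; the object behind it never changes."""
    return _current

def update_settings(**changes):
    """Writers build a new object and swap the reference under the lock."""
    global _current
    with _write_lock:
        _current = replace(_current, **changes)

snapshot = get_settings()  # taken before any update

t1 = threading.Thread(target=update_settings, kwargs={"timeout": 5.0})
t2 = threading.Thread(target=update_settings, kwargs={"retries": 10})
for t in (t1, t2):
    t.start()
for t in (t1, t2):
    t.join()

print(snapshot)        # Settings(timeout=1.0, retries=3)
print(get_settings())  # Settings(timeout=5.0, retries=10)
```

Because replace() reads the current value inside the lock, neither update overwrites the other, and the old snapshot stays internally consistent for any reader still holding it.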

Solution 4: Thread-Local Storage

For data that should not be shared, use thread-local storage (threading.local()):

import threading

# Each thread gets its own copy
thread_local = threading.local()

def worker(name):
    """Each thread has its own counter (not shared)."""
    thread_local.counter = 0
    for _ in range(1000):
        thread_local.counter += 1
    print(f"{name}: counter = {thread_local.counter}")

threads = [threading.Thread(target=worker, args=(f"T{i}",)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Each thread printed 1000 (correct), no races

Thread-local storage is useful for request context in web frameworks (each request runs on its own thread and shouldn't see other requests' data).
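
A rough sketch of that request-context idea follows, assuming one thread per request. The names _ctx, handle_request, do_work, and current_request_id are hypothetical and not taken from any particular framework.

```python
import threading

_ctx = threading.local()
seen = {}
seen_lock = threading.Lock()

def current_request_id():
    """Return the request id bound to the calling thread, if any."""
    return getattr(_ctx, "request_id", None)

def do_work():
    # Deep in the call stack: no request_id parameter needed.
    return current_request_id()

def handle_request(request_id):
    _ctx.request_id = request_id   # visible only to this thread
    observed = do_work()
    with seen_lock:                # the results dict IS shared, so lock it
        seen[request_id] = observed

threads = [threading.Thread(target=handle_request, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(seen == {0: 0, 1: 1, 2: 2, 3: 3})  # True
print(current_request_id())              # None (main thread never set it)
```

Each handler sees only the id it stored, and the main thread, which never set one, sees none.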

Solution 5: Read-Write Locks (Advanced)

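For read-heavy workloads, a read-write lock allows multiple readers concurrently but exclusive writes. The standard library's threading module does not provide one, so the sketch below builds a minimal reader-preferring version from two Lock objects (writers can starve under a constant stream of readers; third-party packages offer more robust implementations):

import threading
from contextlib import contextmanager

class RWLock:
    def __init__(self):
        self._write = threading.Lock()       # held by a writer or by the reader group
        self._count_lock = threading.Lock()  # guards _readers
        self._readers = 0

    @contextmanager
    def read_lock(self):
        with self._count_lock:
            self._readers += 1
            if self._readers == 1:
                self._write.acquire()  # first reader shuts out writers
        try:
            yield
        finally:
            with self._count_lock:
                self._readers -= 1
                if self._readers == 0:
                    self._write.release()  # last reader lets writers in

    def write_lock(self):
        return self._write  # a Lock is itself a context manager

rw_lock = RWLock()
data = {"count": 0}

def reader():
    """Read data (many threads can do this concurrently)."""
    with rw_lock.read_lock():
        print(f"Read: {data}")

def writer():
    """Write data (exclusive access)."""
    with rw_lock.write_lock():
        data["count"] += 1

readers = [threading.Thread(target=reader) for _ in range(10)]
writers = [threading.Thread(target=writer) for _ in range(2)]

for t in readers + writers:
    t.start()
for t in readers + writers:
    t.join()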

Read-write locks scale better for read-heavy workloads (e.g., caching layers where reads vastly outnumber writes).

Pattern: Double-Checked Locking (Lazy Initialization)

Initialize shared state on first access without locking on every read:

import threading

class Singleton:
    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        # Check without lock (fast path)
        if cls._instance is None:
            # Only lock if necessary
            with cls._lock:
                # Double-check (another thread might have initialized)
                if cls._instance is None:
                    cls._instance = super().__new__(cls)
        return cls._instance

# Safe, efficient
s1 = Singleton()
s2 = Singleton()
assert s1 is s2
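
The single-threaded assertion above does not exercise the race itself. The sketch below applies the same double-checked pattern to a module-level resource and uses a threading.Barrier so that eight threads hit the first access at roughly the same time; get_resource and init_calls are hypothetical names added for illustration.

```python
import threading

_resource = None
_resource_lock = threading.Lock()
init_calls = 0  # counts how often the expensive initialization ran

def get_resource():
    """Lazily create the shared resource exactly once."""
    global _resource, init_calls
    if _resource is None:                 # fast path, no lock
        with _resource_lock:
            if _resource is None:         # re-check under the lock
                init_calls += 1
                _resource = {"ready": True}
    return _resource

barrier = threading.Barrier(8)
results = []
results_lock = threading.Lock()

def worker():
    barrier.wait()                        # release all threads together
    r = get_resource()
    with results_lock:
        results.append(r)

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(init_calls)                               # 1
print(all(r is results[0] for r in results))    # True
```

The second check inside the lock is what keeps init_calls at 1 even when several threads pass the first check simultaneously.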

Key Takeaways

  • Shared mutable state requires synchronization on free-threaded Python.
  • Use locks (threading.Lock()) to protect critical sections.
  • Use thread-safe collections (queue.Queue, queue.LifoQueue) for producer-consumer patterns.
  • Immutability and thread-local storage eliminate sharing and races.
  • Minimize lock contention; lock only the critical path.
  • Test multi-threaded code with pytest-stress or similar to expose races.
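
As a rough illustration of the last point, a stress test can be as simple as starting many threads behind a barrier and repeating the run several times. This is a hand-rolled sketch, not the API of any testing plugin; run_in_parallel, SafeCounter, and test_safe_counter are hypothetical names.

```python
import threading

def run_in_parallel(fn, n_threads=8):
    """Run fn in n_threads threads released together; return raised exceptions."""
    barrier = threading.Barrier(n_threads)
    errors = []

    def runner():
        barrier.wait()  # maximize overlap between threads
        try:
            fn()
        except Exception as exc:
            errors.append(exc)

    threads = [threading.Thread(target=runner) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors

class SafeCounter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def bump(self):
        with self._lock:
            self.value += 1

def test_safe_counter():
    # Repeat the run: races are probabilistic, so one pass proves little.
    for _ in range(20):
        c = SafeCounter()
        errors = run_in_parallel(lambda: [c.bump() for _ in range(1000)])
        assert not errors
        assert c.value == 8 * 1000

test_safe_counter()
print("stress test passed")
```

A passing stress test does not prove the absence of races, but a failing one is a cheap and reliable signal that a lock is missing.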

Frequently Asked Questions

Do I need to worry about thread-safety on GIL-bound Python?

Yes. The GIL makes individual bytecode instructions and many single built-in calls atomic, but not sequences of them. E.g., items.append(x) is atomic, but if x not in items: items.append(x) is not (the check and the append are separate steps, and another thread can run between them). Always use locks if multiple threads access mutable shared data.
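
A minimal sketch of the check-then-act fix: hold one lock across both the membership test and the append. The names items, items_lock, and add_unique are hypothetical.

```python
import threading

items = []
items_lock = threading.Lock()

def add_unique(x):
    """Append x only if absent; the check and the append happen under one lock."""
    with items_lock:
        if x not in items:
            items.append(x)

def worker():
    for n in range(50):
        add_unique(n)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(items))  # 50
```

Without the lock, two threads could both pass the membership test for the same value and append it twice.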

Is threading.Lock() the same as threading.RLock()?

No. RLock() is a reentrant lock; the same thread can acquire it multiple times (useful for recursive functions). Lock() will deadlock if the same thread tries to acquire twice. Use Lock() by default; use RLock() only if needed.
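
The difference can be observed safely without deadlocking by using a non-blocking acquire on the plain lock. This is a small illustrative sketch; the variable names are arbitrary.

```python
import threading

plain = threading.Lock()
reentrant = threading.RLock()

# The owning thread may re-acquire an RLock.
with reentrant:
    with reentrant:
        nested_ok = True

# A plain Lock cannot be re-acquired, even by the thread that holds it.
# blocking=False returns False instead of waiting forever.
with plain:
    got_it = plain.acquire(blocking=False)

print(nested_ok)  # True
print(got_it)     # False
```

With the default blocking acquire, the second attempt on the plain lock would hang forever.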

Can I use a lock from a subinterpreter?

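No. Locks are tied to the interpreter that created them and cannot be shared across interpreters. Pass data between subinterpreters through cross-interpreter queues or channels instead (see PEP 734; exposed as concurrent.interpreters in Python 3.14).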

What's the performance overhead of locks?

Uncontended locks cost roughly 100-500 nanoseconds per acquisition, depending on hardware and interpreter build. Contended locks block, adding context-switch overhead (roughly 1-10 microseconds). If lock contention is high, refactor to reduce critical sections or use thread-local storage.
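
Rather than relying on ballpark figures, you can measure the uncontended cost on your own machine with timeit. The sketch below is a rough micro-benchmark under the assumption that subtracting the cost of an empty function call isolates the lock overhead; results vary with hardware, Python version, and build, so no expected number is given.

```python
import threading
import timeit

lock = threading.Lock()

def locked_noop():
    with lock:  # uncontended: only this thread ever takes the lock
        pass

def bare_noop():
    pass

n = 200_000
locked = timeit.timeit(locked_noop, number=n)
bare = timeit.timeit(bare_noop, number=n)

# Approximate per-call overhead of one acquire/release pair.
per_call_ns = (locked - bare) / n * 1e9
print(f"approx. uncontended lock overhead: {per_call_ns:.0f} ns per acquire/release")
```

Run it a few times and compare runs; a single measurement of something this small is easily skewed by noise.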

Is atomic assignment (e.g., x = y) thread-safe?

Yes, assigning to a variable is atomic. But reading and then acting on it is not: if x: y = x + 1 is two operations; another thread can change x between the read and the addition. Use locks for compound operations.

Further Reading