Building Thread-Safe Code for Free-Threaded Python
Free-threaded Python executes multiple threads truly in parallel. True parallelism exposes race conditions in code that was "safe" on GIL-bound Python only by accident (because the GIL serialized execution). I've debugged production outages caused by race conditions that never manifested under the GIL; this article teaches the patterns that keep your code safe as it scales.
Thread-safety is about preventing data corruption when multiple threads access shared state concurrently. On free-threaded Python, any shared mutable object is vulnerable. Protect it with locks, atomic operations, or by avoiding sharing altogether.
The Problem: Race Conditions
Threads execute interleaved at arbitrary points. Consider a shared counter:
import threading
counter = 0
def increment():
    """Increment the counter 100,000 times."""
    global counter
    for _ in range(100_000):
        counter += 1
# Run two threads
t1 = threading.Thread(target=increment)
t2 = threading.Thread(target=increment)
t1.start()
t2.start()
t1.join()
t2.join()
print(f"Counter: {counter}")
# Expected: 200,000
# Actual on a free-threaded build: often lower, and different on each run
Why? On CPython 3.13, the statement counter += 1 compiles to four bytecode instructions (opcode names differ in older versions):
LOAD_GLOBAL counter
LOAD_CONST 1
BINARY_OP +=
STORE_GLOBAL counter
If Thread A reads counter (50), then Thread B reads counter (50), both add 1, and both write 51, the increment is lost. The counter should be 52; instead, it's 51.
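To check the instruction sequence on your own interpreter, the standard dis module can disassemble a function that performs the increment. This is a minimal sketch: bump is a hypothetical helper name, and the opcode in the middle depends on the CPython version (INPLACE_ADD on 3.10 and earlier, BINARY_OP on 3.11 and later).

```python
import dis

counter = 0

def bump():
    """Hypothetical helper: performs one unsynchronized increment."""
    global counter
    counter += 1

# Collect the opcode names in execution order.
ops = [ins.opname for ins in dis.get_instructions(bump)]
print(ops)

# The load and the store are separate instructions, so another thread
# can run between them.
load_index = ops.index("LOAD_GLOBAL")
store_index = ops.index("STORE_GLOBAL")
print(f"load at position {load_index}, store at position {store_index}")
```

The gap between the two positions is the window in which a second thread can read the same stale value.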
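On GIL-bound Python, only one thread executes bytecode at a time, and recent CPython versions switch threads only at certain points (such as loop back-edges and function calls), so this particular loop rarely, if ever, loses updates. That is an implementation detail, not a guarantee. On free-threaded Python, both threads run these instructions at the same time and the lost updates show up readily. Test it: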
# Run the above script on free-threaded Python 3.13
# Result: counter is NOT 200,000; it's a lower value
Solution 1: Locks (Most Common)
A Lock is a mutual exclusion primitive. Only one thread can hold it at a time. Use threading.Lock() and a context manager (with lock:) to protect shared state:
import threading
counter = 0
lock = threading.Lock()
def increment():
    """Increment the counter 100,000 times, safely."""
    global counter
    for _ in range(100_000):
        with lock:  # Acquire lock before modifying
            counter += 1
# Run two threads
t1 = threading.Thread(target=increment)
t2 = threading.Thread(target=increment)
t1.start()
t2.start()
t1.join()
t2.join()
print(f"Counter: {counter}")
# Now: 200,000 (correct)
The lock ensures only one thread modifies counter at a time. The cost has two parts: acquiring the lock (roughly 100-500 nanoseconds when uncontended, depending on hardware) and blocking (one thread waits while another holds the lock). Keep critical sections as short as possible:
import threading
shared_data = {"count": 0, "total": 0}
lock = threading.Lock()
def process():
    """Do some work, then update shared data."""
    # Heavy computation (no lock needed)
    result = sum(range(1_000_000))
    # Update both fields together under the lock
    with lock:
        shared_data["count"] += 1
        shared_data["total"] += result
threads = [threading.Thread(target=process) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_data)
This scales better: threads only contend on the lock during the update (with lock: block), not during computation.
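The same idea applies to the counter from earlier: each thread can count in a private variable and publish its subtotal once. This is a minimal sketch; increment_batched and the choice of four threads are illustrative, not taken from a particular codebase.

```python
import threading

counter = 0
lock = threading.Lock()

def increment_batched(n):
    """Count locally, then publish the subtotal under the lock once."""
    global counter
    local_count = 0
    for _ in range(n):
        local_count += 1  # Thread-private; no lock needed
    with lock:
        counter += local_count  # One lock acquisition per thread

threads = [
    threading.Thread(target=increment_batched, args=(50_000,))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"Counter: {counter}")  # Counter: 200000
```

Each thread now acquires the lock once instead of 50,000 times, so contention nearly disappears while the result stays correct.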
Solution 2: Thread-Safe Data Structures
The queue.Queue class is designed for thread-safe producer-consumer patterns. Multiple threads can safely push and pop without explicit locks:
import queue
import threading
# Thread-safe queue
q = queue.Queue(maxsize=10)
def producer():
    """Put items in the queue."""
    for i in range(100):
        q.put(f"Item {i}")
        print(f"Produced: Item {i}")
def consumer():
    """Get items from the queue."""
    while True:
        item = q.get()  # Blocks until item available
        if item is None:
            break  # Sentinel value; stop
        print(f"Consumed: {item}")
        q.task_done()
# Start threads
t_prod = threading.Thread(target=producer)
t_cons = [threading.Thread(target=consumer) for _ in range(2)]
t_prod.start()
for t in t_cons:
    t.start()
# Wait for the producer to finish first; otherwise q.join() can return
# while items are still being produced
t_prod.join()
# Wait for queue to drain
q.join()
# Send sentinel values to stop consumers
for _ in t_cons:
    q.put(None)
for t in t_cons:
    t.join()
print("All done")
The Queue handles synchronization internally. No explicit locks needed; the API prevents you from shooting yourself in the foot.
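Queues also work for collecting results, not just distributing work. The sketch below assumes a hypothetical task (squaring integers) and uses None as the stop sentinel, which only works if None is never a real task.

```python
import queue
import threading

tasks = queue.Queue()
results = queue.Queue()
STOP = None  # Sentinel; assumes None is never a real task

def worker():
    """Square each number taken from tasks and put it on results."""
    while True:
        n = tasks.get()
        if n is STOP:
            break
        results.put(n * n)

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()

for n in range(20):
    tasks.put(n)
for _ in workers:
    tasks.put(STOP)  # One sentinel per worker
for w in workers:
    w.join()

# All workers have exited, so draining results is single-threaded here
# and qsize() is exact.
squares = sorted(results.get() for _ in range(results.qsize()))
print(squares)
```

Results arrive in whatever order the workers finish, which is why the sketch sorts them before printing.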
Solution 3: Atomic Operations and Immutability
Some operations are atomic (indivisible) by nature. In CPython, a single operation on a built-in dict or list, such as assigning to one key or index, is protected by the object's internal lock on free-threaded builds, so it cannot corrupt the container. Treat this as an implementation detail rather than a language guarantee, and note that it never covers compound operations (read, then write):
import threading
# A single item assignment on a dict is one operation
config = {"x": 1, "y": 2}
def update_config(key, value):
    """Overwrite one config value."""
    # One assignment; CPython's internal dict locking keeps the dict
    # consistent, so this line alone needs no explicit lock
    config[key] = value
# This covers single operations only. A compound update such as
# config[key] = config[key] + 1, or a check-then-insert, still needs a lock.
# Better: immutability
import collections
ConfigTuple = collections.namedtuple("ConfigTuple", ["x", "y"])
config = ConfigTuple(1, 2)
lock = threading.Lock()
def update_config_immutable(key, value):
    """Update config safely (immutable)."""
    global config
    with lock:
        # Create a new tuple with the update
        new_config = config._replace(**{key: value})
        config = new_config
t1 = threading.Thread(target=update_config_immutable, args=("x", 10))
t2 = threading.Thread(target=update_config_immutable, args=("y", 20))
t1.start()
t2.start()
t1.join()
t2.join()
print(config)  # ConfigTuple(x=10, y=20)
Immutability eliminates races: threads read without locks, and updates are atomic (one writer at a time).
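The benefit is clearest when two fields must stay consistent with each other. The sketch below assumes a hypothetical Pair whose invariant is y == 2 * x. Writers replace the whole tuple under a lock, and readers take one reference and never observe a half-updated pair.

```python
import collections
import threading

# Hypothetical record with the invariant y == 2 * x.
Pair = collections.namedtuple("Pair", ["x", "y"])
current = Pair(0, 0)
write_lock = threading.Lock()
violations = []  # Stays empty if readers only ever see consistent pairs

def writer(n):
    global current
    for _ in range(n):
        with write_lock:
            x = current.x + 1
            current = Pair(x, 2 * x)  # Publish a fully built tuple

def reader(n):
    for _ in range(n):
        snap = current  # One reference read; the tuple itself never changes
        if snap.y != 2 * snap.x:
            violations.append(snap)

threads = [threading.Thread(target=writer, args=(5_000,)) for _ in range(2)]
threads += [threading.Thread(target=reader, args=(5_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(current)          # Pair(x=10000, y=20000)
print(len(violations))  # 0
```

Readers must copy the reference into a local variable (snap) first. Reading current.x and then current.y separately could mix fields from two different snapshots.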
Solution 4: Thread-Local Storage
For data that should not be shared, use thread-local storage (threading.local()):
import threading
# Each thread gets its own copy
thread_local = threading.local()
def worker(name):
    """Each thread has its own counter (not shared)."""
    thread_local.counter = 0
    for _ in range(1000):
        thread_local.counter += 1
    print(f"{name}: counter = {thread_local.counter}")
threads = [threading.Thread(target=worker, args=(f"T{i}",)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Each thread printed 1000 (correct), no races
Thread-local storage is useful for request context in web frameworks (each request runs on its own thread and shouldn't see other requests' data).
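A request-context version might look like the sketch below. The names handle_request and current_request_id are hypothetical stand-ins for framework code. The barrier forces all threads to store their ID before any of them reads it back, which would expose cross-talk if the storage were shared.

```python
import threading

context = threading.local()
barrier = threading.Barrier(8)
seen = {}
seen_lock = threading.Lock()

def current_request_id():
    """Hypothetical helper deep in the call stack; needs no parameters."""
    return context.request_id

def handle_request(request_id):
    context.request_id = request_id  # Visible only to this thread
    barrier.wait()                   # Every thread has stored its own ID
    observed = current_request_id()
    with seen_lock:                  # The results dict is shared, so lock it
        seen[request_id] = observed

threads = [threading.Thread(target=handle_request, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(seen.items()))
```

Every thread reads back exactly the ID it stored, even though all eight wrote to the same attribute name.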
Solution 5: Read-Write Locks (Advanced)
For read-heavy workloads, a read-write lock allows multiple readers concurrently but exclusive writes. The standard library's threading module does not provide one as of Python 3.13, so this example assumes an RWLock class (your own or from a third-party package) that exposes read_lock() and write_lock() context managers:
import threading
rw_lock = RWLock()  # Assumed class; not part of the threading module
data = {"count": 0}
def reader():
    """Read data (many threads can do this concurrently)."""
    with rw_lock.read_lock():
        print(f"Read: {data}")
def writer():
    """Write data (exclusive access)."""
    with rw_lock.write_lock():
        data["count"] += 1
# Many readers can run in parallel
readers = [threading.Thread(target=reader) for _ in range(10)]
writers = [threading.Thread(target=writer) for _ in range(2)]
for t in readers + writers:
    t.start()
for t in readers + writers:
    t.join()
Read-write locks scale better for read-heavy workloads (e.g., caching layers where reads vastly outnumber writes).
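If no RWLock class is available in your environment, one can be sketched on top of threading.Condition. The version below is a minimal illustration, not a hardened implementation: the read_lock()/write_lock() interface is an assumption chosen to match the example above, readers are preferred (so a steady stream of readers can starve writers), and upgrading a read lock to a write lock in the same thread would deadlock.

```python
import threading
from contextlib import contextmanager

class RWLock:
    """Minimal reader-preferring read-write lock (illustrative sketch)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0  # Threads currently inside read_lock()

    @contextmanager
    def read_lock(self):
        with self._cond:
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                self._cond.notify_all()  # Wake writers waiting for zero readers

    @contextmanager
    def write_lock(self):
        # The condition's lock stays held for the whole write, so new
        # readers and other writers block when they try to enter.
        with self._cond:
            self._cond.wait_for(lambda: self._readers == 0)
            yield

rw_lock = RWLock()
data = {"count": 0}
snapshots = []
snapshots_lock = threading.Lock()

def reader():
    with rw_lock.read_lock():
        value = data["count"]
    with snapshots_lock:
        snapshots.append(value)

def writer():
    for _ in range(1000):
        with rw_lock.write_lock():
            data["count"] += 1

threads = [threading.Thread(target=reader) for _ in range(10)]
threads += [threading.Thread(target=writer) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(data["count"])  # 2000
```

Whether this beats a plain Lock depends on how long reads take. For very short critical sections, the extra bookkeeping can cost more than it saves, so measure before adopting it.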
Pattern: Double-Checked Locking (Lazy Initialization)
Initialize shared state on first access without locking on every read:
import threading
class Singleton:
    _instance = None
    _lock = threading.Lock()
    def __new__(cls):
        # Check without lock (fast path)
        if cls._instance is None:
            # Only lock if necessary
            with cls._lock:
                # Double-check (another thread might have initialized)
                if cls._instance is None:
                    cls._instance = super().__new__(cls)
        return cls._instance
# Safe, efficient
s1 = Singleton()
s2 = Singleton()
assert s1 is s2
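The same pattern works for a module-level resource, and it is worth exercising with real contention. The sketch below assumes a hypothetical get_resource function whose setup is a plain dict standing in for something expensive. A barrier releases eight threads at once so they all race for the first call.

```python
import threading

_resource = None
_resource_lock = threading.Lock()
init_calls = 0  # Counts how often the setup actually ran

def get_resource():
    """Return the shared resource, creating it on first use."""
    global _resource, init_calls
    if _resource is None:                    # Fast path: no lock once set
        with _resource_lock:
            if _resource is None:            # Re-check while holding the lock
                init_calls += 1              # Guarded by the lock
                _resource = {"ready": True}  # Stand-in for expensive setup
    return _resource

barrier = threading.Barrier(8)
results = [None] * 8

def worker(i):
    barrier.wait()               # Release all threads at the same moment
    results[i] = get_resource()  # Each thread writes only its own slot

threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(init_calls)  # 1
print(all(r is results[0] for r in results))  # True
```

Build the object completely before assigning it to the shared name. A thread on the fast path uses whatever it finds there without waiting for the lock.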
Key Takeaways
- Shared mutable state requires synchronization on free-threaded Python.
- Use locks (threading.Lock()) to protect critical sections.
- Use thread-safe collections (queue.Queue, queue.LifoQueue) for producer-consumer patterns.
- Immutability and thread-local storage eliminate sharing and races.
- Minimize lock contention; lock only the critical path.
- Test multi-threaded code with pytest-stress or similar to expose races.
Frequently Asked Questions
Do I need to worry about thread-safety on GIL-bound Python?
Yes. The GIL makes individual bytecode instructions and many single built-in calls atomic, but not sequences of them. E.g., lst.append(x) is atomic, but if x not in lst: lst.append(x) is not (the check and the append are separate steps, and another thread can run between them). Always use locks when multiple threads perform compound operations on mutable shared data.
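A minimal sketch of the locked check-then-append, assuming a hypothetical add_unique helper and four threads that all try to add the same 100 items:

```python
import threading

seen = []
seen_lock = threading.Lock()

def add_unique(item):
    """Append item only if absent; the lock makes check + append one step."""
    with seen_lock:
        if item not in seen:
            seen.append(item)

def worker():
    for item in range(100):
        add_unique(item)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(seen))  # 100
```

Without the lock, two threads could both pass the membership check for the same item and both append it.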
Is threading.Lock() the same as threading.RLock()?
No. RLock() is a reentrant lock; the same thread can acquire it multiple times (useful for recursive functions). Lock() will deadlock if the same thread tries to acquire twice. Use Lock() by default; use RLock() only if needed.
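The difference shows up when one locked method calls another. In this sketch, Account is a hypothetical class, and the plain Lock is probed with a non-blocking acquire so the demonstration reports the refusal instead of hanging.

```python
import threading

class Account:
    """Hypothetical class whose locked methods call each other."""

    def __init__(self):
        self._lock = threading.RLock()
        self.balance = 0

    def deposit(self, amount):
        with self._lock:
            self.balance += amount

    def deposit_twice(self, amount):
        with self._lock:          # First acquisition
            self.deposit(amount)  # Re-acquired by the same thread: fine
            self.deposit(amount)

account = Account()
account.deposit_twice(5)
print(account.balance)  # 10

# A plain Lock refuses a second acquisition by the thread that holds it.
plain = threading.Lock()
plain.acquire()
second_try = plain.acquire(blocking=False)  # Non-blocking, so no hang
print(second_try)  # False
plain.release()
```

With a blocking acquire, that second call on the plain Lock would wait forever, which is the deadlock described above.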
Can I use a lock from a subinterpreter?
No. A threading.Lock belongs to the interpreter that created it and cannot be shared with another interpreter. Pass data between subinterpreters through cross-interpreter queues or channels instead.
What's the performance overhead of locks?
Uncontended locks cost ~100-500 nanoseconds (lock acquisition). Contended locks block, adding context-switch overhead (~1-10 microseconds). If lock contention is high, refactor to reduce critical sections or use thread-local storage.
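These figures vary with hardware, operating system, and Python build, so it is worth measuring on your own machine. A rough sketch using timeit follows; locked_noop and plain_noop are hypothetical helpers, and subtracting the bare function-call time only approximates the cost of one acquire/release pair.

```python
import threading
import timeit

lock = threading.Lock()

def locked_noop():
    with lock:
        pass

def plain_noop():
    pass

n = 200_000
locked = timeit.timeit(locked_noop, number=n)
plain = timeit.timeit(plain_noop, number=n)

# Subtract the bare call cost to approximate one acquire + release.
per_op_ns = (locked - plain) / n * 1e9
print(f"Uncontended acquire/release: about {per_op_ns:.0f} ns per operation")
```

This measures only the uncontended case on a single thread. Contended cost depends on how many threads compete and how long each one holds the lock.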
Is atomic assignment (e.g., x = y) thread-safe?
Yes, assigning to a variable is atomic. But reading it and then acting on it is not: if x: y = x + 1 reads x twice, and another thread can change x between the two reads. Use locks for compound operations.