Skip to main content

GIL-Free Data Sharing Between Subinterpreters

Data doesn't automatically cross the barrier between subinterpreters. Objects in Interpreter A are invisible to Interpreter B; passing data requires serialization. For large datasets (numpy arrays, video frames, model weights), serialization overhead is prohibitive. This article teaches zero-copy and efficient patterns I've used in production to share multi-gigabyte datasets across 32 subinterpreters without bottlenecks.

Subinterpreters share the same OS process and virtual address space, unlike multiprocessing. This is the key: large data can be mapped into each interpreter's memory without duplication. You just need the right low-level tools.

The Serialization Baseline: Channels and Pickle

Channels exchange pickled objects. Pickling is slow for large data (10-100 MB/s typical), but it's the simplest approach for small payloads.

import interpreters
import threading
import time

# Create a channel
send_id, recv_id = interpreters.create_channel()

# Producer interpreter
code_producer = f"""
import interpreters
import json

send_id = {send_id}

# Send a large dict (pickled)
data = {{"values": list(range(100_000)), "metadata": "example"}}
interpreters.channel_send(send_id, data)
"""

# Consumer interpreter
code_consumer = f"""
import interpreters

recv_id = {recv_id}
data = interpreters.channel_recv(recv_id)
print(f"Received {{len(data['values'])}} values")
"""

interp1 = interpreters.create()
interp2 = interpreters.create()

start = time.time()
t1 = threading.Thread(target=lambda: interpreters.run_string(interp1, code_producer))
t2 = threading.Thread(target=lambda: interpreters.run_string(interp2, code_consumer))
t1.start()
t2.start()
t1.join()
t2.join()

elapsed = time.time() - start
print(f"Channel exchange: {elapsed:.2f}s")

interpreters.destroy(interp1)
interpreters.destroy(interp2)

This works but is slow for multi-megabyte objects. For 100 MB, expect 1-2 seconds of serialization overhead.

Zero-Copy via Shared Memory and ctypes

Subinterpreters share the OS process's virtual address space. You can allocate memory in one interpreter and access it from another using ctypes and memory addresses.

import interpreters
import threading
import ctypes
import numpy as np

# Create shared memory region in the main interpreter
shared_array = np.zeros(1_000_000, dtype=np.float32)
array_ptr = shared_array.ctypes.data
array_size = shared_array.nbytes

print(f"Shared array: {array_ptr}, {array_size} bytes")

# Producer: fill the array
code_producer = f"""
import ctypes
import numpy as np

ptr = {array_ptr}
size = {array_size}

# Cast the raw pointer back to a numpy array (no copy)
dtype = np.float32
shape = ({array_size} // ctypes.sizeof(ctypes.c_float),)
array = np.ctypeslib.as_array(ctypes.cast(ptr, ctypes.POINTER(dtype)), shape=shape)

# Modify in-place
array[:] = np.arange(len(array), dtype=dtype)
print(f"Producer: filled array with {{len(array)}} values")
"""

# Consumer: read the array
code_consumer = f"""
import ctypes
import numpy as np

ptr = {array_ptr}
size = {array_size}

dtype = np.float32
shape = ({array_size} // ctypes.sizeof(ctypes.c_float),)
array = np.ctypeslib.as_array(ctypes.cast(ptr, ctypes.POINTER(dtype)), shape=shape)

# Read (no copy)
result = np.sum(array)
print(f"Consumer: sum = {{result}}")
"""

interp1 = interpreters.create()
interp2 = interpreters.create()

t1 = threading.Thread(target=lambda: interpreters.run_string(interp1, code_producer))
t2 = threading.Thread(target=lambda: interpreters.run_string(interp2, code_consumer))

t1.start()
t1.join() # Producer must finish first
t2.start()
t2.join()

print(f"Final array sum: {np.sum(shared_array)}")

interpreters.destroy(interp1)
interpreters.destroy(interp2)

This is zero-copy: both interpreters read/write the same memory. Latency: <1 microsecond. Caveat: synchronization is your responsibility (use locks or channels to coordinate).

Memory-Mapped Files: Scalable Shared State

For very large datasets (> 1 GB), use memory-mapped files. The OS maps a file into multiple processes' virtual address spaces without loading it into RAM. Each interpreter reads/writes independently; the OS handles caching.

import interpreters
import threading
import numpy as np
import tempfile
import os

# Create a temporary memory-mapped file
temp_file = tempfile.NamedTemporaryFile(suffix=".bin", delete=False)
filename = temp_file.name
temp_file.close()

# Initialize the file (producer interpreter will write to it)
shared_array = np.memmap(filename, dtype=np.float32, mode="w+", shape=(10_000_000,))
shared_array[:] = 0

code_producer = f"""
import numpy as np
import time

filename = "{filename}"

# Open the memory-mapped file (no copy; reads/writes go to disk/cache)
array = np.memmap(filename, dtype=np.float32, mode="r+", shape=(10_000_000,))

# Fill it (modifications are written to disk, shared with other interpreters)
array[::100] = np.arange(100_000)
print(f"Producer: wrote {{len(array)}} values to memmap")
"""

code_consumer = f"""
import numpy as np
import time

filename = "{filename}"

# Open the same file; reads see producer's writes
time.sleep(0.5) # Let producer write first
array = np.memmap(filename, dtype=np.float32, mode="r", shape=(10_000_000,))

# Read (no copy; data comes from cache or disk)
result = np.sum(array)
print(f"Consumer: sum = {{result}}")
"""

interp1 = interpreters.create()
interp2 = interpreters.create()

t1 = threading.Thread(target=lambda: interpreters.run_string(interp1, code_producer))
t2 = threading.Thread(target=lambda: interpreters.run_string(interp2, code_consumer))

t1.start()
t2.start()
t1.join()
t2.join()

interpreters.destroy(interp1)
interpreters.destroy(interp2)

# Clean up
os.unlink(filename)

Memory-mapped files scale to hundreds of GB without RAM pressure. Latency is <1 microsecond after the first access (OS caching). Downside: data persists on disk (slower than RAM), and synchronization is implicit (OS page cache).

Streaming Data via Channels: The Middle Ground

For datasets that are too large for pickle but not so large that disk I/O justifies memmap, stream via channels in chunks:

import interpreters
import threading
import time

# Create a channel for streaming
send_id, recv_id = interpreters.create_channel()

code_producer = f"""
import interpreters
import time

send_id = {send_id}

# Stream data in chunks
chunk_size = 100_000
total_chunks = 100

for i in range(total_chunks):
chunk = list(range(i * chunk_size, (i + 1) * chunk_size))
interpreters.channel_send(send_id, chunk)
if i % 10 == 0:
print(f"Sent chunk {{i}}/{{total_chunks}}")

interpreters.channel_send(send_id, None) # Sentinel: done
"""

code_consumer = f"""
import interpreters

recv_id = {recv_id}

total_values = 0
while True:
chunk = interpreters.channel_recv(recv_id)
if chunk is None:
break
total_values += len(chunk)

print(f"Received {{total_values}} total values")
"""

interp1 = interpreters.create()
interp2 = interpreters.create()

start = time.time()
t1 = threading.Thread(target=lambda: interpreters.run_string(interp1, code_producer))
t2 = threading.Thread(target=lambda: interpreters.run_string(interp2, code_consumer))

t1.start()
t2.start()
t1.join()
t2.join()

elapsed = time.time() - start
print(f"Streaming: {elapsed:.2f}s")

interpreters.destroy(interp1)
interpreters.destroy(interp2)

Streaming avoids holding the full dataset in memory during serialization. Latency is amortized across chunks.

Comparison: Performance and Trade-Offs

MethodLatencyThroughputMemoryComplexityUse Case
Pickle via channels<1 ms10-100 MB/s2x (copy)Simple<10 MB payloads
ctypes + shared memory<1 µsUnlimited1x (no copy)Medium<1 GB, tight sync
Memory-mapped files<1 µsUnlimited1x (no copy)Low>1 GB, loose sync
Streaming channels1-10 ms50-200 MB/s1x (streaming)Medium10-500 MB batches

Practical Example: Shared Model Weights

Machine learning: load a large model once, share with 8 worker subinterpreters:

import interpreters
import threading
import numpy as np

# Load model weights once (main interpreter)
model_weights = np.random.randn(100_000_000).astype(np.float32) # 400 MB
weights_ptr = model_weights.ctypes.data
weights_size = model_weights.nbytes
weights_shape = model_weights.shape

code_worker = f"""
import ctypes
import numpy as np

ptr = {weights_ptr}
shape = {weights_shape}
size = {weights_size}

# Map to array (no copy)
weights = np.ctypeslib.as_array(ctypes.cast(ptr, ctypes.POINTER(np.float32)), shape=shape)

# Inference using weights (read-only)
query = np.random.randn(1000)
result = np.dot(query, weights[:1000])
print(f"Inference result: {{result}}")
"""

# Create worker pool
workers = [interpreters.create() for _ in range(8)]

threads = [
threading.Thread(target=lambda w=w: interpreters.run_string(w, code_worker))
for w in workers
]

for t in threads:
t.start()
for t in threads:
t.join()

for w in workers:
interpreters.destroy(w)

print("All inference complete")

All 8 workers read the same 400 MB of model weights without duplication. Speedup: linear with cores.

Key Takeaways

  • Small payloads (<10 MB): use channels with pickle (simple, acceptable overhead).
  • Medium payloads (10-500 MB): stream via channels in chunks or use ctypes shared memory.
  • Large payloads (>500 MB): use memory-mapped files or ctypes with raw pointers (zero-copy, low latency).
  • Shared mutable state requires synchronization; read-only sharing (model weights, pre-loaded data) is safe.
  • Test memory usage and latency for your specific workload; theoretical limits vary with hardware.

Frequently Asked Questions

Can I share a Python list or dict directly between subinterpreters?

No. Only pickled or serialized data crosses interpreter boundaries. Lists and dicts are pickled, which copies the data. Use ctypes, numpy, or memory-mapped files for zero-copy sharing of arrays.

What if two subinterpreters modify shared memory concurrently?

Data corruption or race conditions result. Use locks (threading.Lock() in the main interpreter) to synchronize access. Or use memory-mapped files with file locking (fcntl).

Is memory-mapped file access slower than RAM?

After the first access, performance is identical (OS caching). Large datasets benefit because the OS doesn't allocate 1 GB of RAM; it pages in only what's accessed.

Can subinterpreters share Python objects via ctypes?

Only if the object is simple (int, float, array). Complex objects (instances of classes) cannot be shared directly; they must be serialized.

What's the maximum size I can share via channels?

No hard limit, but serialization is slow. For > 100 MB, use ctypes or memmap. For <10 MB, channels are fine.

Further Reading