Pickling Pitfalls: Debugging Serialization Errors
Multiprocessing must serialize arguments, return values, and queued objects using pickle. Many Python objects cannot be pickled: open file handles, lambdas, nested functions, database connections, and sockets. Pickling errors are often cryptic and hard to trace back to the offending object. This article catalogs common unpicklable objects, shows how to diagnose them, and provides solutions ranging from custom serialization to architectural refactoring.
Why Pickle is Necessary
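Processes created by multiprocessing have isolated memory. To send an object from one process to another, you must convert it to bytes (pickling), transmit it via an OS pipe, and reconstruct it (unpickling) in the other process. Most Python objects are picklable, but some are not by design. One caveat: with the fork start method, the arguments passed to Process itself are inherited by the child rather than pickled, so some of the Process examples below only fail under spawn or forkserver. Data sent through a Pool, Queue, or Pipe is always pickled.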
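The round trip described above can be reproduced by hand with pickle alone. This is a minimal sketch; the task dictionary and its keys are hypothetical and stand in for any worker argument:

```python
import pickle

# Hypothetical payload standing in for a worker argument.
task = {"name": "sum", "values": [1, 2, 3]}

# Step 1: the sending process converts the object to bytes.
payload = pickle.dumps(task)

# Step 2: the bytes travel through a pipe (omitted here).

# Step 3: the receiving process rebuilds an equal but distinct object.
restored = pickle.loads(payload)

print(type(payload).__name__)  # bytes
print(restored == task)        # True
print(restored is task)        # False
```

The last line is the important one: the worker always operates on a copy, so mutations made in the worker are not visible in the parent.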
Common Unpicklable Objects
1. Open File Handles
import multiprocessing

def process_file(file_handle):
    """FAILS: file handles are not picklable."""
    data = file_handle.read()
    return len(data)

if __name__ == "__main__":
    with open("data.txt", "r") as f:
        p = multiprocessing.Process(target=process_file, args=(f,))
        p.start()
        p.join()
    # With the spawn or forkserver start method:
    # TypeError: cannot pickle '_io.TextIOWrapper' object
Solution: Pass the filename, not the file object:
import multiprocessing

def process_file(filename):
    """OK: filename is a string (picklable)."""
    with open(filename, "r") as f:
        data = f.read()
    return len(data)

if __name__ == "__main__":
    p = multiprocessing.Process(target=process_file, args=("data.txt",))
    p.start()
    p.join()
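Whether the failing version above actually raises depends on the start method in use, so it helps to check it when reproducing a report from another platform. A small sketch using only documented multiprocessing calls; the variable name spawn_ctx is an arbitrary choice:

```python
import multiprocessing

# The platform default; the value differs by operating system and Python version.
print(multiprocessing.get_start_method())

# Start methods available on this platform; "spawn" is supported everywhere.
print(multiprocessing.get_all_start_methods())

# A context pins one start method without changing the global default.
spawn_ctx = multiprocessing.get_context("spawn")
print(spawn_ctx.get_start_method())  # spawn
```

Creating processes from spawn_ctx (for example spawn_ctx.Process(...)) forces the arguments to be pickled, which is a convenient way to reproduce on Linux the errors that users on Windows or macOS would see.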
2. Lambda Functions
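import multiprocessing

if __name__ == "__main__":
    pool = multiprocessing.Pool()
    # FAILS: lambdas are not picklable
    results = pool.map(lambda x: x ** 2, range(10))
    # _pickle.PicklingError: Can't pickle <function <lambda> at 0x...>:
    # attribute lookup <lambda> on __main__ failed
    # (exact wording varies by Python version)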
Solution: Use a regular function defined at module level:
import multiprocessing

def square(x):
    return x ** 2

if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        results = pool.map(square, range(10))
    print(results)
Or use a callable class:
import multiprocessing

class Squarer:
    def __call__(self, x):
        return x ** 2

if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        squarer = Squarer()
        results = pool.map(squarer, range(10))
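When a lambda exists only to bind an extra argument, functools.partial applied to an importable function is a third option, because a partial pickles as a reference to the function plus its bound arguments. A minimal sketch using pickle directly; the name double is illustrative and operator.mul stands in for any module-level function:

```python
import functools
import operator
import pickle

# A partial of an importable function pickles by reference plus bound arguments.
double = functools.partial(operator.mul, 2)
restored = pickle.loads(pickle.dumps(double))
print(restored(5))  # 10

# The equivalent lambda cannot be pickled.
try:
    pickle.dumps(lambda x: 2 * x)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False
print(lambda_pickled)  # False
```

The same double object can be passed to pool.map in place of a lambda.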
3. Local Functions (Nested Functions)
import multiprocessing

def outer():
    def inner(x):
        return x ** 2

    pool = multiprocessing.Pool()
    results = pool.map(inner, range(10))  # FAILS

if __name__ == "__main__":
    outer()
    # AttributeError: Can't pickle local object 'outer.<locals>.inner'
    # (exact exception type and wording vary by Python version)
Solution: Define the function at module level:
import multiprocessing

def inner(x):
    return x ** 2

if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        results = pool.map(inner, range(10))
4. Database Connections and Sockets
import multiprocessing
import sqlite3

def query_db(conn):
    """FAILS: SQLite connections are not picklable."""
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM users")
    return cursor.fetchall()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    p = multiprocessing.Process(target=query_db, args=(conn,))
    p.start()
    p.join()
    # TypeError: cannot pickle 'sqlite3.Connection' object
Solution: Pass connection parameters; let the worker open its own connection:
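import multiprocessing
import sqlite3

def query_db(db_file):
    """OK: open connection in worker."""
    conn = sqlite3.connect(db_file)
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT * FROM users")
        return cursor.fetchall()
    finally:
        conn.close()

if __name__ == "__main__":
    # "app.db" is a placeholder path to an existing database with a users table.
    # An in-memory database (":memory:") would not work here: each connection
    # to ":memory:" creates a fresh, empty database.
    p = multiprocessing.Process(target=query_db, args=("app.db",))
    p.start()
    p.join()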
5. Circular References in Objects
import pickle

class Node:
    def __init__(self, name):
        self.name = name
        self.parent = None

if __name__ == "__main__":
    # Create a circular reference
    node_a = Node("A")
    node_b = Node("B")
    node_a.parent = node_b
    node_b.parent = node_a  # Circular

    # OK: pickle tracks objects it has already seen, so cycles round-trip
    restored = pickle.loads(pickle.dumps(node_a))
    print(restored.parent.parent is restored)  # True
Pickle handles cycles correctly, so circular references alone do not cause errors. What can fail is very deep nesting (for example, a long linked list of objects), which can raise RecursionError during pickling. If you hit that, refactor to use IDs instead of direct object references.
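The cycle handling can be checked without any custom class. The sketch below mirrors the Node example with two plain dictionaries that point at each other; the key names are illustrative:

```python
import pickle

# Two dicts that reference each other, mirroring the Node example.
node_a = {"name": "A", "parent": None}
node_b = {"name": "B", "parent": node_a}
node_a["parent"] = node_b  # closes the cycle

restored_a = pickle.loads(pickle.dumps(node_a))
restored_b = restored_a["parent"]

# The cycle is rebuilt: following parent twice returns the same object.
print(restored_b["parent"] is restored_a)  # True

# The copies are independent of the originals.
print(restored_a is node_a)  # False
```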
6. Lambda Stored in a Class
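import multiprocessing

class Config:
    def __init__(self):
        # FAILS: lambda attribute is not picklable
        self.transform = lambda x: x ** 2

# Defined at module level so the child process can import it
def worker(cfg):
    return cfg.transform(10)

if __name__ == "__main__":
    config = Config()
    p = multiprocessing.Process(target=worker, args=(config,))
    p.start()
    p.join()
    # With the spawn or forkserver start method:
    # AttributeError: Can't pickle local object 'Config.__init__.<locals>.<lambda>'
    # (exact exception type and wording vary by Python version)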
Solution: Replace the lambda attribute with a method:
import multiprocessing

class Config:
    def transform(self, x):
        return x ** 2

# Defined at module level so the child process can import it
def worker(cfg):
    return cfg.transform(10)

if __name__ == "__main__":
    config = Config()
    p = multiprocessing.Process(target=worker, args=(config,))
    p.start()
    p.join()  # OK
Diagnosing Pickling Errors
When you hit an error, test the object with pickle directly:
import pickle

# Test if an object is picklable
obj = some_object  # placeholder: substitute the object you want to test
try:
    serialized = pickle.dumps(obj)
    deserialized = pickle.loads(serialized)
    print("OK: object is picklable")
except Exception as e:
    print(f"NOT picklable: {e}")
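When the top-level object fails, the next question is usually which attribute is responsible. One possible approach is to try each attribute separately. In the sketch below, the helper find_unpicklable_attrs and the settings object are hypothetical names introduced for illustration, and the helper only inspects one level of attributes:

```python
import pickle
import threading
from types import SimpleNamespace


def find_unpicklable_attrs(obj):
    """Return {attribute name: error message} for attributes that fail to pickle."""
    failures = {}
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception as exc:  # pickle raises several exception types
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


# Hypothetical settings object mixing picklable and unpicklable attributes.
settings = SimpleNamespace(
    host="localhost",
    retries=3,
    lock=threading.Lock(),       # locks cannot be pickled
    transform=lambda x: x ** 2,  # lambdas cannot be pickled
)

for attr, error in find_unpicklable_attrs(settings).items():
    print(f"{attr}: {error}")
```

For nested objects, the same helper can be applied again to each failing attribute until the offending leaf is found.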
For complex objects, use pickletools to inspect:
import pickle
import pickletools
import io
obj = {"data": [1, 2, 3], "nested": {"key": "value"}}
serialized = pickle.dumps(obj)
# Disassemble the pickle bytecode
pickletools.dis(io.BytesIO(serialized))
Custom Serialization: __getstate__ and __setstate__
For objects that are not directly picklable, implement custom serialization:
import pickle
import multiprocessing

class DatabaseConnection:
    def __init__(self, host, port):
        self.host = host
        self.port = port
        self.conn = None  # Not picklable

    def connect(self):
        """Establish connection (called after unpickling)."""
        self.conn = f"Connected to {self.host}:{self.port}"

    def __getstate__(self):
        """Return picklable state (exclude conn)."""
        state = self.__dict__.copy()
        del state['conn']  # Remove unpicklable object
        return state

    def __setstate__(self, state):
        """Restore state after unpickling."""
        self.__dict__.update(state)
        self.connect()  # Recreate connection

def use_connection(db_obj):
    """Worker uses the connection."""
    print(f"Connection status: {db_obj.conn}")

if __name__ == "__main__":
    db = DatabaseConnection("localhost", 5432)
    db.connect()
    p = multiprocessing.Process(target=use_connection, args=(db,))
    p.start()
    p.join()
    # Output: Connection status: Connected to localhost:5432
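Because the example above simulates the connection with a string, it would also pickle without the two hooks. The following sketch makes the effect visible by holding a genuinely unpicklable attribute (a threading.Lock) and exercising the hooks with pickle directly. The class CachedResource and its attribute names are hypothetical:

```python
import pickle
import threading


class CachedResource:
    """Hypothetical class holding a handle that must not be pickled."""

    def __init__(self, path):
        self.path = path
        self.handle = None

    def open(self):
        # Stand-in for acquiring a real resource such as a socket or file.
        self.handle = threading.Lock()

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("handle", None)  # drop the unpicklable attribute
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.open()  # recreate the handle on the receiving side


original = CachedResource("data.bin")
original.open()

clone = pickle.loads(pickle.dumps(original))

print(clone.path)                       # data.bin
print(clone.handle is not None)         # True
print(clone.handle is original.handle)  # False
```

Testing the hooks this way is independent of the start method, so it behaves the same on every platform.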
Alternative: Use __reduce_ex__ for Complex Objects
For fine-grained control, implement __reduce_ex__:
import pickle

class CustomObject:
    def __init__(self, value):
        self.value = value

    def __reduce_ex__(self, protocol):
        """Return a tuple (callable, args) to reconstruct the object."""
        return (self.__class__, (self.value,))

if __name__ == "__main__":
    obj = CustomObject(42)
    serialized = pickle.dumps(obj)
    restored = pickle.loads(serialized)
    print(restored.value)  # 42
Strategy: Avoid Serializing Complex Objects
The simplest solution is often architectural refactoring:
Bad: Pass a complex Config object to a worker:
def worker(config):
    # Many fields of config are unpicklable
    return config.database.query(config.query_string)
Good: Extract only what the worker needs:
import multiprocessing

def worker(db_host, db_port, query_string):
    # Worker opens its own connection.
    # open_connection is a placeholder for your database driver's connect call.
    db = open_connection(db_host, db_port)
    return db.query(query_string)

if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        results = pool.starmap(worker, [
            ("localhost", 5432, "SELECT *"),
            ("localhost", 5432, "SELECT id FROM users"),
        ])
Real-World Example: Processing with Worker Initialization
A practical pattern for workers that need expensive setup (e.g., loading a large model):
import multiprocessing

# Global variable, set once in each worker process
_model = None

class DummyModel:
    """Stand-in for a real model; replace with your own loader's result."""
    def __init__(self, model_file):
        self.model_file = model_file

    def predict(self, batch):
        return sum(batch)

def load_model(model_file):
    """Simulate model loading."""
    return DummyModel(model_file)

def initialize_worker(model_file):
    """Called once per worker on startup."""
    global _model
    _model = load_model(model_file)  # Expensive operation

def process_batch(batch):
    """Worker uses the pre-loaded model."""
    return _model.predict(batch)

if __name__ == "__main__":
    with multiprocessing.Pool(processes=4, initializer=initialize_worker,
                              initargs=("model.pkl",)) as pool:
        batches = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
        results = pool.map(process_batch, batches)
    print(results)  # [6, 15, 24]
This approach avoids serializing the model; instead, each worker loads it independently.
Key Takeaways
- Pickle cannot serialize open files, sockets, lambdas, local functions, and database connections.
- To debug, test with pickle.dumps() directly.
- For unpicklable objects, either (1) pass parameters and recreate the object in the worker, (2) implement __getstate__/__setstate__, or (3) refactor to avoid serialization.
- Use initializer and initargs in Pool to set up workers with expensive resources (models, connections).
- Avoid passing complex objects; prefer simple built-in types (strings, ints, lists).
Frequently Asked Questions
Can I use dill instead of pickle?
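Yes, with a caveat. dill extends pickle and handles more objects (lambdas, nested functions). Install with pip install dill. However, the standard multiprocessing module has no switch for swapping its serializer, and multiprocessing.set_start_method only accepts 'fork', 'spawn', or 'forkserver'. The usual route is the multiprocess package (pip install multiprocess), a fork of multiprocessing by the dill authors that uses dill for serialization. Trade-off: slower and larger serialization.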
What if I absolutely need to pass an unpicklable object?
Refactor to pass identifiers (e.g., DB connection ID) and have the worker fetch the actual object. Or use multiprocessing.Manager for a centralized object server.
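Why does pickle fail on objects from __main__?
Pickle stores functions and classes by reference: the module name plus the qualified name. The worker must be able to import that module and find the name in it. Under the spawn start method, the child re-imports the main script without running the code under if __name__ == "__main__":, so anything defined inside that guard (or typed into an interactive session) does not exist in the worker. Solution: define classes and functions at module level, ideally in an importable module rather than inside the guarded block.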
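The by-reference behavior is easy to observe on a standard-library function. In this short sketch, json.dumps is used only because it is always importable:

```python
import json
import pickle

# Pickling a function stores its module and qualified name, not its code.
payload = pickle.dumps(json.dumps)

print(b"json" in payload)   # True
print(b"dumps" in payload)  # True

# Unpickling imports the module and looks the name up again,
# so the result is the very same function object.
print(pickle.loads(payload) is json.dumps)  # True
```

If the receiving process cannot import the module or the name is missing from it, that lookup is the step that fails.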
Is there a performance cost to __getstate__?
Negligible. If you're optimizing serialization, focus on object size (bytes), not the pickling function itself.
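Payload size can be measured directly before deciding whether an argument is worth optimizing. A minimal sketch; pickled_size is a hypothetical helper name and the two lists are arbitrary sample data:

```python
import pickle


def pickled_size(obj):
    """Return the number of bytes pickle produces for obj."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


small = list(range(10))
large = list(range(100_000))

print(f"small: {pickled_size(small)} bytes")
print(f"large: {pickled_size(large)} bytes")
```

Every task sent to a worker pays this cost once on the way in, and the result pays it again on the way out.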
Can I pickle NumPy arrays?
Yes, NumPy arrays are fully picklable. For large arrays, shared memory (see Article 6) is faster.
Further Reading
- Python pickle module documentation — official reference and protocol versions.
- dill: extending pickle — handles lambdas and closures.
- multiprocessing initializer pattern — pool initialization to avoid serialization.
- IPC patterns — when to use Queues vs. Pipes to minimize serialization.