
Chunking Strategies for Efficient Batch Processing

Chunking divides a large iterable into batches, reducing inter-process communication (IPC) overhead and scheduling latency. For a workload with 1 million tasks and 4 workers, submitting tasks one at a time puts 1 million task messages through the IPC queue; chunking into batches of 1,000 tasks cuts that to 1,000 messages. This article explains chunking strategies, formulas for optimal chunk sizes, and how to balance throughput against resource usage.

Why Chunking Matters

When you call pool.map(func, items), the pool divides items into chunks and hands them out to workers one chunk at a time. Each worker processes all items in its chunk sequentially, then takes the next available chunk. Chunk size directly impacts performance:

  • Too small (size=1): Every item generates an IPC message; scheduling overhead dominates.
  • Too large (size=1,000,000): One worker gets all tasks; others sit idle (load imbalance).
  • Just right: Workers stay busy; IPC overhead is negligible.
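
To make the trade-off concrete, the sketch below counts how many chunks a given chunksize produces for 1 million items. The helper name count_chunks is illustrative and not part of multiprocessing, and the sketch assumes roughly one task message per chunk.

```python
import math

def count_chunks(total_items: int, chunksize: int) -> int:
    """Number of chunks needed to cover total_items (the last chunk may be short)."""
    return math.ceil(total_items / chunksize)

# Compare the two extremes from the list above with a middle value.
for size in (1, 1_000, 1_000_000):
    print(f"chunksize={size:>9,}: {count_chunks(1_000_000, size):>9,} chunks")
```

With chunksize=1 there are as many chunks as items; with chunksize equal to the input length there is a single chunk, so only one worker can be busy.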

Typical throughput difference:

Chunksize      1:  50,000 tasks/sec (high overhead)
Chunksize  1,000: 500,000 tasks/sec (amortized cost)
Chunksize 10,000: 800,000 tasks/sec (nearly optimal)

Understanding Chunksize

When you call pool.map(func, iterable, chunksize=N), the pool:

  1. Divides iterable into chunks of size N.
  2. Assigns one chunk to each worker initially.
  3. When a worker finishes its chunk, it requests the next.

import multiprocessing

def quick_task(x):
    return x ** 2

if __name__ == "__main__":
    items = range(1_000_000)

    with multiprocessing.Pool(processes=4) as pool:
        # Default chunksize for map(): ceil(len(items) / (4 workers * 4)) = 62,500
        results_default = pool.map(quick_task, items)

        # Explicit chunksize: 1,000 items per batch
        results_custom = pool.map(quick_task, items, chunksize=1_000)
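
The splitting in step 1 can be pictured with a small pure-Python sketch. split_into_chunks is a hypothetical helper written for illustration; the pool's internal splitting is an implementation detail and may differ.

```python
from itertools import islice

def split_into_chunks(iterable, chunksize):
    """Yield successive tuples of up to chunksize items."""
    it = iter(iterable)
    while True:
        chunk = tuple(islice(it, chunksize))
        if not chunk:
            return  # input exhausted
        yield chunk

chunks = list(split_into_chunks(range(10), 4))
print(chunks)  # [(0, 1, 2, 3), (4, 5, 6, 7), (8, 9)]
```

The last chunk is shorter when the input length is not a multiple of the chunksize.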

Chunking Formulas

For CPU-Bound, Uniform Tasks

If all tasks take roughly the same time:

chunksize = max(1, total_items // (num_workers * 4))

This formula ensures each worker gets about 4 batches, allowing load rebalancing if some tasks are slightly slower.

Example: 1 million items, 4 workers:

chunksize = max(1, 1_000_000 // (4 * 4)) = 62,500

For Variable-Duration Tasks

Use smaller chunks to improve load balancing:

chunksize = max(1, total_items // (num_workers * 16))

With about 16 batches per worker, a worker that finishes early picks up another batch instead of idling, so a few slow tasks do not stall the whole job.

For I/O-Bound Tasks

Larger chunks reduce overhead, but workers may block on I/O. Use:

chunksize = max(1, total_items // num_workers)

One chunk per worker; low overhead, acceptable if IPC serialization is not the bottleneck.
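
The three formulas differ only in how many batches each worker should receive, so they can be folded into one helper. This is a minimal sketch; suggest_chunksize and its batches_per_worker parameter are names invented here, not part of any library.

```python
def suggest_chunksize(total_items: int, num_workers: int, batches_per_worker: int = 4) -> int:
    """Apply max(1, total_items // (num_workers * batches_per_worker))."""
    return max(1, total_items // (num_workers * batches_per_worker))

# 1 million items, 4 workers
print(suggest_chunksize(1_000_000, 4, batches_per_worker=4))   # uniform CPU-bound: 62500
print(suggest_chunksize(1_000_000, 4, batches_per_worker=16))  # variable duration: 15625
print(suggest_chunksize(1_000_000, 4, batches_per_worker=1))   # I/O-bound: 250000
```

Treat the result as a starting point for benchmarking rather than a final answer.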

Benchmark: Chunksize Impact

import multiprocessing
import time

def cpu_bound_task(x):
    """Simulate CPU work (x is ignored; every task costs the same)."""
    return sum(i ** 2 for i in range(100))

if __name__ == "__main__":
    items = range(100_000)

    chunk_sizes = [1, 10, 100, 1_000, 10_000, 100_000]

    for chunksize in chunk_sizes:
        start = time.perf_counter()

        # Pool startup and teardown are included in the timing.
        with multiprocessing.Pool(processes=4) as pool:
            results = pool.map(cpu_bound_task, items, chunksize=chunksize)

        elapsed = time.perf_counter() - start
        throughput = len(items) / elapsed

        print(f"Chunksize {chunksize:>7,}: {elapsed:.2f}s ({throughput:,.0f} tasks/sec)")

Typical output (4-core system):

Chunksize       1: 4.23s (23,640 tasks/sec)
Chunksize      10: 0.93s (107,527 tasks/sec)
Chunksize     100: 0.29s (344,827 tasks/sec)
Chunksize   1,000: 0.28s (357,142 tasks/sec)
Chunksize  10,000: 0.28s (357,142 tasks/sec)
Chunksize 100,000: 0.31s (322,580 tasks/sec)  <- load imbalance

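Sweet spot: throughput is flat from chunksize 100 through 10,000. A value of 1,000 sits in the middle of that plateau, well clear of both per-item overhead and load imbalance.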

Advanced: Adaptive Chunking for Heterogeneous Tasks

For tasks with varying durations, use imap_unordered() with small chunks:

import multiprocessing
import time
import random

def variable_task(x):
    """Tasks take 0–10 ms."""
    time.sleep(random.uniform(0, 0.01))
    return x ** 2

if __name__ == "__main__":
    items = range(10_000)

    with multiprocessing.Pool(processes=4) as pool:
        # Small chunksize: better load balancing
        for result in pool.imap_unordered(variable_task, items, chunksize=100):
            pass  # Process results as they arrive

Unlike map(), which blocks until every task has finished and returns a list in submission order, imap_unordered() yields results as each chunk completes. For heterogeneous workloads this lets you start processing fast results immediately rather than waiting behind slow ones.
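
If completion-order results still need to be matched to their inputs, one option is to carry the input index through the worker. The sketch below shows one way to do that; indexed_task and restore_order are hypothetical names, and the pool size and chunksize are arbitrary.

```python
import multiprocessing

def indexed_task(pair):
    """Take (index, value) and return (index, result) so order can be rebuilt."""
    index, value = pair
    return index, value ** 2

def restore_order(indexed_results):
    """Sort (index, result) pairs by index and drop the index."""
    return [result for _, result in sorted(indexed_results)]

if __name__ == "__main__":
    items = [3, 1, 4, 1, 5, 9, 2, 6]
    with multiprocessing.Pool(processes=2) as pool:
        # Results may arrive in any order; each one carries its input index.
        unordered = list(pool.imap_unordered(indexed_task, enumerate(items), chunksize=2))
    print(restore_order(unordered))  # squares, in the same order as items
```

Sorting costs little next to the work itself, and it lets you keep the scheduling benefits of imap_unordered() while still producing ordered output at the end.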

Real-World Example: Data Processing Pipeline

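Suppose you have 1 million images to resize, and each resize takes 50–100 ms. The example below simulates that workload with 1,000 dummy payloads so that it finishes quickly.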

import multiprocessing
import time
import random

def resize_image(image_data):
    """Simulate an image resize that takes 50–100 ms."""
    time.sleep(random.uniform(0.05, 0.1))  # Stand-in for the real resize work
    return f"Resized: {len(image_data)} bytes"

if __name__ == "__main__":
    # Generate fake image data (1,000 payloads of 16,000 bytes each)
    images = [b"dummy_image_data" * 1000 for _ in range(1000)]

    # Strategy 1: chunksize=1 (the imap() default; one IPC message per item)
    start = time.perf_counter()
    with multiprocessing.Pool(processes=8) as pool:
        results = list(pool.imap(resize_image, images, chunksize=1))
    elapsed1 = time.perf_counter() - start

    # Strategy 2: chunksize=10
    start = time.perf_counter()
    with multiprocessing.Pool(processes=8) as pool:
        results = list(pool.imap(resize_image, images, chunksize=10))
    elapsed2 = time.perf_counter() - start

    # Strategy 3: imap_unordered for load balancing
    start = time.perf_counter()
    with multiprocessing.Pool(processes=8) as pool:
        results = list(pool.imap_unordered(resize_image, images, chunksize=10))
    elapsed3 = time.perf_counter() - start

    print(f"Chunksize=1: {elapsed1:.2f}s")
    print(f"Chunksize=10: {elapsed2:.2f}s")
    print(f"imap_unordered: {elapsed3:.2f}s")
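    # Typical result: chunksize=10 and imap_unordered are similar to each other.
    # The gain over chunksize=1 shrinks as tasks get longer, because per-item IPC
    # cost is small next to 50–100 ms of work; measure it on your own hardware.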

Tuning: Cost-Benefit Analysis

For a workload, measure the break-even point:

import multiprocessing
import time

def task(x):
    return x ** 2

def benchmark(chunksize, num_items=100_000):
    """Time one pool.map() run, including pool startup."""
    start = time.perf_counter()
    with multiprocessing.Pool(processes=4) as pool:
        pool.map(task, range(num_items), chunksize=chunksize)
    return time.perf_counter() - start

if __name__ == "__main__":
    # Time every candidate first so overhead is measured against the true best.
    timings = {cs: benchmark(cs) for cs in [1, 10, 50, 100, 500, 1000, 5000]}
    best_time = min(timings.values())

    print("Chunksize | Time (s) | Overhead % (vs best)")
    for chunksize, elapsed in timings.items():
        overhead = (elapsed / best_time - 1) * 100
        print(f"{chunksize:>9} | {elapsed:>8.3f} | {overhead:>6.1f}%")

Run this on your target hardware to find the sweet spot for your specific workload.
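
Once you have timings, choosing a value can be automated. The sketch below picks the smallest chunksize whose time is within a tolerance of the best, on the assumption that smaller chunks leave more room for load balancing. pick_chunksize and the 5% tolerance are illustrative choices; the sample timings are the ones from the benchmark output earlier in this article.

```python
def pick_chunksize(timings: dict[int, float], tolerance: float = 0.05) -> int:
    """Return the smallest chunksize whose time is within tolerance of the best."""
    best = min(timings.values())
    within = [cs for cs, t in timings.items() if t <= best * (1 + tolerance)]
    return min(within)

# Elapsed seconds per chunksize, copied from the benchmark output above.
sample = {1: 4.23, 10: 0.93, 100: 0.29, 1_000: 0.28, 10_000: 0.28, 100_000: 0.31}
print(pick_chunksize(sample))  # 100
```

A tighter tolerance pushes the choice toward the fastest measured value; a looser one favors smaller chunks.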

Common Pitfall: Chunksize Larger than Iterable

If chunksize exceeds the iterable length, the whole input becomes a single chunk, so one worker processes everything while the others sit idle:

import multiprocessing

def square(x):
    # A named module-level function: Pool cannot pickle a lambda.
    return x ** 2

if __name__ == "__main__":
    items = range(100)

    # BAD: chunksize=1000 > 100 items
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(square, items, chunksize=1000)
        # One worker gets all 100 items; the other three get nothing and idle

Always ensure chunksize is reasonable relative to the input size.
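
One way to guard against this is to cap the requested chunksize before passing it to the pool. The helper below is a sketch under that assumption; clamp_chunksize is a hypothetical name, and the cap of one chunk per worker is the loosest reasonable bound, not a recommended target.

```python
import math

def clamp_chunksize(requested: int, total_items: int, num_workers: int) -> int:
    """Cap chunksize so there are at least num_workers chunks when possible."""
    upper = max(1, math.ceil(total_items / num_workers))
    return max(1, min(requested, upper))

print(clamp_chunksize(1000, total_items=100, num_workers=4))  # 25
print(clamp_chunksize(10, total_items=100, num_workers=4))    # 10
```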

Key Takeaways

  • Chunksize divides the iterable into batches, reducing IPC overhead per task.
  • Formula for CPU-bound tasks: chunksize = max(1, total_items // (num_workers * 4)).
  • For variable-duration tasks, use imap_unordered() with small chunksize (10–100) for better load balancing.
  • Benchmark your specific workload; for short tasks the typical sweet spot is 1,000–10,000 items per chunk.
  • Very small chunks (size=1) cause scheduling overhead; very large chunks (size = all items) cause load imbalance.
  • Measure before and after chunking optimization; 2–10x throughput improvements are common.

Frequently Asked Questions

What's the default chunksize?

If not specified, Pool.map() uses roughly len(iterable) / (num_processes * 4), rounded up, which works well for most workloads. imap() and imap_unordered() default to chunksize=1.
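
For reference, the calculation Pool.map() performs in CPython at the time of writing can be reproduced as below. This is an implementation detail rather than a documented guarantee, and default_map_chunksize is a name made up for this sketch.

```python
def default_map_chunksize(num_items: int, num_workers: int) -> int:
    """Ceiling of num_items / (num_workers * 4); 0 for an empty input."""
    if num_items == 0:
        return 0
    chunksize, extra = divmod(num_items, num_workers * 4)
    if extra:
        chunksize += 1  # round up so the last partial chunk is covered
    return chunksize

print(default_map_chunksize(1_000_000, 4))  # 62500
print(default_map_chunksize(100, 4))        # 7
```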

Does chunksize affect result order in map()?

No, pool.map() always returns results in submission order, regardless of chunksize. Use imap_unordered() if you want results in completion order.

Can I dynamically adjust chunksize?

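Not within a single call: chunksize is fixed when you call map() or imap(). You can split the input yourself and use a different chunksize for each call, but that is typically not worth the complexity. Profile once, set chunksize, and move on. If task durations vary, use imap_unordered() with small chunks so idle workers pick up the remaining work.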

How do I choose between Pool and ProcessPoolExecutor for chunking?

Both support chunksize (ProcessPoolExecutor via map()). ProcessPoolExecutor is a reasonable choice for modern code, but its map() defaults to chunksize=1, so pass chunksize explicitly when you have many short tasks.
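
A minimal sketch of passing chunksize to ProcessPoolExecutor.map() follows. The worker count and the chunksize of 500 are arbitrary values chosen for illustration, not recommendations.

```python
from concurrent.futures import ProcessPoolExecutor

def square(x):
    return x ** 2

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as executor:
        # Without chunksize, each item is submitted to the workers individually.
        results = list(executor.map(square, range(10_000), chunksize=500))
    print(results[:5])  # [0, 1, 4, 9, 16]
```

As with Pool.map(), results come back in submission order regardless of the chunksize.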

Is there a maximum useful chunksize?

Yes: beyond about len(iterable) // num_workers there are fewer chunks than workers, so some workers go idle. Keep chunksize small enough that each worker gets multiple batches.

Further Reading