CPU-Only LLM Inference Optimization
CPU inference is typically 10–20× slower than GPU inference, but it runs on any laptop or server without a dedicated accelerator. For applications that prioritize accessibility over speed, or for small models (3B–7B quantized) with low batch sizes, CPU-only inference is viable.
This tutorial covers enabling multithreading for CPU utilization, quantizing aggressively to reduce memory, using ONNX Runtime for faster execution, and profiling bottlenecks. By the end, you'll optimize a 3B or 7B model to run at 20–30 tokens/second on modern CPUs.
Understanding CPU Bottlenecks
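CPUs are efficient at sequential computation but have far less parallel throughput than GPUs for the matrix multiplications that dominate neural networks. Inference that ends up on one or two threads uses only a fraction of your 8–16 cores. PyTorch normally starts with several intra-op threads, but container CPU limits, environment variables such as OMP_NUM_THREADS, or hyperthreading can leave the default far from optimal. The solution: set the thread count explicitly so that work is distributed across the physical cores.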
CPU inference timeline for a 7B model generating 100 tokens:
- Single-threaded: 8–10 seconds (one core fully loaded)
- 8-threaded: 1.5–2 seconds (all cores utilized)
- Quantized + 8-threaded: 0.8–1 second
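The timeline above can be read as a speed-up and a scaling efficiency. The sketch below is plain arithmetic on the quoted figures (10 s single-threaded versus 2 s with 8 threads, the slow end of each range); the function names are illustrative and not part of any library.

```python
def speedup(baseline_seconds: float, optimized_seconds: float) -> float:
    """How many times faster the optimized run is than the baseline."""
    if baseline_seconds <= 0 or optimized_seconds <= 0:
        raise ValueError("timings must be positive")
    return baseline_seconds / optimized_seconds


def parallel_efficiency(
    baseline_seconds: float, threaded_seconds: float, num_threads: int
) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    return speedup(baseline_seconds, threaded_seconds) / num_threads


# Slow end of each range quoted above: 10 s single-threaded, 2 s with 8 threads.
s = speedup(10.0, 2.0)
e = parallel_efficiency(10.0, 2.0, 8)
print(f"speedup: {s:.1f}x, efficiency: {e:.3f}")
```

An efficiency well below 1.0 is expected: token generation is largely memory-bandwidth bound, so adding threads gives diminishing returns.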
Enabling Multithreading in PyTorch
PyTorch picks a default intra-op thread count at startup (check it with torch.get_num_threads()). Set it explicitly to the number of physical cores (not logical threads):
import os
import time

import torch

# Set thread counts BEFORE loading the model.
# Half of the logical cores approximates the physical core count on hyperthreaded CPUs.
num_threads = max(1, (os.cpu_count() or 2) // 2)
torch.set_num_threads(num_threads)   # intra-op parallelism (inside matmuls, etc.)
torch.set_num_interop_threads(1)     # inter-op parallelism (independent operators)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/Phi-3-mini-4k-instruct"  # Small 3.8B model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "What is Python?"
inputs = tokenizer(prompt, return_tensors="pt")

start = time.perf_counter()
with torch.no_grad():
    # max_new_tokens counts only generated tokens (max_length would include the prompt)
    output_ids = model.generate(**inputs, max_new_tokens=100)
elapsed = time.perf_counter() - start

response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(f"Generated in {elapsed:.2f}s")
print(response)
Set this BEFORE loading the model. On a machine with 8 cores:
- torch.set_num_threads(4): 4 workers (good balance)
- torch.set_num_threads(8): 8 workers (maximum, may be slower due to contention)
Benchmark to find the sweet spot for your CPU.
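One way to run that benchmark is a small sweep over candidate thread counts. The harness below is a sketch using only the standard library: sweep_thread_counts and best_thread_count are hypothetical helper names, and the demo uses a stand-in workload so it runs without a model. With PyTorch you would pass torch.set_num_threads as set_threads and a run_once that calls model.generate on a fixed prompt.

```python
import time
from typing import Callable, Dict, Iterable


def sweep_thread_counts(
    run_once: Callable[[], object],
    set_threads: Callable[[int], object],
    candidates: Iterable[int],
    repeats: int = 3,
) -> Dict[int, float]:
    """Return the best wall-clock time (seconds) observed for each thread count."""
    results: Dict[int, float] = {}
    for n in candidates:
        set_threads(n)
        run_once()  # warm-up run, not timed
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_once()
            timings.append(time.perf_counter() - start)
        results[n] = min(timings)
    return results


def best_thread_count(results: Dict[int, float]) -> int:
    """Pick the thread count with the lowest measured time."""
    return min(results, key=results.get)


# Stand-in workload and a no-op setter, purely so the sketch is runnable anywhere.
demo = sweep_thread_counts(
    run_once=lambda: sum(i * i for i in range(20_000)),
    set_threads=lambda n: None,
    candidates=[1, 2, 4, 8],
)
print("fastest setting in this demo:", best_thread_count(demo))
```

Taking the minimum over a few repeats reduces noise from other processes; for generation workloads, keep the prompt and token budget fixed across settings.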
Quantization for CPU
On CPU, aggressive quantization (4-bit or 5-bit) is more important than on GPU because CPU memory is limited and bandwidth is precious:
pip install bitsandbytes
import os
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Set thread count before loading the model
torch.set_num_threads(max(1, (os.cpu_count() or 2) // 2))

model_name = "microsoft/Phi-3-mini-4k-instruct"

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 tends to be better supported than float16 on CPU
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",
)

# Load model in 4-bit on CPU.
# Note: this needs a bitsandbytes release with CPU support; older releases are CUDA-only.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="cpu",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Explain quantum computing."
inputs = tokenizer(prompt, return_tensors="pt")

start = time.perf_counter()
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=100)
elapsed = time.perf_counter() - start

response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
# Count only newly generated tokens, not the prompt
new_tokens = output_ids.shape[-1] - inputs["input_ids"].shape[-1]
print(f"Generated {new_tokens} tokens in {elapsed:.2f}s")
Memory savings on CPU with 4-bit quantization:
- Phi-3 3.8B: 7.6 GB → 1.9 GB
- Mistral-7B: 14 GB → 3.5 GB
At these sizes, a quantized 7B model fits on a laptop with 8 GB of RAM (with little headroom for anything else) and runs comfortably on a desktop with 16 GB.
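The figures above follow from a simple estimate: parameter count times bits per weight. The sketch below reproduces them under two assumptions that the text implies but does not state: the unquantized baseline uses 16-bit weights, and 1 GB means 10^9 bytes. The function name is illustrative. Real memory use is somewhat higher because of quantization constants, activations, and the KV cache.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return num_params * bits_per_weight / 8 / 1e9


# Nominal parameter counts as used in the text above.
for name, params in [("Phi-3 3.8B", 3.8e9), ("Mistral-7B", 7.0e9)]:
    fp16 = weight_memory_gb(params, 16)
    int4 = weight_memory_gb(params, 4)
    print(f"{name}: {fp16:.1f} GB -> {int4:.1f} GB")
```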
Using ONNX Runtime for Faster CPU Inference
ONNX Runtime runs an exported model graph through graph-level optimizations and CPU-optimized kernels, which can give 2–4× speed-ups over PyTorch:
pip install "optimum[onnxruntime]"
Convert a model to ONNX and load it:
import time

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_name = "microsoft/Phi-3-mini-4k-instruct"

# export=True converts the PyTorch checkpoint to ONNX at load time.
# To avoid re-exporting on every run, save the result with model.save_pretrained(...)
# and load from that directory afterwards.
model = ORTModelForCausalLM.from_pretrained(
    model_name,
    export=True,  # replaces the deprecated from_transformers=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "What is machine learning?"
inputs = tokenizer(prompt, return_tensors="pt")

start = time.perf_counter()
output_ids = model.generate(**inputs, max_new_tokens=100)
elapsed = time.perf_counter() - start

response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(f"ONNX inference: {elapsed:.2f}s")
print(response)
ONNX Runtime uses graph optimization, kernel fusion, and operator-specific implementations. For CPU inference with Hugging Face models, it is often one of the fastest options available.
Batch Processing on CPU
Batching increases throughput, but each extra sequence adds KV-cache and activation memory on top of the model weights. On CPU with 16 GB RAM:
import os
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Set thread count before loading the model
torch.set_num_threads(max(1, (os.cpu_count() or 2) // 2))

model_name = "microsoft/Phi-3-mini-4k-instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Batch size 2 (keep batches small on CPU; memory and latency grow with batch size)
prompts = [
    "What is Python?",
    "Explain neural networks.",
]

# Decoder-only models should be left-padded for generation
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

start = time.perf_counter()
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)
elapsed = time.perf_counter() - start

responses = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
for prompt, response in zip(prompts, responses):
    print(f"Q: {prompt}")
    print(f"A: {response}\n")
print(f"Total time for batch of 2: {elapsed:.2f}s")
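Batch size on CPU is limited by RAM and by compute. Model weights are loaded once and shared across the batch (about 14 GB for 16-bit Mistral-7B, about 3.5 GB at 4-bit); each additional sequence adds KV-cache and activation memory on top, and every decoding step does proportionally more work. Use batch size 1–2 on CPU, and stream single requests when responsiveness matters more than aggregate throughput.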
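A rough way to see how memory grows with batch size is to separate the fixed cost (weights, loaded once) from the per-sequence cost (the key/value cache). The sketch below estimates the KV-cache part only. The architecture numbers (32 layers, 8 key/value heads, head dimension 128) are assumptions taken from Mistral-7B's published configuration, the cache is assumed to be stored in 16-bit values, and the function name is illustrative.

```python
def kv_cache_bytes(
    batch_size: int,
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,
) -> int:
    """Approximate KV-cache size: one key and one value tensor per layer."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# Assumed Mistral-7B-like shape: 32 layers, 8 KV heads, head_dim 128, 16-bit cache.
for batch in (1, 2, 8):
    size = kv_cache_bytes(batch, seq_len=100, num_layers=32, num_kv_heads=8, head_dim=128)
    print(f"batch {batch}: {size / 1e6:.1f} MB of KV cache for 100 tokens")
```

Under these assumptions the cache grows linearly with batch size and sequence length, while the weights stay constant, so long contexts matter more for memory than a small increase in batch size.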
Profile to Find Bottlenecks
Use PyTorch's profiler to identify what's slow:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
torch.set_num_threads(os.cpu_count() // 2)
model_name = "microsoft/Phi-3-mini-4k-instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt")
# Profile the generate call
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU],
    record_shapes=True
) as prof:
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_length=50)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
Output shows which operations consume the most CPU time. Common bottlenecks:
- Embedding lookup (first token)
- Matrix multiply (dominant in attention and feedforward)
- Softmax (normalization in attention)
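To compare bottlenecks like these at a glance, it can help to turn the profiler's per-operator times into percentage shares. The helper below is a sketch using only the standard library. cpu_time_shares is a hypothetical name, and the sample rows are made-up placeholders rather than measurements. With a real profile, the rows could be built as [(e.key, e.self_cpu_time_total) for e in prof.key_averages()], assuming those attributes are present in your PyTorch version.

```python
from typing import Iterable, List, Tuple


def cpu_time_shares(
    rows: Iterable[Tuple[str, float]], top: int = 5
) -> List[Tuple[str, float]]:
    """Return the `top` operators by self CPU time as (name, percent of total)."""
    rows = list(rows)
    total = sum(t for _, t in rows)
    if total <= 0:
        return []
    ranked = sorted(rows, key=lambda r: r[1], reverse=True)[:top]
    return [(name, 100.0 * t / total) for name, t in ranked]


# Placeholder numbers purely for illustration; they are not profiler output.
sample = [("aten::mm", 700.0), ("aten::softmax", 200.0), ("aten::embedding", 100.0)]
for name, pct in cpu_time_shares(sample):
    print(f"{name}: {pct:.0f}%")
```

Using self time rather than total time avoids counting a parent operator and its children twice.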
Ollama for Simplified CPU Inference
For maximum simplicity on CPU, use Ollama's pre-quantized models:
# Install Ollama
# Run in terminal: ollama serve
# Pull a CPU-friendly model
ollama pull phi # 2.7B, very fast
from ollama import Client
client = Client(host='http://localhost:11434')
# Stream responses from CPU
print("Output: ", end="", flush=True)
for chunk in client.generate(
    model='phi',
    prompt='Explain machine learning.',
    stream=True
):
    print(chunk['response'], end="", flush=True)
print()
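Ollama handles quantization and multithreading automatically. In this comparison it is rated slightly slower than a tuned PyTorch setup, though results vary by model and hardware, and it is much easier to use.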
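To check throughput on your own machine, you can read it from the statistics Ollama reports. The sketch below assumes that the final streamed chunk (the one with done set to true) carries eval_count (generated tokens) and eval_duration (nanoseconds), as described in the Ollama API documentation. The helper name and the example values are hypothetical.

```python
def ollama_tokens_per_second(final_chunk: dict) -> float:
    """Decode throughput from the last chunk of an Ollama generate response."""
    eval_count = final_chunk["eval_count"]            # number of generated tokens
    eval_duration_ns = final_chunk["eval_duration"]   # generation time in nanoseconds
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


# Hypothetical final chunk: 120 tokens generated in 6 seconds.
example = {"done": True, "eval_count": 120, "eval_duration": 6_000_000_000}
print(f"{ollama_tokens_per_second(example):.1f} tokens/s")
```

In the streaming loop above, you would keep a reference to the last chunk and pass it to this helper once the loop finishes.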
Comparison Table: CPU Optimization Techniques
| Technique | Relative speed | RAM use | Complexity | Best For |
|---|---|---|---|---|
| PyTorch + multithreading | 1× | High | Low | Learning |
| PyTorch + 4-bit quantization | 1× | Low | Medium | Production, small hardware |
| ONNX Runtime | 2–3× | Low | High | Maximum performance |
| Ollama | 0.8× | Low | Very low | Simplicity |
Key Takeaways
- Enable multithreading with torch.set_num_threads(num_cores) to use all CPU cores.
- Quantize aggressively (4-bit) on CPU to fit in available RAM.
- Use ONNX Runtime for 2–3× speed-up over PyTorch.
- Batch size on CPU is limited by RAM; use batch size 1–2.
- Profile with torch.profiler to find bottlenecks.
Frequently Asked Questions
Is CPU inference ever faster than GPU for small models?
For very small models (< 1B) on old GPUs (Tesla K80), CPU may be competitive. For modern GPUs (V100+) and models >= 3B, GPU is almost always faster.
Can I use CPU during model loading and GPU during inference?
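Yes. Load on CPU with model = Model.from_pretrained(...) then .to("cuda"). The weights are read into CPU RAM first and copied to the GPU once, so the CPU→GPU transfer is a one-time cost at startup rather than something paid during inference.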
How much faster is ONNX Runtime than PyTorch on CPU?
2–3× typical; up to 5× for optimized models. The exact speedup depends on model architecture and CPU.
Can I run LLMs on a Raspberry Pi (ARM CPU)?
Yes, with aggressive quantization and small models (around 1B parameters). Phi-1.5 quantized to 4-bit is about 1 GB and runs on Raspberry Pi 5 (8 GB), but slowly (1–2 tokens/second).
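Does PyTorch use only 1 thread by default?
Usually not. PyTorch typically derives its intra-op thread count from the available cores, but environment variables such as OMP_NUM_THREADS or container CPU limits can reduce it, sometimes to 1. Check the current value with torch.get_num_threads(). For consistency and reproducibility, keep the thread count fixed: changing it can alter the order of floating-point reductions, which can vary results slightly.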
Further Reading
- PyTorch CPU Optimization — Official guide.
- ONNX Runtime Performance — Benchmarks and tuning.
- Phi-3 Model Card — CPU-friendly model details.
- Ollama Documentation — Simple CPU/GPU inference.