Optimizing Local LLM Inference: Benchmarking and Tuning
Inference optimization is the art of making your LLM faster without sacrificing quality. Most practitioners guess which optimizations help; instead, you should measure. This tutorial teaches systematic benchmarking: quantifying latency, throughput, quality trade-offs, and identifying the bottleneck (GPU memory? attention computation? tokenization?).
By the end, you'll be able to benchmark any setup, profile its bottlenecks, and apply data-driven optimizations that can reduce latency by 30–50%.
Setting Up Benchmarking Infrastructure
Create a benchmark suite that measures latency, throughput, and peak memory, and loads an MMLU sample for later quality checks:
import statistics
import time

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer


class LLMBenchmark:
    def __init__(self, model_name, device="cuda"):
        self.model_name = model_name
        self.device = device
        # Load the tokenizer; pad on the left so batched generation works for decoder-only models
        self.tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        # Load the model in FP16 on GPU, FP32 on CPU
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16 if device == "cuda" else torch.float32,
        ).to(device)
        self.model.eval()
        # Load the first 100 MMLU test questions for quality checks
        self.dataset = load_dataset("cais/mmlu", "all", split="test[:100]")

    def _sync(self):
        """Wait for queued GPU work so wall-clock timings are accurate."""
        if self.device == "cuda":
            torch.cuda.synchronize()

    def measure_latency(self, prompt, num_runs=5, max_new_tokens=100):
        """Measure end-to-end single-request latency (prompt processing plus decoding)."""
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        # Warm-up run so one-time CUDA initialization is not timed
        with torch.no_grad():
            self.model.generate(**inputs, max_new_tokens=8)
        latencies = []
        for _ in range(num_runs):
            self._sync()
            start = time.perf_counter()
            with torch.no_grad():
                self.model.generate(**inputs, max_new_tokens=max_new_tokens)
            self._sync()
            latencies.append(time.perf_counter() - start)
        return {
            "mean": statistics.mean(latencies),
            "min": min(latencies),
            "max": max(latencies),
            "std": statistics.pstdev(latencies),
        }

    def measure_throughput(self, batch_size=4, num_batches=10, max_new_tokens=100):
        """Measure generated tokens per second (higher is better)."""
        prompts = ["Hello, how are you?"] * batch_size
        inputs = self.tokenizer(prompts, return_tensors="pt", padding=True).to(self.device)
        prompt_len = inputs["input_ids"].shape[1]
        total_tokens = 0
        self._sync()
        start = time.perf_counter()
        for _ in range(num_batches):
            with torch.no_grad():
                output_ids = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
            # Count only newly generated positions, not the prompt
            # (an upper bound if some sequences stop early and are padded)
            total_tokens += output_ids[:, prompt_len:].numel()
        self._sync()
        return total_tokens / (time.perf_counter() - start)

    def measure_memory(self, max_new_tokens=100):
        """Measure peak GPU memory allocated during generation, in GB."""
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
        inputs = self.tokenizer("Test prompt", return_tensors="pt").to(self.device)
        with torch.no_grad():
            self.model.generate(**inputs, max_new_tokens=max_new_tokens)
        return torch.cuda.max_memory_allocated() / 1e9


# Run the benchmark
benchmark = LLMBenchmark("mistral-community/Mistral-7B-Instruct-v0.3")
print("Latency (single request):")
latency = benchmark.measure_latency("What is Python?")
print(f"  Mean: {latency['mean']:.2f}s")
print(f"  Std: {latency['std']:.2f}s")
print("\nThroughput (batch_size=4):")
throughput = benchmark.measure_throughput(batch_size=4)
print(f"  Tokens/second: {throughput:.0f}")
print("\nMemory:")
memory = benchmark.measure_memory()
print(f"  Peak: {memory:.1f} GB")
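Mean and standard deviation can hide occasional slow requests. As a minimal sketch using only the standard library, the hypothetical helper `summarize_latencies` below adds median and 95th-percentile figures to the kind of statistics that `measure_latency` returns. The sample timings are made up for illustration.

```python
import statistics


def summarize_latencies(latencies):
    """Summarize latency samples (seconds), including a tail percentile."""
    ordered = sorted(latencies)
    # Nearest-rank 95th percentile: smallest sample with at least 95% of samples at or below it
    rank = max(1, -(-95 * len(ordered) // 100))  # ceil(0.95 * n) using integer arithmetic
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "std": statistics.pstdev(ordered),
    }


# Made-up timings: nine typical runs and one slow outlier
samples = [2.4, 2.5, 2.5, 2.6, 2.5, 2.4, 2.6, 2.5, 2.5, 4.0]
summary = summarize_latencies(samples)
print(summary)
```

Here the single outlier pulls the mean above the median, and the 95th percentile exposes it directly.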
Profiling Bottlenecks
Identify which operations consume the most time:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistral-community/Mistral-7B-Instruct-v0.3"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "Explain deep learning in 100 words."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Profile CPU and CUDA activity with the PyTorch Profiler
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
    record_shapes=True,
    on_trace_ready=torch.profiler.tensorboard_trace_handler("/tmp/llm_profiling"),
) as prof:
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=100)

# Print the top operations by total CUDA time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
The output lists the time spent in each operation. For a 7B model, a typical breakdown is:
- Matrix multiply (linear layers): 60–70% of time
- Attention (softmax, score computation): 20–25%
- Embeddings: 5–10%
Optimizing the bottleneck (matrix multiply) gives the biggest gains.
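To see why the largest share matters most, the sketch below applies Amdahl's law to the breakdown above. The function name `overall_speedup` and the 2× kernel speedup are illustrative assumptions, and the shares are simply the midpoints of the quoted ranges.

```python
def overall_speedup(fraction, op_speedup):
    """Amdahl's law: end-to-end speedup when one part of the work gets faster.

    fraction   -- share of total time spent in the optimized operation (0 to 1)
    op_speedup -- how many times faster that operation becomes
    """
    return 1.0 / ((1.0 - fraction) + fraction / op_speedup)


# Midpoints of the ranges quoted above
matmul_share = 0.65
attention_share = 0.225

# The same hypothetical 2x kernel speedup applied to each operation in turn
print(f"2x faster matmul:    {overall_speedup(matmul_share, 2.0):.2f}x overall")
print(f"2x faster attention: {overall_speedup(attention_share, 2.0):.2f}x overall")
```

The same kernel-level improvement yields a much larger end-to-end gain when it lands on the operation that dominates the profile.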
Optimization Techniques and Trade-Offs
Technique 1: Flash Attention (Speed +20–30%, Quality: unchanged)
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistral-community/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # Enable Flash Attention 2 (requires the flash-attn package)
)
model = model.cuda()
Flash Attention reduces attention memory from O(N²) to O(N) in sequence length N and speeds up the attention computation by 2–3×. There is no quality loss because it computes exact attention; use it whenever your GPU supports it (for example A100, RTX 4090, H100).
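To make the O(N²) memory cost concrete, here is a back-of-the-envelope sketch of the size of the full attention score matrix that standard attention materializes for one layer. The helper name `attention_scores_gb` is hypothetical, and the defaults assume 32 attention heads, FP16 values (2 bytes), and a batch of one.

```python
def attention_scores_gb(seq_len, num_heads=32, batch_size=1, bytes_per_value=2):
    """Memory (GB) for one layer's full N x N attention score matrix."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value / 1e9


for n in (2048, 8192):
    print(f"N={n}: {attention_scores_gb(n):.2f} GB per layer")
```

Quadrupling the sequence length multiplies this buffer by 16, which is the growth Flash Attention avoids by never storing the full matrix.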
Technique 2: Quantization (Speed -5–10%, Memory -75%, Quality -0.5–1%)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Store weights in 4-bit and run the matrix multiplies in FP16
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    "mistral-community/Mistral-7B-Instruct-v0.3",
    quantization_config=bnb_config,
)
4-bit quantization saves VRAM but slightly reduces quality and inference speed. Use when VRAM is constrained.
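The memory saving follows from simple arithmetic on the weights. The sketch below assumes a round 7 billion parameters and counts weights only; quantization constants, activations, and the KV cache add to the real footprint. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory (GB) needed to store the model weights alone."""
    return num_params * bits_per_weight / 8 / 1e9


params = 7e9  # assumed round figure for a 7B model
fp16_gb = weight_memory_gb(params, 16)
int4_gb = weight_memory_gb(params, 4)
saving = 1 - int4_gb / fp16_gb

print(f"FP16:  {fp16_gb:.1f} GB")
print(f"4-bit: {int4_gb:.1f} GB")
print(f"Saving: {saving:.0%}")
```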
Technique 3: Smaller Model (Speed +2–3×, Memory -50%, Quality variable)
# Instead of Mistral-7B (about 14 GB in FP16, 3s latency)
# use Phi-3-mini (3.8B parameters, about 7.6 GB in FP16, 1.2s latency)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct", torch_dtype=torch.float16
).cuda()
Phi-3-mini takes roughly 40% of Mistral-7B's latency (about 2.5× faster) and uses about half the VRAM. For throughput, this is a net positive: the freed memory allows larger batches per GPU.
Technique 4: Batch Processing (Throughput +2–4× at similar latency)
# Decoder-only models need a pad token and left padding for batched generation
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
batch = ["Prompt 1", "Prompt 2", "Prompt 3", "Prompt 4"]
inputs = tokenizer(batch, return_tensors="pt", padding=True).to("cuda")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=100)
Batching processes 4 examples in nearly the time of one, so throughput rises by close to 4× as GPU utilization increases. Per-request latency stays roughly the same.
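When there are more prompts than fit in one batch, they need to be split into fixed-size groups first. The following is a minimal sketch with a hypothetical helper, `make_batches`; each resulting group could then be tokenized and passed to `generate` as in the snippet above.

```python
def make_batches(prompts, batch_size):
    """Split a list of prompts into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]


prompts = [f"Prompt {i}" for i in range(1, 11)]
batches = make_batches(prompts, batch_size=4)
print([len(b) for b in batches])
```

The last batch may be smaller than the others, so throughput figures are most comparable when the prompt count is a multiple of the batch size.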
Technique 5: Reduce Context Window (Speed +10–20%, Quality -5–10% if context is used)
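# Cap prompts at 2,048 tokens instead of filling the model's full context window.
# Changing model.config.max_position_embeddings alone does not shorten the inputs,
# so truncate at tokenization time.
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048).to("cuda")
A shorter context reduces attention computation, which grows as O(N²) in sequence length. Quality drops only if your prompts rely on the long context.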
Benchmarking Trade-Offs
Create a benchmark comparing speed and memory across configurations:
import gc

import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = "mistral-community/Mistral-7B-Instruct-v0.3"


class PreloadedBenchmark(LLMBenchmark):
    """Reuse the LLMBenchmark measurements with an already-loaded model."""

    def __init__(self, model, tokenizer, device="cuda"):
        self.model, self.tokenizer, self.device = model, tokenizer, device


configs = [
    {"name": "Baseline (FP16)", "quantization": None, "attn": None, "context": 8192},
    {"name": "Flash Attention", "quantization": None, "attn": "flash_attention_2", "context": 8192},
    {"name": "4-bit Quantization", "quantization": "4bit", "attn": None, "context": 8192},
    {"name": "Flash + 4-bit", "quantization": "4bit", "attn": "flash_attention_2", "context": 8192},
    {"name": "Reduced Context (2K)", "quantization": None, "attn": None, "context": 2048},
]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

results = []
for config in configs:
    # Build the loading arguments for this configuration
    load_kwargs = {"torch_dtype": torch.float16}
    if config["attn"] is not None:
        load_kwargs["attn_implementation"] = config["attn"]
    if config["quantization"] == "4bit":
        load_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16
        )
        load_kwargs["device_map"] = {"": 0}  # place quantized weights on GPU 0 at load time
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, **load_kwargs)
    if config["quantization"] is None:
        model = model.cuda()  # 4-bit models cannot be moved with .cuda()

    # Note: config["context"] only matters for prompts longer than the cap. To exercise it,
    # benchmark a long prompt tokenized with truncation=True, max_length=config["context"].

    # Benchmark the model that was just loaded, not a fresh default one
    benchmark = PreloadedBenchmark(model, tokenizer)
    results.append({
        "Config": config["name"],
        "Latency (s)": benchmark.measure_latency("Test prompt")["mean"],
        "Throughput (tokens/s)": benchmark.measure_throughput(),
        "Memory (GB)": benchmark.measure_memory(),
    })

    # Free GPU memory before loading the next configuration
    del benchmark, model
    gc.collect()
    torch.cuda.empty_cache()

df = pd.DataFrame(results)
print(df.to_string(index=False))
Example output (illustrative):
Config                Latency (s)  Throughput (tokens/s)  Memory (GB)
Baseline (FP16)              2.50                     40         14.0
Flash Attention              1.80                     55         14.0
4-bit Quantization           2.65                     38          3.5
Flash + 4-bit                1.95                     52          3.5
Reduced Context (2K)         2.10                     48         13.5
In this example, Flash Attention cuts latency by about 28% (2.50s to 1.80s) with no quality loss. Flash + 4-bit matches the lowest VRAM footprint while keeping most of that speed-up.
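Relative changes are easier to compare than raw numbers. The sketch below recomputes them from the latency and throughput columns of the example output above; the helper name `percent_change` is illustrative.

```python
def percent_change(new, old):
    """Signed percentage change from old to new."""
    return (new - old) / old * 100


baseline = {"latency": 2.50, "throughput": 40}
candidates = {
    "Flash Attention": {"latency": 1.80, "throughput": 55},
    "4-bit Quantization": {"latency": 2.65, "throughput": 38},
    "Flash + 4-bit": {"latency": 1.95, "throughput": 52},
    "Reduced Context (2K)": {"latency": 2.10, "throughput": 48},
}

changes = {
    name: {
        "latency_pct": percent_change(row["latency"], baseline["latency"]),
        "throughput_pct": percent_change(row["throughput"], baseline["throughput"]),
    }
    for name, row in candidates.items()
}

for name, change in changes.items():
    print(f"{name:22s} latency {change['latency_pct']:+.1f}%  throughput {change['throughput_pct']:+.1f}%")
```

Negative latency changes and positive throughput changes are improvements; 4-bit quantization alone moves both in the wrong direction, which is the price of its memory saving.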
Iterative Tuning Checklist
- Baseline: Measure unoptimized latency and memory.
- Profile: Find the bottleneck (attention? FFN? memory bandwidth?).
- Apply one optimization (e.g., Flash Attention).
- Remeasure: Confirm the speedup.
- Add second optimization (e.g., quantization).
- Repeat until hitting a constraint (VRAM limit, quality threshold).
- Document: Record the final config for reproducibility.
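For the final "Document" step, one simple option is to write the chosen configuration and its measurements to a JSON file. This is a minimal sketch: the helper `save_run_record`, the file name, and the field names are all hypothetical, and the values are taken from the example output above.

```python
import json
import os
import tempfile


def save_run_record(path, config, metrics):
    """Write a benchmark configuration and its measurements to a JSON file."""
    record = {"config": config, "metrics": metrics}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2, sort_keys=True)
    return record


# Hypothetical output location in the system temp directory
path = os.path.join(tempfile.gettempdir(), "llm_benchmark_record.json")
saved = save_run_record(
    path,
    config={"name": "Flash + 4-bit", "quantization": "4bit", "attn": "flash_attention_2"},
    metrics={"latency_s": 1.95, "throughput_tokens_per_s": 52},
)

# Read the record back to confirm it round-trips
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded)
```

Storing library versions and the GPU model alongside these fields would make later comparisons more trustworthy.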
Key Takeaways
- Benchmark latency, throughput, and memory separately; they optimize differently.
- Profile to identify bottlenecks (usually matrix multiply in transformers).
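- Flash Attention is an essentially free 20–30% speedup; use it whenever your GPU supports it.
- 4-bit quantization saves about 75% of VRAM for roughly 1% quality loss and a small speed penalty.
- Batching increases throughput by 2–4× with little change in per-request latency.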
- Smaller models often have better throughput per dollar than larger models.
Frequently Asked Questions
How do I know if my optimization helped?
Measure before and after. Record latency, throughput, memory, and quality metrics. A 10% speedup is measurable; smaller improvements may be noise.
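One way to separate a real improvement from noise is to compare the change in mean latency with the run-to-run spread. The sketch below uses an assumed rule of thumb (the improvement must be at least 10% and more than twice the relative standard deviation); the helper name, thresholds, and timings are all illustrative rather than a formal statistical test.

```python
import statistics


def speedup_exceeds_noise(before, after, min_speedup=0.10, noise_factor=2.0):
    """Return True if the mean latency improvement is larger than run-to-run noise."""
    mean_before = statistics.mean(before)
    mean_after = statistics.mean(after)
    speedup = (mean_before - mean_after) / mean_before
    # Relative noise: the larger sample standard deviation, as a share of the baseline mean
    noise = max(statistics.stdev(before), statistics.stdev(after)) / mean_before
    return speedup >= min_speedup and speedup > noise_factor * noise


# Made-up latency samples in seconds
baseline_runs = [2.50, 2.55, 2.45, 2.52, 2.48]
clear_win = [1.80, 1.85, 1.78, 1.82, 1.75]
marginal = [2.45, 2.50, 2.40, 2.48, 2.42]

print(speedup_exceeds_noise(baseline_runs, clear_win))
print(speedup_exceeds_noise(baseline_runs, marginal))
```

With more runs per configuration, a proper significance test would be a sturdier choice than this threshold.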
Is there a "one-size-fits-all" optimal configuration?
No. The best configuration depends on your constraint. VRAM-constrained? Use 4-bit quantization and a small model. Latency-critical? Use Flash Attention with the larger model. Throughput-critical? Increase the batch size and use a smaller model.
Can I optimize inference without changing the model?
Yes. Flash Attention, batching, and context reduction improve speed without retraining or altering the weights. For larger gains, quantize the model (mainly a memory saving) or switch to a smaller one.
How do I measure quality during optimization?
Run inference on a benchmark dataset (MMLU, Hellaswag) and compare results before/after. A 1–2% drop on benchmarks is usually imperceptible to users.
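The before/after comparison reduces to computing accuracy twice and taking the difference. The sketch below uses made-up multiple-choice answers and predictions; in practice the predictions would come from running each model configuration over the benchmark questions. The helper name `accuracy` is illustrative.

```python
def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one example is required")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


# Made-up reference answers and model outputs for ten questions
answers = ["A", "C", "B", "D", "A", "B", "C", "D", "A", "B"]
baseline_preds = ["A", "C", "B", "D", "A", "B", "C", "D", "B", "A"]
optimized_preds = ["A", "C", "B", "D", "A", "B", "C", "A", "B", "A"]

baseline_acc = accuracy(baseline_preds, answers)
optimized_acc = accuracy(optimized_preds, answers)
drop = baseline_acc - optimized_acc
print(f"Baseline: {baseline_acc:.0%}, optimized: {optimized_acc:.0%}, drop: {drop:.0%}")
```

Ten questions are far too few to resolve a 1–2% difference; a sample in the hundreds or more is needed before a drop of that size is meaningful.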
What's the relationship between latency and throughput?
Latency is the time per request, in seconds. Throughput is the number of requests or tokens processed per second. Batching increases throughput with little change in latency. Smaller models reduce latency but may increase cost per task.
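The relationship can be written as throughput = batch size × tokens per request ÷ batch latency. The sketch below plugs in hypothetical numbers, assuming a batch of four takes slightly longer than a single request; the helper name `tokens_per_second` is illustrative.

```python
def tokens_per_second(batch_size, new_tokens_per_request, batch_latency_s):
    """Throughput in generated tokens per second for one batch."""
    return batch_size * new_tokens_per_request / batch_latency_s


# Hypothetical timings: one request takes 2.5 s, a batch of four takes 2.8 s
single = tokens_per_second(1, 100, 2.5)
batched = tokens_per_second(4, 100, 2.8)

print(f"Single request: {single:.0f} tokens/s")
print(f"Batch of 4:     {batched:.0f} tokens/s ({batched / single:.2f}x)")
```

Each request in the batch still waits the full 2.8 s, so latency barely moves while throughput rises by a factor close to the batch size.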
Further Reading
- PyTorch Profiler Guide — Official profiling documentation.
- Flash Attention Paper — Algorithm and benchmarks.
- vLLM Benchmarks — Comparison of optimization techniques.
- LLM Inference Optimization Survey — Comprehensive overview of techniques.