Skip to main content

Optimizing Local LLM Inference: Benchmarking and Tuning

Inference optimization is the art of making your LLM faster without sacrificing quality. Most practitioners guess which optimizations help; instead, you should measure. This tutorial teaches systematic benchmarking: quantifying latency, throughput, quality trade-offs, and identifying the bottleneck (GPU memory? attention computation? tokenization?).

By the end, you'll benchmark any setup, profile bottlenecks, and apply data-driven optimizations that reduce latency by 30–50%.

Setting Up Benchmarking Infrastructure

Create a benchmark suite that measures latency, throughput, and quality:

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import json

class LLMBenchmark:
def __init__(self, model_name, device="cuda"):
self.model_name = model_name
self.device = device

# Load model
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16 if device == "cuda" else torch.float32
)
self.model = self.model.to(device)

# Load evaluation dataset (MMLU for quality)
self.dataset = load_dataset("cais/mmlu", "all")["test"][:100]

def measure_latency(self, prompt, num_runs=5):
"""Measure single-request latency (time to first token + time per token)"""
latencies = []

for _ in range(num_runs):
inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)

torch.cuda.synchronize() if self.device == "cuda" else None
start = time.time()

with torch.no_grad():
output_ids = self.model.generate(**inputs, max_length=100)

torch.cuda.synchronize() if self.device == "cuda" else None
end = time.time()

latencies.append(end - start)

return {
"mean": sum(latencies) / len(latencies),
"min": min(latencies),
"max": max(latencies),
"std": (sum((x - sum(latencies)/len(latencies))**2 for x in latencies) / len(latencies))**0.5
}

def measure_throughput(self, batch_size=4, num_batches=10):
"""Measure tokens/second (higher is better)"""
prompts = ["Hello, how are you?" for _ in range(batch_size)]

self.tokenizer.pad_token = self.tokenizer.eos_token
inputs = self.tokenizer(prompts, return_tensors="pt", padding=True).to(self.device)

torch.cuda.synchronize() if self.device == "cuda" else None
start = time.time()

total_tokens = 0
for _ in range(num_batches):
with torch.no_grad():
output_ids = self.model.generate(**inputs, max_length=100)
total_tokens += output_ids.numel()

torch.cuda.synchronize() if self.device == "cuda" else None
end = time.time()

return total_tokens / (end - start)

def measure_memory(self):
"""Measure peak GPU memory usage"""
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

prompt = "Test prompt"
inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)

with torch.no_grad():
self.model.generate(**inputs, max_length=100)

peak_memory = torch.cuda.max_memory_allocated() / 1e9 # GB
return peak_memory

# Run benchmark
benchmark = LLMBenchmark("mistral-community/Mistral-7B-Instruct-v0.3")

print("Latency (single request):")
latency = benchmark.measure_latency("What is Python?")
print(f" Mean: {latency['mean']:.2f}s")
print(f" Std: {latency['std']:.2f}s")

print("\nThroughput (batch_size=4):")
throughput = benchmark.measure_throughput(batch_size=4)
print(f" Tokens/second: {throughput:.0f}")

print("\nMemory:")
memory = benchmark.measure_memory()
print(f" Peak: {memory:.1f} GB")

Profiling Bottlenecks

Identify which operations consume the most time:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("mistral-community/Mistral-7B-Instruct-v0.3", torch_dtype=torch.float16).cuda()
tokenizer = AutoTokenizer.from_pretrained("mistral-community/Mistral-7B-Instruct-v0.3")

prompt = "Explain deep learning in 100 words."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Profile with PyTorch Profiler
with torch.profiler.profile(
activities=[torch.profiler.ProfilerActivity.CUDA, torch.profiler.ProfilerActivity.CPU],
record_shapes=True,
on_trace_ready=torch.profiler.tensorboard_trace_handler('/tmp/llm_profiling')
) as prof:
with torch.no_grad():
output_ids = model.generate(**inputs, max_length=100)

# Print top operations by time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))

Output shows operation timing. For a 7B model:

  • Matrix multiply (linear layers): 60–70% of time
  • Attention (softmax, score computation): 20–25%
  • Embeddings: 5–10%

Optimizing the bottleneck (matrix multiply) gives the biggest gains.

Optimization Techniques and Trade-Offs

Technique 1: Flash Attention (Speed +20–30%, Quality: unchanged)

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
"mistral-community/Mistral-7B-Instruct-v0.3",
torch_dtype=torch.float16,
attn_implementation="flash_attention_2" # Enable Flash Attention 2
)
model = model.cuda()

Flash Attention reduces attention memory from O(N²) to O(N) and speeds it by 2–3. No quality loss; always use if your GPU supports it (A100, RTX 4090, H100).

Technique 2: Quantization (Speed +5–10%, Memory -75%, Quality -0.5–1%)

from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
"mistral-community/Mistral-7B-Instruct-v0.3",
quantization_config=bnb_config
)

4-bit quantization saves VRAM but slightly reduces quality and inference speed. Use when VRAM is constrained.

Technique 3: Smaller Model (Speed +2–3, Memory -50%, Quality variable)

# Instead of Mistral-7B (14 GB, 3s latency)
# Use Phi-3 (3.8 GB, 1.2s latency)
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

Phi-3 is 40% slower than Mistral-7B but uses 3 less VRAM. For throughput, this is a net positive (more parallelism per GPU).

Technique 4: Batch Processing (Throughput +2–4 at same latency)

tokenizer.pad_token = tokenizer.eos_token
batch = ["Prompt 1", "Prompt 2", "Prompt 3", "Prompt 4"]
inputs = tokenizer(batch, return_tensors="pt", padding=True).to("cuda")

with torch.no_grad():
output_ids = model.generate(**inputs, max_length=100)

Batching processes 4 examples nearly 4 faster (GPU utilization increases). Single-request latency is unchanged; throughput increases.

Technique 5: Reduce Context Window (Speed +10–20%, Quality -5–10% if context is used)

# Default context: 8192 tokens for Mistral
# Reduce to 2048 for faster inference
model.config.max_position_embeddings = 2048

Smaller context reduces attention computation (O(N²) scales faster). Quality drops only if your prompts use long context.

Benchmarking Trade-Offs

Create a benchmark comparing speed vs. quality:

import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

configs = [
{"name": "Baseline (FP16)", "quantization": None, "attn": "default", "context": 8192},
{"name": "Flash Attention", "quantization": None, "attn": "flash_attention_2", "context": 8192},
{"name": "4-bit Quantization", "quantization": "4bit", "attn": "default", "context": 8192},
{"name": "Flash + 4-bit", "quantization": "4bit", "attn": "flash_attention_2", "context": 8192},
{"name": "Reduced Context (2K)", "quantization": None, "attn": "default", "context": 2048},
]

results = []

for config in configs:
# Load model with config
if config["quantization"] == "4bit":
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
"mistral-community/Mistral-7B-Instruct-v0.3",
quantization_config=bnb_config,
attn_implementation=config["attn"]
)
else:
model = AutoModelForCausalLM.from_pretrained(
"mistral-community/Mistral-7B-Instruct-v0.3",
torch_dtype=torch.float16,
attn_implementation=config["attn"]
)

model = model.cuda()
tokenizer = AutoTokenizer.from_pretrained("mistral-community/Mistral-7B-Instruct-v0.3")

# Benchmark
benchmark = LLMBenchmark("mistral-community/Mistral-7B-Instruct-v0.3")
latency = benchmark.measure_latency("Test prompt")["mean"]
throughput = benchmark.measure_throughput()
memory = benchmark.measure_memory()

results.append({
"Config": config["name"],
"Latency (s)": latency,
"Throughput (tokens/s)": throughput,
"Memory (GB)": memory
})

df = pd.DataFrame(results)
print(df.to_string(index=False))

Example output:

Config                  Latency (s)  Throughput (tokens/s)  Memory (GB)
Baseline (FP16) 2.50 40 7.0
Flash Attention 1.80 55 7.0
4-bit Quantization 2.65 38 3.5
Flash + 4-bit 1.95 52 3.5
Reduced Context (2K) 2.10 48 6.5

Flash Attention provides 30% speed-up with zero quality loss. Flash + 4-bit offers the best VRAM efficiency.

Iterative Tuning Checklist

  1. Baseline: Measure unoptimized latency and memory.
  2. Profile: Find the bottleneck (attention? FFN? memory bandwidth?).
  3. Apply one optimization (e.g., Flash Attention).
  4. Remeasure: Confirm the speedup.
  5. Add second optimization (e.g., quantization).
  6. Repeat until hitting a constraint (VRAM limit, quality threshold).
  7. Document: Record the final config for reproducibility.

Key Takeaways

  • Benchmark latency, throughput, and memory separately; they optimize differently.
  • Profile to identify bottlenecks (usually matrix multiply in transformers).
  • Flash Attention is a free 20–30% speedup; always use it.
  • Quantization saves 75% VRAM with 1% quality loss.
  • Batching increases throughput by 2–4 without increasing latency.
  • Smaller models often have better throughput per dollar than larger models.

Frequently Asked Questions

How do I know if my optimization helped?

Measure before and after. Record latency, throughput, memory, and quality metrics. A 10% speedup is measurable; smaller improvements may be noise.

Is there a "one-size-fits-all" optimal configuration?

No. Optimization depends on your constraint: VRAM-constrained? Use 4-bit + small model. Latency-critical? Use Flash Attention + larger model. Throughput-critical? Batch size and smaller model.

Can I optimize inference without changing the model?

Yes. Flash Attention, batching, and context reduction improve speed without retraining. For aggressive speedup, quantization or switching models are needed.

How do I measure quality during optimization?

Run inference on a benchmark dataset (MMLU, Hellaswag) and compare results before/after. A 1–2% drop on benchmarks is usually imperceptible to users.

What's the relationship between latency and throughput?

Latency = time per request (seconds). Throughput = requests/tokens per second. Batching increases throughput without increasing latency. Smaller models reduce latency but may increase cost per task.

Further Reading