GPU Inference with PyTorch and Python
GPU inference transforms LLM performance from CPU's 10–20 tokens/second to GPU's 50–100 tokens/second or more. PyTorch is the standard deep learning framework for this task, offering straightforward GPU abstractions, automatic mixed-precision (AMP), and multi-GPU distribution.
This tutorial covers GPU memory management, mixed-precision inference, performance monitoring, troubleshooting out-of-memory errors, and multi-GPU strategies. The examples target NVIDIA GPUs with CUDA; by the end, you'll know how to maximize throughput and minimize latency there, and the same ideas carry over to AMD and Apple GPUs.
GPU Setup and Verification
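Ensure an NVIDIA driver is installed and that your PyTorch build is CUDA-enabled (the pip wheels bundle their own CUDA runtime, so a separate CUDA toolkit install is not required):
# Check GPU availability
nvidia-smi # Lists GPUs, driver and CUDA versions, memory usage, and utilization
# Install PyTorch with CUDA 12.1 support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Or a CPU-only build (for testing)
pip install torch --index-url https://download.pytorch.org/whl/cpu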
Verify PyTorch sees your GPU:
import torch
print(f"GPU available: {torch.cuda.is_available()}")
print(f"GPU device: {torch.cuda.get_device_name(0)}")
print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
print(f"CUDA version: {torch.version.cuda}")
# Allocate a tensor on GPU
x = torch.randn(1000, 1000, device="cuda")
print(f"Tensor created on {x.device}")
Example output for an RTX 4090 (exact strings and values vary by driver and PyTorch build):
GPU available: True
GPU device: NVIDIA GeForce RTX 4090
GPU memory: 24.0 GB
CUDA version: 12.1
Tensor created on cuda:0
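Because the check above fails outright on a machine without CUDA, a device picker with fallbacks can be useful. The following is a minimal sketch, not part of the original setup: the helper names choose_device and detect_device are hypothetical, and it assumes torch.backends.mps is the right check for Apple Silicon (ROCm builds of PyTorch report through torch.cuda).

```python
import importlib.util


def choose_device(cuda_ok: bool, mps_ok: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend, then CPU."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Return the best available device, or "cpu" if PyTorch is not installed."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    # Older PyTorch builds have no torch.backends.mps, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps is not None and mps.is_available())
    return choose_device(torch.cuda.is_available(), mps_ok)


print(f"Using device: {detect_device()}")
```

The returned string can be passed wherever the later examples hard-code "cuda", for instance model.to(detect_device()).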
Moving Models to GPU
Move a model and its inputs to GPU before inference:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "mistral-community/Mistral-7B-Instruct-v0.3"
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load on CPU first, then move to GPU
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model = model.to(device) # Move to GPU
# Prepare input on the same device
prompt = "What is machine learning?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
# Generate output (runs on GPU)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=100)
response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(response)
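Keep the inputs and the model on the same device; otherwise PyTorch raises a RuntimeError along the lines of "Expected all tensors to be on the same device". The .to(device) call copies the model's weights, or the input tensors, into that device's memory.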
Mixed Precision for Speed and Memory Savings
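Mixed precision uses lower-precision formats (float16, bfloat16) for computation where possible and float32 only when necessary. This typically reduces VRAM by 30–50% and improves speed by 10–20%. The simplest variant, shown here, loads the whole model in float16, which halves the memory needed for the weights: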
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "mistral-community/Mistral-7B-Instruct-v0.3"
# Load model in float16 (half precision)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # 16-bit instead of 32-bit
)
model = model.cuda() # Move to GPU
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "Explain photosynthesis."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=100)
response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(response)
Memory reduction: at 4 bytes per parameter, Mistral-7B's weights take about 28 GB in float32; at 2 bytes per parameter, they take about 14 GB in float16. Modern NVIDIA GPUs (V100+, RTX 20 series+) have hardware support for float16, making this both faster and more memory-efficient.
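As a sanity check on figures like these, weight memory is simply parameter count times bytes per parameter. The sketch below assumes a "7B" model has exactly 7e9 parameters (real checkpoints differ slightly) and counts weights only; activations and the KV cache come on top. The function name weight_memory_gb is illustrative.

```python
# Bytes used to store one parameter in each format.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


# Assumption: exactly 7e9 parameters.
for dtype in ("float32", "float16", "int4"):
    print(f"{dtype:>8}: {weight_memory_gb(7e9, dtype):.1f} GB")
# float32: 28.0 GB, float16: 14.0 GB, int4: 3.5 GB
```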
Automatic Mixed Precision (AMP) with Autocast
For custom inference loops, use torch.autocast() to choose the precision automatically for each operation:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "mistral-community/Mistral-7B-Instruct-v0.3"
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "What is deep learning?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
# Use autocast for automatic mixed precision
with torch.no_grad():
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        output_ids = model.generate(**inputs, max_length=100)
response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(response)
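Autocast runs matrix multiplications (including those inside attention) in float16 and keeps numerically sensitive operations such as softmax and layer normalization in float32. Because the weights stay in float32 and are cast on the fly, this is generally not faster than loading the model in float16, and the example above needs about 28 GB of VRAM for the weights alone. Autocast's advantage is numerical stability, not speed.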
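To see what autocast does to a single layer, the sketch below runs a tiny linear layer under autocast and reports the dtype of its weights and of its output. It uses CPU autocast with bfloat16 as a stand-in for the CUDA/float16 case so that it runs without a GPU; the function name autocast_dtypes is hypothetical, and it returns None when PyTorch is not installed.

```python
import importlib.util


def autocast_dtypes():
    """Return (weight dtype, output dtype) for a linear layer under autocast,
    or None when PyTorch is not installed."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    layer = torch.nn.Linear(8, 8)  # weights are created in float32
    x = torch.randn(2, 8)
    # CPU autocast with bfloat16 stands in for CUDA autocast with float16.
    with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        y = layer(x)
    return str(layer.weight.dtype), str(y.dtype)


print(autocast_dtypes())
```

The stored weights should remain float32 while the output comes back in the lower precision, which is why autocast does not reduce the memory taken by the weights.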
Multi-GPU Inference
Distribute a model across multiple GPUs to fit models that exceed a single GPU's memory:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "mistral-community/Mistral-7B-Instruct-v0.3"
# Load with device_map="auto" to distribute across GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",  # Automatically split across available GPUs
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompts = [
    "What is Python?",
    "Explain neural networks.",
    "How does Git work?",
]
# Batch inference: decoder-only models need a pad token and left padding
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
# Inputs go to the device that holds the model's first layers
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=100)
for prompt, output_id in zip(prompts, output_ids):
    response = tokenizer.decode(output_id, skip_special_tokens=True)
    print(f"Q: {prompt}\nA: {response}\n")
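With two GPUs, device_map="auto" splits the model's layers across them. The layers still run one after another, so only one GPU computes at a time: the gain is capacity (fitting a model too large for one card) rather than a 2× increase in throughput. Two RTX 4090s (48 GB combined) can hold a 70B model only with 4-bit quantization (about 35 GB of weights); in float16 the weights alone need about 140 GB.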
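Whether a model fits on a given set of GPUs can be estimated before downloading anything. The sketch below is a rough check under stated assumptions: it counts weights only, treats a "70B" model as exactly 70e9 parameters, and reserves a guessed 20% of VRAM for activations and the KV cache. The function name fits_on_gpus and the headroom value are illustrative, not a library API.

```python
def fits_on_gpus(n_params, bytes_per_param, gpu_mem_gb, headroom=0.2):
    """Rough check: do the weights fit in combined VRAM while leaving a
    fraction of it free for activations and the KV cache?"""
    weights_gb = n_params * bytes_per_param / 1e9
    usable_gb = sum(gpu_mem_gb) * (1 - headroom)
    return weights_gb <= usable_gb


two_4090s = [24, 24]
print(fits_on_gpus(70e9, 2, two_4090s))    # float16 weights, 140 GB -> False
print(fits_on_gpus(70e9, 0.5, two_4090s))  # 4-bit weights, 35 GB -> True
print(fits_on_gpus(7e9, 2, two_4090s))     # 7B in float16, 14 GB -> True
```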
Batch Processing for Higher Throughput
Batching multiple examples together increases GPU utilization:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "mistral-community/Mistral-7B-Instruct-v0.3"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # Decoder-only models should be left-padded for generation
# Batch size = 4
batch_size = 4
prompts = ["Question 1?", "Question 2?", "Question 3?", "Question 4?"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True).to("cuda")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=100, num_beams=1)
responses = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
for i, response in enumerate(responses):
    print(f"{i+1}. {response}")
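Batch size depends on model size and free GPU memory. Mistral-7B in float16 needs about 14 GB for its weights, so it does not fit on a 12 GB RTX 4070 without quantization; with a 4-bit model on that card, a batch size of 2–4 is reasonable. On a 24 GB RTX 4090, batch sizes of 8–16 are reasonable for short sequences. Throughput grows roughly linearly with batch size at first, then flattens once the GPU becomes compute-bound or memory is saturated.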
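When there are more prompts than fit in one batch, they need to be fed through in fixed-size groups. The helper below is a minimal sketch of that grouping; the name batched is illustrative, and the tokenizer and model.generate() calls are left out so the example runs on its own.

```python
from typing import Iterator, List, Sequence


def batched(prompts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size prompts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(prompts), batch_size):
        yield list(prompts[start:start + batch_size])


prompts = [f"Question {i}?" for i in range(1, 11)]
for batch in batched(prompts, batch_size=4):
    # Each batch would go through the tokenizer and model.generate();
    # here we only show the grouping.
    print(len(batch), batch[0])
```

With ten prompts and a batch size of 4, this yields groups of 4, 4, and 2.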
Memory Profiling and Optimization
Monitor GPU memory usage:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "mistral-community/Mistral-7B-Instruct-v0.3"
# Empty cache before loading
torch.cuda.empty_cache()
print(f"Before load: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda()
print(f"After model load: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "Test"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
print(f"After input load: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=100)
print(f"After generate: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
# Clear memory
del model, inputs, output_ids
torch.cuda.empty_cache()
print(f"After cleanup: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
Use nvidia-smi in a separate terminal to monitor real-time GPU memory:
watch -n 0.1 nvidia-smi # Updates every 0.1 seconds
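Peak usage during generate() is higher than readings taken before and after it, largely because of the key/value cache that grows with every token. Its size can be estimated from the model's shape. The sketch below assumes Mistral-7B's published configuration (32 layers, 8 key/value heads, head dimension 128) and a float16 cache; the function name kv_cache_bytes is illustrative, and it ignores sliding-window or other cache optimizations.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Size of the key/value cache: two tensors (K and V) per layer, each
    holding n_kv_heads * head_dim values per token per sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Assumed Mistral-7B shape: 32 layers, 8 KV heads, head_dim 128, float16 values.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1, batch_size=1)
print(f"Per token: {per_token / 1024:.0f} KiB")  # 128 KiB

full = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=8)
print(f"Batch of 8 at 4096 tokens: {full / 1e9:.1f} GB")  # 4.3 GB
```

This is why memory use scales with both batch size and sequence length even though the weights stay fixed.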
Troubleshooting Out-of-Memory (OOM) Errors
Issue: RuntimeError: CUDA out of memory
Solutions (in order of ease):
- Clear cache: Call torch.cuda.empty_cache() before loading the model (this only helps if an earlier model or tensor was freed in the same process).
- Reduce batch size: Change batch_size from 4 to 2 or 1.
- Use float16: Load with torch_dtype=torch.float16.
- Use quantization: Load with 4-bit quantization (about 3.5 GB of weights for a 7B model vs. 14 GB in FP16).
- Switch to a smaller model: Use Phi-3 Mini (3.8B, about 7.6 GB in FP16) instead of Mistral-7B (7B, about 14 GB in FP16).
- Add another GPU: Two RTX 4070s (12 GB each) give 24 GB combined.
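Reducing the batch size can also be automated: catch the out-of-memory error, halve the batch, and retry. The sketch below shows that control flow with a stand-in workload so it runs without a GPU. The names run_with_backoff and fake_generate are hypothetical; in real use the oom_error argument would be torch.cuda.OutOfMemoryError (available in recent PyTorch releases) and run_batch would wrap the tokenizer and model.generate().

```python
def run_with_backoff(run_batch, prompts, batch_size, oom_error=MemoryError):
    """Process prompts in batches, halving batch_size whenever run_batch
    raises oom_error. Returns (results, final batch size)."""
    results = []
    i = 0
    while i < len(prompts):
        chunk = prompts[i:i + batch_size]
        try:
            results.extend(run_batch(chunk))
            i += len(chunk)
        except oom_error:
            if batch_size == 1:
                raise  # cannot shrink further; a single prompt does not fit
            # With PyTorch, torch.cuda.empty_cache() would go here before retrying.
            batch_size = max(1, batch_size // 2)
    return results, batch_size


def fake_generate(batch):
    # Stand-in for model.generate(): pretends batches above 2 exhaust memory.
    if len(batch) > 2:
        raise MemoryError("simulated OOM")
    return [p.upper() for p in batch]


outputs, final_size = run_with_backoff(fake_generate, ["a", "b", "c", "d", "e"], batch_size=8)
print(outputs, final_size)  # ['A', 'B', 'C', 'D', 'E'] 2
```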
Key Takeaways
- Move models and inputs to GPU with .to("cuda") for a speed-up of roughly 5× or more over CPU.
- Use float16 to save 50% VRAM relative to float32 with minimal quality loss.
- Batch multiple prompts together to increase GPU utilization by 2–3×.
- Monitor memory with torch.cuda.memory_allocated() and nvidia-smi.
- Use device_map="auto" for multi-GPU support without manual distribution code.
Frequently Asked Questions
Is float16 always better than float32?
Not always. Float16 is faster and uses less memory, but can be numerically unstable for some operations. Most transformers handle it fine; older or exotic architectures may need float32. Test your model to be sure.
Can I use multiple GPUs with different memory sizes?
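Yes, but device_map="auto" may distribute the layers unevenly, underutilizing one of the GPUs. For more control, pass a max_memory budget per device (for example max_memory={0: "20GiB", 1: "10GiB"}) alongside device_map="auto", or supply an explicit device map. Note that device_map={"": 0} places the entire model on GPU 0, not just its first layer.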
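One way to steer placement on mismatched GPUs is the max_memory argument that from_pretrained accepts together with device_map="auto"; it caps how much each device may hold. The sketch below builds such a budget from a list of card sizes. The helper name build_max_memory and the 2 GiB reserve are assumptions for illustration, and the from_pretrained call is shown only as a comment.

```python
def build_max_memory(gpu_mem_gib, reserve_gib=2):
    """Map GPU index -> budget string such as "22GiB", leaving reserve_gib
    free on each card for activations and the KV cache."""
    budget = {}
    for index, total in enumerate(gpu_mem_gib):
        usable = int(total) - reserve_gib
        if usable <= 0:
            raise ValueError(f"GPU {index} has no memory left after the reserve")
        budget[index] = f"{usable}GiB"
    return budget


max_memory = build_max_memory([24, 12])
print(max_memory)  # {0: '22GiB', 1: '10GiB'}

# Hypothetical usage with the tutorial's loader:
# model = AutoModelForCausalLM.from_pretrained(
#     model_name, device_map="auto", max_memory=max_memory, torch_dtype=torch.float16
# )
```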
How do I measure inference latency precisely?
Use torch.cuda.Event() to synchronize GPU and measure time:
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
output_ids = model.generate(**inputs, max_length=100)
end.record()
torch.cuda.synchronize()
print(f"Inference time: {start.elapsed_time(end) / 1000:.3f} seconds")
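A single timing is noisy, and the first call often includes one-time setup costs. The sketch below adds warm-up runs and repeated measurements around any callable. It is a generic wall-clock harness, not the CUDA-event approach above: the function name benchmark is illustrative, and on a GPU the sync argument would be torch.cuda.synchronize so that pending kernels are included in each reading.

```python
import statistics
import time


def benchmark(fn, warmup=2, iters=5, sync=None):
    """Time fn() over several runs and return the latencies in seconds.

    sync is an optional callable run before each clock read; on a GPU it
    would be torch.cuda.synchronize."""
    def now():
        if sync is not None:
            sync()
        return time.perf_counter()

    for _ in range(warmup):  # warm-up runs absorb one-time setup costs
        fn()
    latencies = []
    for _ in range(iters):
        start = now()
        fn()
        latencies.append(now() - start)
    return latencies


# Stand-in workload; in the tutorial this would wrap model.generate(...).
latencies = benchmark(lambda: sum(range(100_000)))
print(f"median {statistics.median(latencies) * 1000:.2f} ms, "
      f"max {max(latencies) * 1000:.2f} ms")
```

Reporting the median alongside the maximum gives a steadier picture than any single run.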
Can I use a GPU for inference and CPU for other tasks?
Yes. CUDA kernels launch asynchronously, so the CPU can continue with other work while the GPU runs. Be aware of synchronization points where the CPU waits for the GPU: explicit calls to torch.cuda.synchronize(), and implicit ones such as .item(), .cpu(), or printing a GPU tensor.
What's the difference between DP (DataParallel) and DDP (DistributedDataParallel)?
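DP runs in a single process: on every forward pass it replicates the model to each GPU, scatters the batch, and gathers the outputs on GPU 0, which becomes a bottleneck (slow). DDP runs one independent process per GPU and synchronizes gradients with peer-to-peer all-reduce during training (fast). Use DDP for production; DP is simpler for quick experiments with small batches. Both are data-parallel, so each GPU must hold a full copy of the model.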