GPU Inference with PyTorch and Python

GPU inference can raise LLM generation speed from roughly 10–20 tokens/second on a CPU to 50–100 tokens/second or more on a GPU; exact figures depend on the model, precision, and hardware. PyTorch is the standard deep learning framework for this task, offering straightforward GPU abstractions, automatic mixed precision (AMP), and multi-GPU distribution.

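This tutorial covers GPU memory management, mixed-precision inference, performance monitoring, troubleshooting out-of-memory errors, and multi-GPU strategies. The examples target NVIDIA GPUs through CUDA; the same ideas carry over to AMD GPUs (ROCm builds of PyTorch expose the same torch.cuda API) and Apple GPUs (the "mps" device). By the end, you'll know how to maximize throughput and minimize latency.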

GPU Setup and Verification

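Ensure an NVIDIA driver is installed and that your PyTorch build is CUDA-enabled. The pip wheels bundle the CUDA runtime, so a separate CUDA toolkit install is not required:

# Check GPU availability
nvidia-smi # Lists GPUs, driver and CUDA versions, memory use, and utilization

# Install PyTorch with CUDA 12.1 support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Or for CPU-only (for testing)
pip install torch --index-url https://download.pytorch.org/whl/cpu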

Verify PyTorch sees your GPU:

import torch

print(f"GPU available: {torch.cuda.is_available()}")
print(f"GPU device: {torch.cuda.get_device_name(0)}")
print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
print(f"CUDA version: {torch.version.cuda}")

# Allocate a tensor on GPU
x = torch.randn(1000, 1000, device="cuda")
print(f"Tensor created on {x.device}")

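Representative output for an RTX 4090 (exact values vary by driver and system; because the script divides bytes by 1e9, a 24 GiB card reports roughly 25 GB):

GPU available: True
GPU device: NVIDIA GeForce RTX 4090
GPU memory: 25.4 GB
CUDA version: 12.1
Tensor created on cuda:0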
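
Because the introduction also mentions AMD and Apple GPUs, a small device-selection helper can keep later examples portable. The following is a minimal sketch, not part of the original tutorial: pick_device is a hypothetical helper name, and it assumes a PyTorch version that exposes torch.backends.mps (1.12 or later). ROCm builds of PyTorch report AMD GPUs through torch.cuda, so they take the first branch.

```python
import torch


def pick_device() -> torch.device:
    """Return the best available device: CUDA/ROCm, then Apple MPS, then CPU."""
    if torch.cuda.is_available():  # NVIDIA CUDA, or AMD through ROCm builds
        return torch.device("cuda")
    if torch.backends.mps.is_available():  # Apple Silicon GPUs
        return torch.device("mps")
    return torch.device("cpu")  # fallback so the script still runs without a GPU


device = pick_device()
x = torch.randn(4, 4, device=device)
print(f"Using {device}; tensor lives on {x.device}")
```

Passing the returned device to .to(device) in place of a hard-coded "cuda" string lets the same script run on a laptop without a GPU.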

Moving Models to GPU

Move a model and its inputs to GPU before inference:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "mistral-community/Mistral-7B-Instruct-v0.3"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load on CPU first, then move to GPU
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model = model.to(device) # Move to GPU

# Prepare input on the same device
prompt = "What is machine learning?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate output (runs on GPU)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=100)

response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(response)

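Keep the model and its inputs on the same device; otherwise PyTorch raises a RuntimeError beginning "Expected all tensors to be on the same device". Calling .to(device) copies the model's parameters, or the tokenizer's output tensors, into that device's memory.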

Mixed Precision for Speed and Memory Savings

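Mixed precision uses lower-precision formats (float16, bfloat16) for computation where possible and float32 only when necessary. The simplest variant for inference is to load all weights in half precision, as below, which halves weight memory relative to float32. Overall VRAM typically drops by 30–50% and speed improves by 10–20% or more, depending on the GPU: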

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "mistral-community/Mistral-7B-Instruct-v0.3"

# Load model in float16 (half precision)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16  # 16-bit instead of 32-bit
)
model = model.cuda() # Move to GPU

tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Explain photosynthesis."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=100)

response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(response)

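Memory reduction: Mistral-7B has about 7.2 billion parameters, so its weights need roughly 29 GB in float32 (4 bytes each) and roughly 14.5 GB in float16 (2 bytes each). Modern NVIDIA GPUs (V100+, RTX 20 series+) have Tensor Cores with hardware support for float16, making half precision both faster and more memory-efficient.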
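
The arithmetic behind these figures is parameter count times bytes per parameter. The sketch below illustrates it; weight_memory_gb is a hypothetical helper, the parameter count of 7.24 billion is an approximation, and the result covers weights only (activations and the KV cache come on top).

```python
# Bytes used to store one parameter in each format
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Assumption: Mistral-7B has roughly 7.24 billion parameters
mistral_params = 7.24e9
for dtype in ("float32", "float16", "int4"):
    print(f"{dtype:>8}: {weight_memory_gb(mistral_params, dtype):.1f} GB")
```

The same function gives a quick feasibility check for other models: a 70-billion-parameter model needs about 140 GB in float16.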

Automatic Mixed Precision (AMP) with Autocast

For custom inference loops, use torch.autocast() to choose precision automatically for each operation:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "mistral-community/Mistral-7B-Instruct-v0.3"
# Weights stay in float32 here (roughly 29 GB for this model), so the GPU needs enough memory
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "What is deep learning?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Use autocast for automatic mixed precision
with torch.no_grad():
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        output_ids = model.generate(**inputs, max_length=100)

response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(response)

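Autocast runs matrix multiplications (including those inside attention) in float16 and keeps precision-sensitive operations such as softmax and layer normalization in float32. Because the weights remain in float32, this uses more memory than loading the model in float16 and is usually not faster; its advantage is numerical stability without converting the model.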
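
The per-operation behaviour is easy to observe on a single layer. This is a minimal sketch rather than the tutorial's own code: it uses a small nn.Linear instead of an LLM, and it assumes bfloat16 as the low-precision type when no CUDA device is present so that it also runs on CPU.

```python
import torch
from torch import nn

device_type = "cuda" if torch.cuda.is_available() else "cpu"
# Assumption: float16 on CUDA, bfloat16 on CPU (the usual CPU autocast dtype)
low_precision = torch.float16 if device_type == "cuda" else torch.bfloat16

layer = nn.Linear(16, 16).to(device_type)  # parameters are created in float32
x = torch.randn(2, 16, device=device_type)

with torch.no_grad(), torch.autocast(device_type=device_type, dtype=low_precision):
    y = layer(x)  # the matrix multiply runs in the low-precision dtype

print(f"weight dtype: {layer.weight.dtype}")
print(f"output dtype: {y.dtype}")
```

The stored weights keep their float32 dtype while the output comes back in the lower precision, which is why autocast does not reduce the memory taken by the weights.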

Multi-GPU Inference

Distribute a model across multiple GPUs to run models that do not fit on a single card:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "mistral-community/Mistral-7B-Instruct-v0.3"

# Load with device_map="auto" to distribute across GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",  # Automatically split across available GPUs
    torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompts = [
    "What is Python?",
    "Explain neural networks.",
    "How does Git work?"
]

# Batch inference
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # Decoder-only models should be left-padded for generation
# Move inputs to the device that holds the model's first layers
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=100)

for prompt, output_id in zip(prompts, output_ids):
    response = tokenizer.decode(output_id, skip_special_tokens=True)
    print(f"Q: {prompt}\nA: {response}\n")

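With two GPUs, device_map="auto" places consecutive layers on different devices so that a model too large for one card can still run. The GPUs work one after another on each forward pass, so this mainly adds memory capacity and does not by itself double throughput. Two RTX 4090s (48 GB combined) comfortably hold a 7B or 13B model in float16; a 70B model needs about 140 GB in float16 and fits in 48 GB only with roughly 4-bit quantization.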

Batch Processing for Higher Throughput

Batching multiple examples together increases GPU utilization:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "mistral-community/Mistral-7B-Instruct-v0.3"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # Decoder-only models should be left-padded for generation

# Batch size = 4
batch_size = 4
prompts = ["Question 1?", "Question 2?", "Question 3?", "Question 4?"]

inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True).to("cuda")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=100, num_beams=1)

responses = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
for i, response in enumerate(responses):
    print(f"{i+1}. {response}")

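Batch size depends on model size and GPU memory. Mistral-7B in float16 needs about 14.5 GB for weights alone, so a 12 GB RTX 4070 requires quantization first, after which batch sizes of 2–4 are reasonable; on an RTX 4090 (24 GB), batch sizes of 8–16 are typical for short sequences. Throughput grows roughly linearly with batch size at first, then levels off once the GPU becomes compute-bound or memory runs out.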
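
Beyond the weights, the memory that grows with batch size is mostly the key/value cache kept during generation. The sketch below estimates it; kv_cache_bytes is a hypothetical helper, and the Mistral-7B settings (32 layers, 8 key/value heads, head dimension 128, float16 values) are assumptions taken from the published model configuration.

```python
def kv_cache_bytes(batch_size: int, seq_len: int, num_layers: int,
                   num_kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for every layer and token."""
    # Factor of 2: one tensor for keys and one for values
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Assumed Mistral-7B settings: 32 layers, 8 key/value heads, head dimension 128
MISTRAL_7B = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for batch_size in (1, 4, 16):
    gib = kv_cache_bytes(batch_size, seq_len=4096, **MISTRAL_7B) / 1024**3
    print(f"batch {batch_size:>2}: {gib:.2f} GiB of KV cache at 4096 tokens")
```

At the 100-token limit used in the examples the cache is small, but at long context lengths it can rival the weights, which is why the workable batch size shrinks as sequences get longer.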

Memory Profiling and Optimization

Monitor GPU memory usage:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistral-community/Mistral-7B-Instruct-v0.3"

# Empty cache before loading
torch.cuda.empty_cache()
print(f"Before load: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda()
print(f"After model load: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Test"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
print(f"After input load: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=100)
print(f"After generate: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

# Clear memory
del model, inputs, output_ids
torch.cuda.empty_cache()
print(f"After cleanup: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

Use nvidia-smi in a separate terminal to monitor real-time GPU memory:

watch -n 0.1 nvidia-smi  # Updates every 0.1 seconds
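
torch.cuda.memory_allocated() is a snapshot, so it can miss the peak reached in the middle of a call. A possible complement is to record the peak with torch.cuda.max_memory_allocated(). The context manager below is a sketch under that assumption: track_peak_memory is a hypothetical name, and it does nothing when CUDA is unavailable.

```python
import contextlib

import torch


@contextlib.contextmanager
def track_peak_memory(label: str):
    """Print the peak GPU memory allocated inside the block (no-op without CUDA)."""
    if not torch.cuda.is_available():
        yield
        return
    torch.cuda.reset_peak_memory_stats()  # start a fresh peak measurement
    start = torch.cuda.memory_allocated()
    try:
        yield
    finally:
        torch.cuda.synchronize()  # make sure queued kernels have finished
        peak = torch.cuda.max_memory_allocated()
        print(f"{label}: peak {peak / 1e9:.2f} GB (+{(peak - start) / 1e9:.2f} GB)")


device = "cuda" if torch.cuda.is_available() else "cpu"
with track_peak_memory("matmul"):
    a = torch.randn(1024, 1024, device=device)
    b = a @ a
```

Wrapping the model.generate(...) call in the same way shows how much headroom a given batch size leaves.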

Troubleshooting Out-of-Memory (OOM) Errors

Issue: RuntimeError: CUDA out of memory

Solutions (in order of ease):

  1. Clear cache: Delete models or tensors you no longer need, then call torch.cuda.empty_cache() before loading the model.
  2. Reduce batch size: Change batch_size from 4 to 2 or 1.
  3. Use float16: Load with torch_dtype=torch.float16.
  4. Use quantization: Load with 4-bit quantization (about 4 GB for a 7B model vs. about 14 GB in FP16).
  5. Switch to a smaller model: Use Phi-3-mini (3.8B, about 8 GB in FP16) instead of Mistral-7B (7B, about 14 GB in FP16).
  6. Add another GPU: Two RTX 4070s (12 GB each) give 24 GB combined when the model is split with device_map="auto".
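
The batch-size step can be automated by retrying with a smaller batch whenever an out-of-memory error occurs. This is a minimal sketch, not a definitive implementation: run_with_backoff and fake_generate are hypothetical names, and the stand-in function simulates the error so the example runs without a GPU. It relies on PyTorch reporting GPU OOM as a RuntimeError whose message mentions "out of memory".

```python
from typing import Callable, List, Sequence


def run_with_backoff(run_batch: Callable[[Sequence[str]], List[str]],
                     prompts: Sequence[str], batch_size: int) -> List[str]:
    """Process prompts in batches, halving the batch size after each OOM error."""
    results: List[str] = []
    i = 0
    while i < len(prompts):
        batch = prompts[i:i + batch_size]
        try:
            results.extend(run_batch(batch))
            i += len(batch)
        except RuntimeError as err:
            # Re-raise anything that is not an OOM, or an OOM at batch size 1
            if "out of memory" not in str(err) or batch_size == 1:
                raise
            batch_size = max(1, batch_size // 2)
            # With a real model, torch.cuda.empty_cache() could be called here
    return results


# Stand-in for a real generate call: pretends batches larger than 2 do not fit
def fake_generate(batch: Sequence[str]) -> List[str]:
    if len(batch) > 2:
        raise RuntimeError("CUDA out of memory (simulated)")
    return [p.upper() for p in batch]


answers = run_with_backoff(fake_generate, ["a", "b", "c", "d", "e"], batch_size=4)
print(answers)
```

In real use, run_batch would wrap the tokenizer and model.generate calls shown earlier.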

Key Takeaways

  • Move models and inputs to GPU with .to("cuda") for a several-fold speed-up over CPU (roughly 5× using the figures in the introduction).
  • Load weights in float16 to use about 50% less VRAM than float32, with minimal quality loss.
  • Batch multiple prompts together to raise throughput, often by 2–3×.
  • Monitor memory with torch.cuda.memory_allocated() and nvidia-smi.
  • Use device_map="auto" for multi-GPU support without manual distribution code.

Frequently Asked Questions

Is float16 always better than float32?

Not always. Float16 is faster and uses less memory, but can be numerically unstable for some operations. Most transformers handle it fine; older or exotic architectures may need float32. Test your model to be sure.

Can I use multiple GPUs with different memory sizes?

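Yes, but device_map="auto" may distribute layers unevenly and underuse one of the GPUs. Passing max_memory (for example, max_memory={0: "20GiB", 1: "10GiB"}) alongside device_map="auto" caps what each GPU receives, and an explicit device_map dictionary that maps module names to device indices gives full control. Note that device_map={"": 0} places the entire model on GPU 0.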
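
One way to steer an uneven split is the max_memory argument that from_pretrained accepts together with device_map. The sketch below builds such a dictionary; build_max_memory is a hypothetical helper, the 2 GiB headroom is an arbitrary assumption, and the commented-out call shows assumed usage that is not executed here.

```python
from typing import Dict


def build_max_memory(gpu_gib: Dict[int, int], headroom_gib: int = 2) -> Dict[int, str]:
    """Per-GPU weight budgets that leave headroom for activations and the KV cache."""
    return {gpu: f"{max(total - headroom_gib, 1)}GiB" for gpu, total in gpu_gib.items()}


# Hypothetical machine: GPU 0 has 24 GiB, GPU 1 has 12 GiB
max_memory = build_max_memory({0: 24, 1: 12})
print(max_memory)

# Assumed usage with transformers (not executed here):
# model = AutoModelForCausalLM.from_pretrained(
#     model_name, device_map="auto", max_memory=max_memory, torch_dtype=torch.float16
# )
```

Leaving headroom on each card matters because generation allocates memory beyond the weights themselves.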

How do I measure inference latency precisely?

Use torch.cuda.Event() to synchronize GPU and measure time:

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
output_ids = model.generate(**inputs, max_length=100)
end.record()
torch.cuda.synchronize()
print(f"Inference time: {start.elapsed_time(end) / 1000:.3f} seconds")
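
A single timed run includes one-time costs such as kernel loading, so averaging over several runs after a warm-up gives steadier numbers. The helper below is a sketch of that idea using wall-clock time with explicit synchronization instead of CUDA events; time_call is a hypothetical name, and the matrix multiply stands in for model.generate so the example runs without a model.

```python
import time

import torch


def time_call(fn, warmup: int = 2, iters: int = 5) -> float:
    """Return the mean wall-clock seconds per call of fn()."""
    for _ in range(warmup):  # warm-up runs absorb one-time setup costs
        fn()
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        torch.cuda.synchronize()  # drain queued work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if use_cuda:
        torch.cuda.synchronize()  # wait for the GPU before stopping the clock
    return (time.perf_counter() - start) / iters


device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(512, 512, device=device)
seconds = time_call(lambda: a @ a)
print(f"Mean time per matmul: {seconds * 1e3:.3f} ms")
```

Without the synchronize calls, the timer on a GPU would measure only how long it takes to queue the work, not to finish it.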

Can I use a GPU for inference and CPU for other tasks?

Yes. CUDA kernel launches are asynchronous by default: the CPU queues work and continues while the GPU executes it. Be aware of synchronization points where the CPU waits for the GPU, such as torch.cuda.synchronize(), .item(), or copying a tensor back with .cpu().

What's the difference between DP (DataParallel) and DDP (DistributedDataParallel)?

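Both are data-parallel wrappers designed mainly for training. DP runs in a single process, replicates the model onto every GPU on each forward pass, and gathers outputs on GPU 0, which becomes a bottleneck (slow). DDP runs one process per GPU and synchronizes gradients with all-reduce communication (fast). PyTorch recommends DDP even on a single machine; DP needs fewer code changes but scales poorly.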
