GPU Training with PyTorch on CUDA
GPU training with PyTorch can dramatically accelerate model training, often by roughly 10–50x over CPU depending on the architecture, batch size, and hardware. CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform, and PyTorch uses it to offload tensor operations to NVIDIA GPUs. Mastering device management, memory optimization, and mixed-precision training is what lets you use GPU hardware fully for production deep learning.
GPU availability and device management
Check GPU availability, move tensors and models to devices, and monitor memory usage.
Checking CUDA availability and GPU properties
import torch
# Check if CUDA is available
print(f"CUDA available: {torch.cuda.is_available()}")
# Check number of GPUs
num_gpus = torch.cuda.device_count()
print(f"Number of GPUs: {num_gpus}")
# Query the current device and per-GPU properties (these calls need a CUDA device)
if torch.cuda.is_available():
    current_device = torch.cuda.current_device()
    print(f"Current device index: {current_device}")
    # Get GPU properties
    for i in range(num_gpus):
        gpu_name = torch.cuda.get_device_name(i)
        gpu_memory = torch.cuda.get_device_properties(i).total_memory / 1e9
        print(f"GPU {i}: {gpu_name}, Memory: {gpu_memory:.2f} GB")
# Get device object
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
Moving models and data to GPU
import torch
import torch.nn as nn
# Create model and move to GPU
model = nn.Sequential(
    nn.Linear(1000, 512),
    nn.ReLU(),
    nn.Linear(512, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)
# Verify model is on GPU
print(f"Model device: {next(model.parameters()).device}")
# Create data and move to same device
x = torch.randn(32, 1000)
y = torch.randint(0, 10, (32,))
x = x.to(device)
y = y.to(device)
# Forward pass on GPU
output = model(x)
print(f"Output device: {output.device}")
# Shorthand methods
model.cuda() # Equivalent to model.to('cuda')
x = x.cuda() # x = x.to('cuda')
y = y.cuda()
# Move back to CPU
model.cpu()
x = x.cpu()
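When a tensor is created from scratch rather than loaded from disk, it can also be allocated directly on the target device, which avoids building it on the CPU and copying it over afterwards. A minimal sketch, assuming the same `device` fallback pattern used above (the shapes are arbitrary):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Allocate directly on the target device instead of creating on CPU and copying
x = torch.randn(32, 1000, device=device)
y = torch.randint(0, 10, (32,), device=device)

# Tensors created with *_like factories inherit the device of their source tensor
mask = torch.zeros_like(x)

print(x.device, y.device, mask.device)
```

Passing `device=` at creation time is mostly a convenience for synthetic data and buffers; batches coming from a DataLoader still need an explicit `.to(device)`.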
Mixed precision training with Automatic Mixed Precision (AMP)
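Automatic mixed precision runs selected operations (such as matrix multiplications and convolutions) in float16 during the forward pass, keeps the model weights and numerically sensitive operations in float32, and scales the loss so that small gradient values do not underflow to zero in float16. On GPUs with Tensor Cores this can cut activation memory by up to about half and often speeds up training by 20–30% or more with minimal accuracy loss; the actual gains depend on the model and hardware.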
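The reason loss scaling matters can be seen without a GPU. The sketch below uses only the standard library's `struct` module to round values to float16; the gradient value and the scale factor of 2**16 are illustrative assumptions, not values taken from a real training run.

```python
import struct


def to_float16(value: float) -> float:
    """Round a Python float to the nearest IEEE float16 value and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


tiny_grad = 1e-8   # hypothetical gradient, below float16's smallest subnormal (~6e-8)
scale = 65536.0    # 2**16, an illustrative loss-scale factor

unscaled = to_float16(tiny_grad)        # underflows to zero in float16
scaled = to_float16(tiny_grad * scale)  # large enough to be representable
recovered = scaled / scale              # unscale in higher precision

print(f"float16 without scaling: {unscaled}")
print(f"float16 with scaling, then unscaled: {recovered:.3e}")
```

Multiplying the loss by a constant multiplies every gradient by the same constant, so scaling before the backward pass and dividing afterwards preserves the update while keeping small gradients out of the underflow range.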
Implementing mixed precision with GradScaler
import torch
import torch.nn as nn
import torch.optim as optim
from torch.cuda.amp import autocast, GradScaler
# Model and optimizer
model = nn.Sequential(
    nn.Linear(1000, 512),
    nn.ReLU(),
    nn.Linear(512, 10)
).cuda()
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
# GradScaler handles loss scaling for float16 stability
scaler = GradScaler()
# Synthetic data
x = torch.randn(64, 1000).cuda()
y = torch.randint(0, 10, (64,)).cuda()
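# Training step with mixed precision
optimizer.zero_grad()
with autocast():  # Eligible ops run in float16 inside this block
    output = model(x)
    loss = criterion(output, y)
# Scale the loss so small gradients do not underflow in float16, then backpropagate
scaler.scale(loss).backward()
# Optional: unscale gradients in place (only needed before inspecting or clipping them)
scaler.unscale_(optimizer)
# step() skips the update if gradients contain inf/NaN; update() adjusts the scale factor
scaler.step(optimizer)
scaler.update()
print(f"Loss (mixed precision): {loss.item():.4f}")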
# Usage in a full training loop
num_epochs = 3
for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0
    for i in range(10):  # Mini batches
        x_batch = torch.randn(64, 1000).cuda()
        y_batch = torch.randint(0, 10, (64,)).cuda()
        # Mixed precision forward
        with autocast():
            output = model(x_batch)
            loss = criterion(output, y_batch)
        optimizer.zero_grad()
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        epoch_loss += loss.item()
    print(f"Epoch {epoch + 1}: Loss = {epoch_loss / 10:.4f}")
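The explicit `scaler.unscale_(optimizer)` call earns its place when gradients are clipped, because clipping must see the true (unscaled) gradient magnitudes. The sketch below is one way to combine the two, assuming a clipping threshold of `max_norm=1.0` chosen purely for illustration. It falls back to full precision on CPU by disabling both autocast and the scaler. Recent PyTorch releases emit a deprecation warning for the `torch.cuda.amp` import used throughout this article and point to the equivalent classes under `torch.amp`.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.cuda.amp import GradScaler

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

model = nn.Sequential(nn.Linear(1000, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
scaler = GradScaler(enabled=use_cuda)  # acts as a pass-through when disabled

x = torch.randn(64, 1000, device=device)
y = torch.randint(0, 10, (64,), device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device.type, enabled=use_cuda):
    loss = criterion(model(x), y)
scaler.scale(loss).backward()

# Bring gradients back to their true scale before measuring or clipping them
scaler.unscale_(optimizer)
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# step() knows the gradients were already unscaled and will not unscale them twice
scaler.step(optimizer)
scaler.update()
print(f"Loss: {loss.item():.4f}, gradient norm before clipping: {grad_norm.item():.4f}")
```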
Memory optimization and monitoring
Monitor and optimize GPU memory usage to fit larger models and batches.
| Technique | Effect | When to Use |
|---|---|---|
| torch.cuda.empty_cache() | Release cached, unused memory back to the GPU driver (live tensors are not freed) | After inference, between epochs |
| Gradient checkpointing | Trade memory for computation | Large models, limited GPU RAM |
| Gradient accumulation | Process smaller batches, update less often | Simulate large batch sizes |
| Lower precision (float16) | Up to about 50% memory reduction for tensors stored in float16 | Mixed precision training |
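Gradient checkpointing from the table above can be sketched with `torch.utils.checkpoint.checkpoint`, which discards a block's intermediate activations after the forward pass and recomputes them during the backward pass. The `CheckpointedMLP` class and its layer sizes are hypothetical, chosen only to keep the example small.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class CheckpointedMLP(nn.Module):
    """Hypothetical model whose two hidden blocks are recomputed during backward."""

    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(1000, 512), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(512, 512), nn.ReLU())
        self.head = nn.Linear(512, 10)

    def forward(self, x):
        # Activations inside each checkpointed block are not stored for backward
        x = checkpoint(self.block1, x, use_reentrant=False)
        x = checkpoint(self.block2, x, use_reentrant=False)
        return self.head(x)


model = CheckpointedMLP().to(device)
x = torch.randn(32, 1000, device=device)
y = torch.randint(0, 10, (32,), device=device)

loss = nn.functional.cross_entropy(model(x), y)
loss.backward()  # block1 and block2 run a second time here to rebuild activations
print(f"Loss: {loss.item():.4f}")
```

The saving grows with the depth and activation size of the checkpointed blocks; for a model this small the extra forward computation outweighs the benefit.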
Monitoring and managing memory
import torch
import torch.nn as nn
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Create a memory-intensive model
model = nn.Sequential(
    nn.Linear(10000, 5000),
    nn.ReLU(),
    nn.Linear(5000, 2000),
    nn.ReLU(),
    nn.Linear(2000, 10)
).to(device)
# Check allocated and reserved memory
allocated = torch.cuda.memory_allocated() / 1e9
reserved = torch.cuda.memory_reserved() / 1e9
print(f"Allocated: {allocated:.2f} GB, Reserved: {reserved:.2f} GB")
# Forward pass
x = torch.randn(256, 10000).to(device)
output = model(x)
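# Memory after forward pass
allocated = torch.cuda.memory_allocated() / 1e9
print(f"After forward: Allocated: {allocated:.2f} GB")
# Drop references first: empty_cache() only releases cached blocks that no tensor is using
del x, output
torch.cuda.empty_cache()
allocated = torch.cuda.memory_allocated() / 1e9
reserved = torch.cuda.memory_reserved() / 1e9
print(f"After cache clear: Allocated: {allocated:.2f} GB, Reserved: {reserved:.2f} GB")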
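The allocated figure reported for the weights can be sanity-checked with plain arithmetic. The sketch below counts the parameters of the three `nn.Linear` layers in the example model above and converts them to bytes; it covers weights only, not activations, gradients, or optimizer state.

```python
# Layer shapes of the example model above: (in_features, out_features)
layers = [(10000, 5000), (5000, 2000), (2000, 10)]

# Each nn.Linear holds a weight matrix plus a bias vector
num_params = sum(n_in * n_out + n_out for n_in, n_out in layers)

bytes_fp32 = num_params * 4  # float32 uses 4 bytes per value
bytes_fp16 = num_params * 2  # float16 uses 2 bytes per value

print(f"Parameters: {num_params:,}")                    # Parameters: 60,027,010
print(f"float32 weights: {bytes_fp32 / 1e9:.2f} GB")    # float32 weights: 0.24 GB
print(f"float16 weights: {bytes_fp16 / 1e9:.2f} GB")    # float16 weights: 0.12 GB
```

During training the footprint is several times larger: gradients match the weights in size, and Adam keeps two additional values per parameter.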
Gradient accumulation for simulating large batches
import torch
import torch.nn as nn
import torch.optim as optim
model = nn.Linear(100, 10).cuda()
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
# Simulate batch size 256 with 4 accumulation steps (use batch 64 each)
accumulation_steps = 4
for epoch in range(2):
    epoch_loss = 0
    optimizer.zero_grad()  # Zero once per accumulation cycle
    for step in range(16):  # 16 mini-batches
        x = torch.randn(64, 100).cuda()
        y = torch.randint(0, 10, (64,)).cuda()
        # Forward pass
        output = model(x)
        loss = criterion(output, y) / accumulation_steps  # Scale loss
        # Backward (accumulates gradients)
        loss.backward()
        # Update after N accumulation steps
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
        epoch_loss += loss.item() * accumulation_steps
    print(f"Epoch {epoch + 1}: Loss = {epoch_loss / 16:.4f}")
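Dividing each loss by `accumulation_steps` is what makes the accumulated gradient match a single large batch. The check below is a small sketch of that equivalence on CPU, assuming equal-sized micro-batches and a model without batch-dependent layers such as BatchNorm.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(100, 10)         # CPU is enough for this check
criterion = nn.CrossEntropyLoss()  # default reduction is the batch mean

x = torch.randn(256, 100)
y = torch.randint(0, 10, (256,))

# Reference: one backward pass over the full batch of 256
model.zero_grad()
criterion(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulated: four micro-batches of 64, each loss divided by the step count
accumulation_steps = 4
model.zero_grad()
for x_chunk, y_chunk in zip(x.chunk(accumulation_steps), y.chunk(accumulation_steps)):
    loss = criterion(model(x_chunk), y_chunk) / accumulation_steps
    loss.backward()  # gradients add up in .grad across calls
accum_grad = model.weight.grad

print(torch.allclose(full_grad, accum_grad, atol=1e-6))
```

Without the division, the accumulated gradient would be `accumulation_steps` times larger than the full-batch gradient, which is equivalent to silently raising the learning rate.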
Multi-GPU training with DataParallel
Scale training across multiple GPUs using nn.DataParallel for simple parallelism.
Data parallelism across GPUs
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
# Model
model = nn.Sequential(
    nn.Linear(1000, 512),
    nn.ReLU(),
    nn.Linear(512, 10)
)
# Wrap with DataParallel (distributes batches across GPUs)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
    print(f"Using {torch.cuda.device_count()} GPUs")
model = model.cuda()
# Create data loader (batch size should be multiple of num_gpus)
X = torch.randn(1000, 1000)
y = torch.randint(0, 10, (1000,))
loader = DataLoader(
    TensorDataset(X, y),
    batch_size=128 * torch.cuda.device_count(),  # Effective batch
    shuffle=True
)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
# Training loop (unchanged; DataParallel handles GPU distribution)
for epoch in range(1):
    for batch_x, batch_y in loader:
        batch_x = batch_x.cuda()
        batch_y = batch_y.cuda()
        output = model(batch_x)
        loss = criterion(output, batch_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch + 1} complete")
# Note: DataParallel has overhead; for production, use DistributedDataParallel
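For comparison, a DistributedDataParallel step might look like the sketch below. It is deliberately reduced to a single process (`world_size=1`) so that it runs as an ordinary script. The helper name `ddp_train_step`, the temporary rendezvous file, and the fallback to the `gloo` backend on CPU are all assumptions made for illustration. A real multi-GPU job would start one process per GPU, for example with `torchrun` or `torch.multiprocessing.spawn`, and pass each process its own rank.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP


def ddp_train_step(rank: int, world_size: int, init_file: str) -> float:
    """Run one DDP training step in the calling process (hypothetical helper)."""
    use_cuda = torch.cuda.is_available()
    dist.init_process_group(
        backend="nccl" if use_cuda else "gloo",
        init_method=f"file://{init_file}",  # shared file used for rendezvous
        rank=rank,
        world_size=world_size,
    )
    try:
        if use_cuda:
            torch.cuda.set_device(rank)  # one GPU per process
        device = torch.device(f"cuda:{rank}" if use_cuda else "cpu")

        model = nn.Linear(1000, 10).to(device)
        ddp_model = DDP(model, device_ids=[rank] if use_cuda else None)
        optimizer = optim.Adam(ddp_model.parameters(), lr=0.001)

        x = torch.randn(64, 1000, device=device)
        y = torch.randint(0, 10, (64,), device=device)

        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(ddp_model(x), y)
        loss.backward()  # gradients are all-reduced across processes here
        optimizer.step()
        return loss.item()
    finally:
        dist.destroy_process_group()


init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
loss_value = ddp_train_step(rank=0, world_size=1, init_file=init_file)
print(f"DDP step loss: {loss_value:.4f}")
```

In a multi-process run each rank should also read a different shard of the data, which `torch.utils.data.distributed.DistributedSampler` is designed to provide.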
Performance optimization tips
Identify and eliminate bottlenecks to maximize GPU utilization.
Profiling and optimization strategies
import torch
import torch.nn as nn
import torch.optim as optim
import time
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(
    nn.Linear(2000, 1000),
    nn.ReLU(),
    nn.Linear(1000, 100)
).to(device)
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()
# Warm up GPU (the first iterations include one-time CUDA initialization costs)
for _ in range(5):
    x = torch.randn(128, 2000).to(device)
    _ = model(x)
# Time 100 training iterations; synchronize so pending GPU work is included
if device.type == "cuda":
    torch.cuda.synchronize()
start = time.perf_counter()
for i in range(100):
    x = torch.randn(128, 2000).to(device)
    y = torch.randint(0, 100, (128,)).to(device)
    output = model(x)
    loss = criterion(output, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
if device.type == "cuda":
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"Time for 100 iterations: {elapsed:.3f}s")
print(f"Throughput: {100 * 128 / elapsed:.0f} samples/sec")
# Check GPU memory usage
if torch.cuda.is_available():
    print(f"GPU Memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print(f"GPU Memory reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
Key Takeaways
- Move models and data to GPU with .to(device) or .cuda(), and verify placement with .device attribute checks.
- Mixed precision training with autocast() and GradScaler can reduce memory by up to about 50% and often speeds up training 20–30% with minimal accuracy loss.
- Monitor GPU memory with torch.cuda.memory_allocated() and release cached, unused memory with torch.cuda.empty_cache().
- Use gradient accumulation to simulate larger batch sizes when GPU memory is limited.
- For multi-GPU training, use nn.DataParallel for simplicity or nn.parallel.DistributedDataParallel for production scalability.
Frequently Asked Questions
Why is my GPU utilization low even with a large batch size?
GPU underutilization often stems from I/O bottlenecks (slow data loading). Increase num_workers in DataLoader, enable pin_memory=True, or pre-load data into GPU memory. Profile with torch.cuda.Event to measure kernel vs data-transfer time.
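The DataLoader settings mentioned in this answer could be wired up roughly as follows. The dataset is synthetic and the sizes are arbitrary. `num_workers` is left at 0 so the sketch runs anywhere, whereas a real input pipeline would usually raise it.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Synthetic stand-in for a real dataset
dataset = TensorDataset(torch.randn(1024, 1000), torch.randint(0, 10, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=128,
    shuffle=True,
    num_workers=0,        # 0 loads in the main process; try e.g. 4 for real datasets
    pin_memory=use_cuda,  # page-locked host memory allows faster, asynchronous copies
)

num_samples = 0
for batch_x, batch_y in loader:
    # non_blocking=True lets the copy overlap with GPU work when the source is pinned
    batch_x = batch_x.to(device, non_blocking=True)
    batch_y = batch_y.to(device, non_blocking=True)
    num_samples += batch_x.size(0)

print(f"Loaded {num_samples} samples on {device}")
```

On platforms that start worker processes with spawn (Windows and macOS by default), code using `num_workers > 0` should live under an `if __name__ == "__main__":` guard.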
What is the difference between DataParallel and DistributedDataParallel?
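DataParallel is simpler but slower: it runs in a single process, replicates the model on every forward pass, scatters each batch from the primary GPU, and gathers the outputs back onto it, which makes that GPU a bottleneck. DistributedDataParallel (DDP) runs one process per GPU, keeps a persistent model replica in each, and synchronizes only gradients during the backward pass, which typically scales much closer to linearly. Use DDP for production; DataParallel for prototyping.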
How do I use mixed precision without autocast()?
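Manually cast the model and its inputs: model = model.half() and x = x.half() for float16, then loss = loss.float() to cast back. This is verbose and error-prone, and without loss scaling small gradients can underflow in float16; autocast() together with GradScaler handles both automatically and is strongly preferred.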
Can I train on multiple GPUs without DataParallel?
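Yes, you can manually move tensors to different devices and aggregate gradients yourself, but this is complex. Prefer DataParallel, DistributedDataParallel, or a higher-level library such as Hugging Face Accelerate. With DDP, torch.distributed.optim.ZeroRedundancyOptimizer can additionally shard optimizer state across processes to save memory.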
What does torch.cuda.synchronize() do?
CUDA operations are asynchronous—the CPU returns before GPU computation finishes. synchronize() blocks until all GPU operations complete, essential for accurate timing measurements and debugging.
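One way to time GPU work without stalling the CPU around every call is to record CUDA events and synchronize once at the end. The helper below, `time_matmul_ms`, is a hypothetical example written for this answer; it falls back to wall-clock timing on machines without a GPU.

```python
import time

import torch


def time_matmul_ms(size: int = 1024, repeats: int = 10) -> float:
    """Return the average time in milliseconds of a square matmul (hypothetical helper)."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    a @ b  # warm-up so one-time initialization is not measured

    if device.type == "cuda":
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(repeats):
            a @ b
        end.record()
        torch.cuda.synchronize()  # wait until both events have actually completed
        return start.elapsed_time(end) / repeats

    # CPU fallback: operations are synchronous, so wall-clock timing is accurate
    t0 = time.perf_counter()
    for _ in range(repeats):
        a @ b
    return (time.perf_counter() - t0) * 1000 / repeats


elapsed_ms = time_matmul_ms(size=256, repeats=5)
print(f"Average matmul time: {elapsed_ms:.3f} ms")
```

Timing the same loop with `time.perf_counter()` but without a synchronize on GPU would mostly measure how long it takes to queue the kernels, not how long they take to run.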