
GPU Training with PyTorch on CUDA

GPU training with PyTorch dramatically accelerates model learning—often 10–50x faster than CPU depending on the architecture. CUDA (Compute Unified Device Architecture) enables PyTorch to offload tensor operations to NVIDIA GPUs. Mastering device management, memory optimization, and mixed-precision training unlocks the full potential of GPU hardware for production deep learning.

GPU availability and device management

Check GPU availability, move tensors and models to devices, and monitor memory usage.

Checking CUDA availability and GPU properties

import torch

# Check if CUDA is available
print(f"CUDA available: {torch.cuda.is_available()}")

# Check number of GPUs
num_gpus = torch.cuda.device_count()
print(f"Number of GPUs: {num_gpus}")

# Get current GPU device (this call raises an error when CUDA is unavailable)
if torch.cuda.is_available():
    current_device = torch.cuda.current_device()
    print(f"Current device index: {current_device}")

# Get GPU properties
for i in range(num_gpus):
    gpu_name = torch.cuda.get_device_name(i)
    gpu_memory = torch.cuda.get_device_properties(i).total_memory / 1e9
    print(f"GPU {i}: {gpu_name}, Memory: {gpu_memory:.2f} GB")

# Get device object
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Moving models and data to GPU

import torch
import torch.nn as nn

# Create model and move to GPU
model = nn.Sequential(
    nn.Linear(1000, 512),
    nn.ReLU(),
    nn.Linear(512, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Verify model is on GPU
print(f"Model device: {next(model.parameters()).device}")

# Create data and move to same device
x = torch.randn(32, 1000)
y = torch.randint(0, 10, (32,))

x = x.to(device)
y = y.to(device)

# Forward pass on GPU
output = model(x)
print(f"Output device: {output.device}")

# Shorthand methods (these raise an error on machines without CUDA)
model.cuda()  # Equivalent to model.to('cuda')
x = x.cuda()  # Equivalent to x = x.to('cuda')
y = y.cuda()

# Move back to CPU
model.cpu()
x = x.cpu()

Mixed precision training with Automatic Mixed Precision (AMP)

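Automatic mixed precision runs eligible operations, such as matrix multiplications and convolutions, in float16 while keeping the model weights and numerically sensitive operations in float32. A GradScaler multiplies the loss before the backward pass so that small gradients do not underflow in float16, then unscales the gradients before the optimizer step. Because float16 tensors take half the storage of float32 tensors, activation memory can drop by up to about 50%, and training often runs 20–30% faster with minimal accuracy loss; the actual gains depend on the model and the GPU.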
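To make the underflow problem concrete, here is a minimal sketch that runs on CPU. The gradient value 1e-8 is chosen purely for illustration, and the scale factor 65536 (2**16) is used on the assumption that it matches GradScaler's default initial scale.

```python
import torch

# An illustrative "gradient" that is representable in float32
tiny_grad = torch.tensor(1e-8, dtype=torch.float32)

# Cast directly to float16: the value is below float16's smallest
# subnormal (about 6e-8), so it is rounded to exactly 0.0
underflowed = tiny_grad.to(torch.float16)

# Scale first, then cast: the scaled value fits in float16's range
scale = 65536.0  # 2**16, assumed here to mirror GradScaler's initial scale
scaled = (tiny_grad * scale).to(torch.float16)

# Unscale in float32 to recover (approximately) the original value
recovered = scaled.to(torch.float32) / scale

print(f"Direct float16 cast: {underflowed.item()}")
print(f"Scaled, cast, unscaled: {recovered.item():.3e}")

# Storage per element: float16 uses half the bytes of float32
bytes_fp32 = torch.zeros(1, dtype=torch.float32).element_size()
bytes_fp16 = torch.zeros(1, dtype=torch.float16).element_size()
print(f"Bytes per element: float32={bytes_fp32}, float16={bytes_fp16}")
```

The halved element size is where the memory saving comes from. It applies to tensors actually stored in float16, which is why the overall reduction for a training run is usually smaller than a full 50%.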

Implementing mixed precision with GradScaler

import torch
import torch.nn as nn
import torch.optim as optim
from torch.cuda.amp import autocast, GradScaler

# Model and optimizer
model = nn.Sequential(
nn.Linear(1000, 512),
nn.ReLU(),
nn.Linear(512, 10)
).cuda()

optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# GradScaler handles loss scaling for float16 stability
scaler = GradScaler()

# Synthetic data
x = torch.randn(64, 1000).cuda()
y = torch.randint(0, 10, (64,)).cuda()

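# Training step with mixed precision
optimizer.zero_grad()
with autocast():  # Eligible ops run in float16 inside this block
    output = model(x)
    loss = criterion(output, y)

# Scale the loss so small gradients do not underflow, then backpropagate
scaler.scale(loss).backward()

# Unscale gradients (only needed to inspect or clip them), then step
scaler.unscale_(optimizer)
scaler.step(optimizer)  # Skips the update if gradients contain inf/NaN
scaler.update()  # Adjusts the scale factor for the next iteration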

print(f"Loss (mixed precision): {loss.item():.4f}")

# Usage in a full training loop
num_epochs = 3
for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0

    for i in range(10):  # Mini batches
        x_batch = torch.randn(64, 1000).cuda()
        y_batch = torch.randint(0, 10, (64,)).cuda()

        # Mixed precision forward
        with autocast():
            output = model(x_batch)
            loss = criterion(output, y_batch)

        optimizer.zero_grad()
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        epoch_loss += loss.item()

    print(f"Epoch {epoch + 1}: Loss = {epoch_loss / 10:.4f}")
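The explicit scaler.unscale_(optimizer) call matters mainly when gradients must be inspected or clipped before the step. The following is a minimal sketch of gradient clipping under AMP. The max_norm value and the CPU fallback (bfloat16 with AMP disabled) are assumptions added so the example runs without a GPU; they are not part of the original training loop.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
# Assumption: float16 on GPU; on CPU, AMP is disabled and the dtype is unused
amp_dtype = torch.float16 if use_cuda else torch.bfloat16

model = nn.Linear(100, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)
max_norm = 1.0  # Illustrative clipping threshold

x = torch.randn(64, 100, device=device)
y = torch.randint(0, 10, (64,), device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device.type, dtype=amp_dtype, enabled=use_cuda):
    loss = criterion(model(x), y)

scaler.scale(loss).backward()

# Gradients are still multiplied by the scale factor here. Unscale them
# first so the clipping threshold applies to the true gradient norm.
scaler.unscale_(optimizer)
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

scaler.step(optimizer)
scaler.update()

# Gradient norm after clipping (should not exceed max_norm)
clipped_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
print(f"Norm before clipping: {total_norm.item():.4f}")
print(f"Norm after clipping: {clipped_norm.item():.4f}")
```

Clipping without unscaling would compare the scaled gradients against max_norm, which would make the threshold effectively thousands of times too small.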

Memory optimization and monitoring

Monitor and optimize GPU memory usage to fit larger models and batches.

| Technique | Effect | When to Use |
|---|---|---|
| torch.cuda.empty_cache() | Releases cached, unused memory back to the GPU (live tensors are not freed) | After inference, between epochs |
| Gradient checkpointing | Trade memory for computation | Large models, limited GPU RAM |
| Gradient accumulation | Process smaller batches, update less often | Simulate large batch sizes |
| Lower precision (float16) | Up to 50% memory reduction for float16 tensors | Mixed precision training |
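Gradient checkpointing appears in the table but has no example here, so the following is a minimal sketch using torch.utils.checkpoint.checkpoint. The layer sizes are arbitrary, and use_reentrant=False assumes a reasonably recent PyTorch release. The checkpointed block does not store its intermediate activations; it recomputes them during the backward pass.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Illustrative model split into a checkpointed block and a head
block = nn.Sequential(
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
).to(device)
head = nn.Linear(256, 10).to(device)

x = torch.randn(32, 256, device=device)

# Regular forward/backward: all intermediate activations are kept
out_plain = head(block(x))
out_plain.sum().backward()
grad_plain = block[0].weight.grad.clone()

block.zero_grad()
head.zero_grad()

# Checkpointed forward/backward: activations inside `block` are
# discarded after the forward pass and recomputed during backward
hidden = checkpoint(block, x, use_reentrant=False)
out_ckpt = head(hidden)
out_ckpt.sum().backward()
grad_ckpt = block[0].weight.grad.clone()

# Both paths should produce the same gradients
grads_match = torch.allclose(grad_plain, grad_ckpt, atol=1e-5)
print(f"Gradients match: {grads_match}")
```

The cost is roughly one extra forward pass through each checkpointed block, so this is worth it mainly when activation memory is the limiting factor.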

Monitoring and managing memory

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create a memory-intensive model
model = nn.Sequential(
nn.Linear(10000, 5000),
nn.ReLU(),
nn.Linear(5000, 2000),
nn.ReLU(),
nn.Linear(2000, 10)
).to(device)

# Check allocated and reserved memory
allocated = torch.cuda.memory_allocated() / 1e9
reserved = torch.cuda.memory_reserved() / 1e9
print(f"Allocated: {allocated:.2f} GB, Reserved: {reserved:.2f} GB")

# Forward pass
x = torch.randn(256, 10000).to(device)
output = model(x)

# Memory after forward pass
allocated = torch.cuda.memory_allocated() / 1e9
print(f"After forward: Allocated: {allocated:.2f} GB")

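# Clear cache: releases cached, unused blocks; live tensors stay allocated
torch.cuda.empty_cache()
allocated = torch.cuda.memory_allocated() / 1e9
reserved = torch.cuda.memory_reserved() / 1e9
print(f"After cache clear: Allocated: {allocated:.2f} GB, Reserved: {reserved:.2f} GB")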

Gradient accumulation for simulating large batches

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(100, 10).cuda()
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Simulate batch size 256 with 4 accumulation steps (use batch 64 each)
accumulation_steps = 4

for epoch in range(2):
    epoch_loss = 0
    optimizer.zero_grad()  # Zero once per accumulation cycle

    for step in range(16):  # 16 mini-batches
        x = torch.randn(64, 100).cuda()
        y = torch.randint(0, 10, (64,)).cuda()

        # Forward pass
        output = model(x)
        loss = criterion(output, y) / accumulation_steps  # Scale loss

        # Backward (accumulates gradients)
        loss.backward()

        # Update after N accumulation steps
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

        epoch_loss += loss.item() * accumulation_steps

    print(f"Epoch {epoch + 1}: Loss = {epoch_loss / 16:.4f}")
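Dividing the loss by accumulation_steps is what makes the accumulated gradient match the gradient of one large batch. The sketch below checks this numerically on CPU, assuming a mean-reduced loss and equally sized mini-batches; the tensor shapes are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(100, 10)
criterion = nn.CrossEntropyLoss()  # Default reduction is the batch mean

x = torch.randn(256, 100)
y = torch.randint(0, 10, (256,))

# Gradient from a single batch of 256
model.zero_grad()
criterion(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Gradient accumulated over 4 mini-batches of 64
accumulation_steps = 4
model.zero_grad()
for x_chunk, y_chunk in zip(x.chunk(accumulation_steps), y.chunk(accumulation_steps)):
    loss = criterion(model(x_chunk), y_chunk) / accumulation_steps
    loss.backward()  # Adds to the existing .grad values
accum_grad = model.weight.grad.clone()

max_diff = (full_grad - accum_grad).abs().max().item()
print(f"Max gradient difference: {max_diff:.2e}")
```

Without the division, the accumulated gradient would be accumulation_steps times larger, which acts like an unintended increase in the learning rate.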

Multi-GPU training with DataParallel

Scale training across multiple GPUs using nn.DataParallel for simple parallelism.

Data parallelism across GPUs

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Model
model = nn.Sequential(
nn.Linear(1000, 512),
nn.ReLU(),
nn.Linear(512, 10)
)

# Wrap with DataParallel (distributes batches across GPUs)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
    print(f"Using {torch.cuda.device_count()} GPUs")

model = model.cuda()

# Create data loader (batch size should be multiple of num_gpus)
X = torch.randn(1000, 1000)
y = torch.randint(0, 10, (1000,))
loader = DataLoader(
    TensorDataset(X, y),
    batch_size=128 * torch.cuda.device_count(),  # Effective batch
    shuffle=True
)

optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Training loop (unchanged—DataParallel handles GPU distribution)
for epoch in range(1):
    for batch_x, batch_y in loader:
        batch_x = batch_x.cuda()
        batch_y = batch_y.cuda()

        output = model(batch_x)
        loss = criterion(output, batch_y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch + 1} complete")

# Note: DataParallel has overhead; for production, use DistributedDataParallel

Performance optimization tips

Identify and eliminate bottlenecks to maximize GPU utilization.

Profiling and optimization strategies

import torch
import torch.nn as nn
import torch.optim as optim
import time

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(
nn.Linear(2000, 1000),
nn.ReLU(),
nn.Linear(1000, 100)
).to(device)

optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

# Warm up GPU
for _ in range(5):
    x = torch.randn(128, 2000).to(device)
    _ = model(x)

# Time 100 training iterations; synchronize so pending GPU work is included
if device.type == "cuda":
    torch.cuda.synchronize()
start = time.time()

for i in range(100):
    x = torch.randn(128, 2000).to(device)
    y = torch.randint(0, 100, (128,)).to(device)

    output = model(x)
    loss = criterion(output, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

if device.type == "cuda":
    torch.cuda.synchronize()
elapsed = time.time() - start

print(f"Time for 100 iterations: {elapsed:.3f}s")
print(f"Throughput: {100 * 128 / elapsed:.0f} samples/sec")

# Check GPU memory usage
if torch.cuda.is_available():
    print(f"GPU Memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print(f"GPU Memory reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

Key Takeaways

  • Move models and data to GPU with .to(device) or .cuda(), and verify placement with .device attribute checks.
  • Mixed precision training with autocast() and GradScaler can cut activation memory by up to about 50% and often speeds up training 20–30% with minimal accuracy loss; actual gains depend on the model and GPU.
  • Monitor GPU memory with torch.cuda.memory_allocated() and torch.cuda.memory_reserved(), and release cached but unused memory with torch.cuda.empty_cache().
  • Use gradient accumulation to simulate larger batch sizes when GPU memory is limited.
  • For multi-GPU training, use nn.DataParallel for simplicity or nn.parallel.DistributedDataParallel for production scalability.

Frequently Asked Questions

Why is my GPU utilization low even with a large batch size?

GPU underutilization often stems from I/O bottlenecks (slow data loading). Increase num_workers in DataLoader, enable pin_memory=True, or pre-load data into GPU memory. Profile with torch.cuda.Event to measure kernel vs data-transfer time.
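A minimal sketch of the data-loading settings mentioned above, assuming an in-memory TensorDataset as a stand-in for a real dataset. num_workers is left at 0 so the sketch runs anywhere; in a real job it would typically be raised, and on platforms that spawn worker processes the loop should sit under an `if __name__ == "__main__":` guard.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Stand-in dataset (assumption: real data would come from disk)
dataset = TensorDataset(torch.randn(512, 100), torch.randint(0, 10, (512,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=0,        # Raise this to load batches in parallel worker processes
    pin_memory=use_cuda,  # Page-locked host memory speeds up host-to-GPU copies
)

batches_seen = 0
for batch_x, batch_y in loader:
    # non_blocking=True lets the copy overlap with GPU compute when the
    # source tensor lives in pinned memory; on CPU it has no effect
    batch_x = batch_x.to(device, non_blocking=True)
    batch_y = batch_y.to(device, non_blocking=True)
    batches_seen += 1

print(f"Loaded {batches_seen} batches on {device}")
```

pin_memory and non_blocking work as a pair: pinning alone speeds up the copy, while non_blocking allows that copy to run concurrently with earlier GPU work.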

What is the difference between DataParallel and DistributedDataParallel?

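DataParallel is simpler but slower: it runs in a single process, replicates the model on every forward pass, scatters each batch across the GPUs, and gathers the outputs on the primary GPU, which becomes a bottleneck. DistributedDataParallel (DDP) runs one process per GPU, keeps a persistent model replica in each process, and synchronizes only gradients with all-reduce during the backward pass, which typically scales much better. Use DDP for production; DataParallel for prototyping.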
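For comparison with the DataParallel example earlier, here is a minimal DDP sketch. It is a simplified illustration, not a production setup: it runs a single process with world_size=1, falls back to the gloo backend on CPU, and hard-codes a rendezvous address and port. The function name run_ddp_step is hypothetical. Real jobs usually launch one process per GPU with a launcher such as torchrun, which supplies the rank and world size.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def run_ddp_step(rank: int, world_size: int) -> float:
    """Run one DDP training step in this process and return the loss."""
    # Assumed rendezvous settings for a single-machine demo
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    use_cuda = torch.cuda.is_available()
    backend = "nccl" if use_cuda else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    try:
        device = torch.device(f"cuda:{rank}" if use_cuda else "cpu")
        model = nn.Linear(100, 10).to(device)

        # Each process wraps its own replica; gradients are averaged
        # across processes during backward()
        ddp_model = DDP(model, device_ids=[rank] if use_cuda else None)
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

        x = torch.randn(64, 100, device=device)
        y = torch.randint(0, 10, (64,), device=device)

        loss = nn.functional.cross_entropy(ddp_model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()
    finally:
        dist.destroy_process_group()


loss_value = run_ddp_step(rank=0, world_size=1)
print(f"DDP step loss: {loss_value:.4f}")
```

In a multi-GPU job, each process would also use a DistributedSampler so that every rank sees a different shard of the dataset.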

How do I use mixed precision without autocast()?

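Manually cast the model and inputs: model = model.half() and x = x.half() for float16, then loss = loss.float() to cast back. This is verbose and error-prone, and it keeps no float32 copy of the weights; autocast() chooses a precision per operation automatically and is strongly preferred.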

Can I train on multiple GPUs without DataParallel?

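Yes. You can manually place tensors and submodules on different devices and aggregate gradients yourself, but this is complex. In practice, use DataParallel, DistributedDataParallel, or a higher-level library such as Hugging Face Accelerate. For large models, torch.distributed.optim.ZeroRedundancyOptimizer can additionally shard optimizer state across DDP processes.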

What does torch.cuda.synchronize() do?

CUDA operations are asynchronous: a kernel launch returns control to the CPU before the GPU finishes the computation. synchronize() blocks until all queued GPU operations on the current device complete, which is essential for accurate timing measurements and helpful when debugging.
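A minimal sketch of how the missing synchronize() distorts a timing measurement. The helper name timed_matmul and the matrix size are illustrative assumptions. On a CPU-only machine both measurements are equivalent, because CPU operations run synchronously.

```python
import time

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def timed_matmul(n: int, sync: bool) -> float:
    """Time one n x n matrix multiplication, optionally waiting for the GPU."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device.type == "cuda":
        torch.cuda.synchronize()  # Make sure setup work is finished first

    start = time.perf_counter()
    result = a @ b  # On GPU this only queues the kernel and returns
    if sync and device.type == "cuda":
        torch.cuda.synchronize()  # Wait until the kernel has actually run
    elapsed = time.perf_counter() - start

    assert result.shape == (n, n)
    return elapsed


timed_matmul(512, sync=True)  # Warm-up run, excluded from the comparison
launch_only = timed_matmul(512, sync=False)
with_sync = timed_matmul(512, sync=True)

print(f"Without synchronize: {launch_only * 1e3:.3f} ms")
print(f"With synchronize:    {with_sync * 1e3:.3f} ms")
```

On a GPU, the first number mostly reflects kernel launch overhead rather than compute time, which is why benchmarks that skip synchronize() tend to look unrealistically fast.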

Further Reading