
Autograd in PyTorch: Computing Gradients

Autograd is PyTorch's automatic differentiation engine: it computes gradients of tensor operations without requiring you to write derivative code by hand. When you set requires_grad=True on a tensor, PyTorch builds a dynamic computation graph during the forward pass and applies backpropagation to compute gradients during the backward pass. This mechanism is the backbone of neural network training because it removes the need to write gradient calculations manually.

Understanding PyTorch's autograd system

Autograd tracks all operations performed on tensors marked with requires_grad=True, building a directed acyclic graph (DAG) that records the computation history. During backpropagation, gradients flow backward through this graph using the chain rule of calculus. According to PyTorch's design documentation (2026), this dynamic computation graph approach allows flexible model architectures that can change based on input data.

Enabling gradient tracking

import torch

# Create a tensor and enable gradient tracking
x = torch.tensor([2.0, 3.0], requires_grad=True)
print(f"Requires grad: {x.requires_grad}") # True
print(f"Gradient attribute: {x.grad}") # None (no gradient yet)

# You can also enable grad tracking after creation
y = torch.tensor([4.0, 5.0])
y.requires_grad_(True) # In-place operation
print(f"After enabling: {y.requires_grad}") # True

# Create tensors without gradient tracking (more efficient for inference)
z = torch.tensor([1.0, 2.0], requires_grad=False)
print(f"No grad tracking: {z.requires_grad}") # False
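
One detail worth illustrating is that gradient tracking applies only to floating-point (and complex) tensors. The sketch below is a minimal illustration with made-up tensor values and variable names (a, b, int_error); it shows the error raised for an integer tensor and one way to work around it by converting the dtype first.

```python
import torch

# Floating-point tensors can be tracked.
a = torch.tensor([1.0, 2.0], requires_grad=True)
print(f"float tensor requires grad: {a.requires_grad}")

# Integer tensors cannot; PyTorch raises a RuntimeError.
int_error = None
try:
    torch.tensor([1, 2], requires_grad=True)
except RuntimeError as err:
    int_error = err
    print(f"RuntimeError: {err}")

# Converting to a floating dtype first makes tracking possible.
b = torch.tensor([1, 2]).float().requires_grad_(True)
print(f"converted tensor: dtype={b.dtype}, requires_grad={b.requires_grad}")
```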

Forward pass and computation graphs

During the forward pass, PyTorch automatically records operations on tensors with requires_grad=True, building a computation graph where each operation is a node with references to its inputs.

Building a computation graph

import torch

# Create input tensor with gradient tracking enabled
x = torch.tensor([3.0], requires_grad=True)

# Perform operations—PyTorch records them in the computation graph
y = x ** 2 # y = x^2
z = y * 2 + 1 # z = 2*y + 1 = 2*x^2 + 1

print(f"x: {x}")
print(f"y: {y}")
print(f"z: {z}")

# Check the computation history
print(f"y.grad_fn: {y.grad_fn}") # e.g. <PowBackward0 object at 0x...>
print(f"z.grad_fn: {z.grad_fn}") # e.g. <AddBackward0 object at 0x...>

# The computation graph is built implicitly—no manual graph construction needed
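
To make the recorded graph more tangible, the following sketch walks it backward from z using the next_functions attribute of each grad_fn node. The walk helper is a hypothetical name introduced here for illustration, and the exact node names printed may differ between PyTorch versions. The final check applies the chain rule by hand: dz/dx = dz/dy * dy/dx = 2 * 2x = 4x.

```python
import torch

x = torch.tensor([3.0], requires_grad=True)
z = (x ** 2) * 2 + 1  # same computation as above: z = 2*x^2 + 1

def walk(fn, depth=0):
    """Print a backward node and, recursively, the nodes it feeds into."""
    if fn is None:  # constants such as 2 and 1 have no backward node
        return []
    names = [type(fn).__name__]
    print("  " * depth + names[0])
    for next_fn, _ in fn.next_functions:
        names += walk(next_fn, depth + 1)
    return names

node_names = walk(z.grad_fn)

# Backpropagate and compare with the hand-derived chain rule result 4*x.
z.backward()
print(f"autograd: {x.grad}, chain rule: {4 * x.detach()}")
```

The node at the bottom of the printed tree (an AccumulateGrad node in current PyTorch releases) is where the gradient for the leaf tensor x is written into x.grad.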

Backward pass and gradient computation

Call .backward() on a scalar output to trigger backpropagation. Autograd computes the gradient of that output with respect to every leaf tensor in the graph that has requires_grad=True and stores the result in each tensor's .grad attribute.

Computing gradients with backpropagation

import torch

# Single scalar output
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2 + 3 * x + 2

# Compute gradients via backward pass
y.backward()

# Access computed gradient
print(f"dy/dx = {x.grad}") # Expected: 2*x + 3 = 2*2 + 3 = 7

# Vector input reduced to a scalar output
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum() # Sum reduces to scalar

y.backward()
print(f"Gradients for vector: {x.grad}") # [2, 4, 6] (d/dx of x^2)
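
As a cross-check on the first example, the sketch below computes the same derivative with torch.autograd.grad, which returns the gradient directly instead of writing it to .grad, and compares it with the derivative worked out by hand (2x + 3). The variable names dy_dx and analytic are chosen here for illustration.

```python
import torch

x = torch.tensor([2.0], requires_grad=True)
y = x ** 2 + 3 * x + 2

# torch.autograd.grad returns a tuple with one gradient per input tensor.
(dy_dx,) = torch.autograd.grad(y, x)

# Derivative worked out by hand: d/dx (x^2 + 3x + 2) = 2x + 3
analytic = 2 * x.detach() + 3

print(f"autograd: {dy_dx}, analytic: {analytic}")
print(f"x.grad is left untouched: {x.grad}")
```

This functional form can be convenient when you want a gradient value without affecting the accumulated .grad attributes.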

Understanding gradient accumulation

By default, gradients accumulate in the .grad attribute. Calling backward() multiple times adds to existing gradients rather than replacing them.

import torch

x = torch.tensor([2.0], requires_grad=True)

# First backward pass
y = x ** 2
y.backward()
print(f"First gradient: {x.grad}") # 4.0

# Second backward pass—gradient accumulates
z = x ** 3
z.backward()
print(f"After second backward: {x.grad}") # 4.0 + 12.0 = 16.0

# Reset gradients to zero
x.grad.zero_()
print(f"After zero: {x.grad}") # 0.0

# This is why training loops call optimizer.zero_grad() before each backward
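
Accumulation is not only a pitfall; it can also be used deliberately. The sketch below, with made-up data and a hypothetical loss_fn helper, shows that summing gradients over two half-batches gives the same result as one backward pass over the full batch, provided the loss is a sum over samples.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
data = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])

def loss_fn(batch):
    # Sum (not mean) of squared outputs, so per-chunk losses add up exactly.
    return ((batch @ w) ** 2).sum()

# One backward pass over the full batch
loss_fn(data).backward()
full_grad = w.grad.clone()

# Reset, then accumulate over two chunks of two rows each
w.grad = None
for chunk in data.split(2):
    loss_fn(chunk).backward()
accumulated_grad = w.grad.clone()

print(f"full batch:  {full_grad}")
print(f"accumulated: {accumulated_grad}")
```

With a mean-reduced loss, each chunk's loss would need to be scaled by its share of the batch for the two results to match.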

Working with vector outputs and retain_graph

When the output is not a scalar, you must pass a gradient argument with the same shape as the output to .backward(). Separately, if you need to backpropagate through the same graph more than once, pass retain_graph=True so the graph is not freed after the first backward pass.

Managing non-scalar outputs

import torch

x = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
y = x ** 2 # Non-scalar output, shape (2, 2)

# For a non-scalar output, pass a gradient tensor of the same shape (ones is equivalent to summing the output)
y.backward(torch.ones_like(y))
print(f"Gradients from vector backward:\n{x.grad}")

# Reset for next example
x.grad.zero_()

# Compute multiple backward passes on the same graph
y = x ** 2
loss1 = y.sum()
loss1.backward(retain_graph=True) # Keep graph for another backward
print(f"After first backward: {x.grad}")

# Compute another loss on the same graph
loss2 = (y * 2).sum()
loss2.backward() # This time we don't need retain_graph
print(f"After second backward: {x.grad}")
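
The gradient argument can be read as a set of weights on the output elements: calling y.backward(v) gives the same result as backpropagating from the scalar (v * y).sum(). The sketch below checks this with arbitrary weights v chosen for illustration.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([0.1, 1.0, 10.0])  # arbitrary weights, one per output element

# Option 1: pass the weights as the gradient argument
y = x ** 2
y.backward(gradient=v)
grad_from_vector = x.grad.clone()

# Option 2: reduce to a scalar with the same weights, then backpropagate
x.grad = None
y = x ** 2
(v * y).sum().backward()
grad_from_scalar = x.grad.clone()

print(f"backward(gradient=v): {grad_from_vector}")
print(f"(v * y).sum():        {grad_from_scalar}")
```

Both results equal v * 2x, which is why passing torch.ones_like(y) is equivalent to calling y.sum().backward().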

Detaching tensors and disabling gradients

Detach tensors from the computation graph or use context managers to prevent gradient computation when it's not needed.

Method                  | Use case                       | Effect
.detach()               | Break a tensor from the graph  | Returns a new tensor with requires_grad=False
torch.no_grad()         | Inference or evaluation        | Disables gradient tracking for a code block
@torch.no_grad()        | Decorate functions             | Disables gradient tracking for the entire function
.requires_grad_(False)  | Freeze parameters              | In place: stops tracking gradients for the tensor

Detaching and disabling gradients

import torch

# Create tensor with gradient tracking
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2

# Detach breaks the computation graph
y_detached = y.detach()
print(f"Original y requires grad: {y.requires_grad}") # True
print(f"Detached y requires grad: {y_detached.requires_grad}") # False

# Operations on detached tensor don't create graph
z = y_detached + 1
print(f"z.grad_fn: {z.grad_fn}") # None (no backward possible)

# Use torch.no_grad() during inference
with torch.no_grad():
    predictions = []
    for input_batch in range(5):
        logits = x * input_batch # x requires grad, but nothing is tracked here
        predictions.append(logits)

print(f"Any prediction requires grad: {any(p.requires_grad for p in predictions)}") # False

# Decorator for functions
@torch.no_grad()
def inference_function(x):
    return x ** 2 + 1

result = inference_function(torch.tensor([3.0], requires_grad=True))
print(f"Result from decorated function requires grad: {result.requires_grad}") # False
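
Detaching is also a way to treat part of a computation as a constant during backpropagation. The sketch below compares the gradient of x^2 * x with and without detaching the x^2 factor; the variable names are illustrative only.

```python
import torch

x = torch.tensor([2.0], requires_grad=True)

# Full graph: f(x) = x^2 * x = x^3, so df/dx = 3x^2
full = (x ** 2) * x
full.backward()
full_grad = x.grad.clone()

# Detached factor: x^2 is treated as the constant 4, so df/dx = 4
x.grad = None
partial = (x ** 2).detach() * x
partial.backward()
partial_grad = x.grad.clone()

print(f"gradient through full graph:    {full_grad}")
print(f"gradient with detached factor:  {partial_grad}")
```

Note that .detach() returns a tensor that shares storage with the original, so modifying it in place also changes the original's data.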

Gradient computation in optimization loops

A typical training loop runs the forward pass, computes the loss, backpropagates to obtain gradients, and then updates the parameters with those gradients. This pattern repeats for every batch.

Standard training pattern with autograd

import torch

# Simple model: y = wx + b (parameters we'll optimize)
w = torch.tensor([0.5], requires_grad=True)
b = torch.tensor([0.1], requires_grad=True)

# Example training data
x_data = torch.tensor([1.0, 2.0, 3.0])
y_true = torch.tensor([2.0, 4.0, 6.0])

# Learning rate
learning_rate = 0.01

# Training loop (simplified)
for epoch in range(3):
    # Forward pass
    y_pred = w * x_data + b

    # Compute loss (mean squared error)
    loss = ((y_pred - y_true) ** 2).mean()

    print(f"Epoch {epoch}: Loss = {loss.item():.4f}")

    # Backward pass
    loss.backward()

    print(f" w.grad = {w.grad.item():.4f}, b.grad = {b.grad.item():.4f}")

    # Manual parameter update (using gradients)
    with torch.no_grad():
        w -= learning_rate * w.grad
        b -= learning_rate * b.grad

    # Zero gradients for next iteration
    w.grad.zero_()
    b.grad.zero_()
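
In practice the manual update and the manual zeroing are usually delegated to an optimizer. The sketch below repeats the same toy problem with torch.optim.SGD; the number of epochs (100) and the losses list are choices made here for illustration, not part of the original example.

```python
import torch

w = torch.tensor([0.5], requires_grad=True)
b = torch.tensor([0.1], requires_grad=True)

x_data = torch.tensor([1.0, 2.0, 3.0])
y_true = torch.tensor([2.0, 4.0, 6.0])

# The optimizer holds references to the parameters it should update.
optimizer = torch.optim.SGD([w, b], lr=0.01)

losses = []
for epoch in range(100):
    optimizer.zero_grad()                    # clear gradients from the previous step
    y_pred = w * x_data + b                  # forward pass
    loss = ((y_pred - y_true) ** 2).mean()   # mean squared error
    loss.backward()                          # populate w.grad and b.grad
    optimizer.step()                         # w -= lr * w.grad, b -= lr * b.grad
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
print(f"w = {w.item():.4f}, b = {b.item():.4f}")
```

Calling optimizer.zero_grad() at the top of the loop plays the same role as the explicit .grad.zero_() calls in the manual version.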

Advanced gradient features: grad_fn and leaf tensors

Understanding grad_fn and leaf tensors helps debug gradient issues.

Inspecting gradient flow

import torch

# Leaf tensor (created by user)
x = torch.tensor([2.0], requires_grad=True)
print(f"x.is_leaf: {x.is_leaf}") # True
print(f"x.grad_fn: {x.grad_fn}") # None (leaf tensors have no grad_fn)

# Non-leaf tensor (result of operation)
y = x ** 2
print(f"y.is_leaf: {y.is_leaf}") # False
print(f"y.grad_fn: {y.grad_fn}") # PowBackward0

# Gradients only accumulate in leaf tensors
y.backward()
print(f"x.grad: {x.grad}") # 4.0
print(f"y.grad: {y.grad}") # None, with a UserWarning: .grad is not populated for non-leaf tensors

# This is by design: gradients for intermediate tensors are not retained, to save memory. Call y.retain_grad() before backward() to keep them.
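
When an intermediate gradient is needed for debugging, it can be requested explicitly with retain_grad(). The sketch below is a minimal illustration using the same x and y as above plus an extra scaling step added here so that y.grad is not trivially one.

```python
import torch

x = torch.tensor([2.0], requires_grad=True)
y = x ** 2          # intermediate (non-leaf) tensor
y.retain_grad()     # ask autograd to keep dz/dy in y.grad

z = (3 * y).sum()
z.backward()

print(f"x.grad: {x.grad}")  # dz/dx = 3 * 2x = 12
print(f"y.grad: {y.grad}")  # dz/dy = 3
```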

Key Takeaways

  • Enable automatic differentiation by setting requires_grad=True on input tensors; PyTorch then tracks all operations and builds a computation graph automatically.
  • Call .backward() on a scalar loss to trigger backpropagation, which uses the chain rule to compute gradients for the leaf tensors in the graph that require them.
  • Gradients accumulate in .grad attributes, so call optimizer.zero_grad() (or zero each .grad manually) before each backward pass in a training loop to prevent unwanted accumulation.
  • Detach tensors with .detach() or use the torch.no_grad() context manager to stop graph construction, which is useful for inference and for avoiding unnecessary gradient tracking.
  • Only leaf tensors with requires_grad=True (such as user-created inputs and model parameters) keep gradients in .grad by default; gradients of intermediate tensors are not retained, to conserve memory, unless you call .retain_grad() on them.

Frequently Asked Questions

What does retain_graph=True do in backward()?

By default, PyTorch frees the computation graph after backward() to save memory. Setting retain_graph=True prevents this, allowing you to call backward() again on the same graph or compute gradients for multiple loss functions from the same forward pass.
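
The difference can be seen in a small experiment. In the sketch below, a second backward() call on a freed graph raises a RuntimeError, while the same sequence with retain_graph=True succeeds and accumulates the gradient twice. The variable second_error is introduced here only to capture the exception.

```python
import torch

x = torch.tensor([2.0], requires_grad=True)

# Without retain_graph: the second call fails because saved tensors were freed.
loss = (x ** 2).sum()
loss.backward()
second_error = None
try:
    loss.backward()
except RuntimeError as err:
    second_error = err
    print(f"RuntimeError: {err}")

# With retain_graph=True on the first call, the second call works.
x.grad = None
loss = (x ** 2).sum()
loss.backward(retain_graph=True)
loss.backward()
print(f"x.grad after two backward passes: {x.grad}")  # 4 + 4 = 8
```

Retaining the graph keeps its saved tensors in memory, so it is usually reserved for cases that genuinely need more than one backward pass.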

Why should I use torch.no_grad() during evaluation?

Gradient computation adds memory and computational overhead because autograd must store intermediate results for the backward pass. During inference, gradients are unnecessary, so disabling them with torch.no_grad() reduces memory usage and can speed up forward passes; the size of the saving depends on the model and workload.

Can I compute gradients of gradients (second-order derivatives)?

Yes, but PyTorch does not build a graph of the backward pass by default. Pass create_graph=True, for example loss.backward(create_graph=True) or torch.autograd.grad(loss, params, create_graph=True), so that the computed gradients are themselves differentiable. Second-order gradients are needed for Hessian-based optimization methods, among other uses.
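
A minimal sketch of a second derivative, assuming the simple function y = x^3 chosen here for illustration: the first call to torch.autograd.grad uses create_graph=True so that its result can be differentiated again.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 3

# First derivative: dy/dx = 3x^2; create_graph=True keeps it differentiable.
(dy_dx,) = torch.autograd.grad(y, x, create_graph=True)

# Second derivative: d2y/dx2 = 6x
(d2y_dx2,) = torch.autograd.grad(dy_dx, x)

print(f"dy/dx   = {dy_dx.item()}")    # 27.0
print(f"d2y/dx2 = {d2y_dx2.item()}")  # 18.0
```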

How do I check if a tensor is part of the computation graph?

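Check the .grad_fn attribute. If it is None, the tensor is a leaf: either it was created directly by the user or it does not require gradients. If it is a backward function object (e.g., PowBackward0), the tensor is an intermediate result produced by an operation that autograd tracked. Use .requires_grad to tell whether a leaf participates in gradient computation.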

What happens if I call backward() on a non-scalar tensor?

PyTorch will raise an error unless you pass a gradient tensor with the same shape. For non-scalar outputs, call backward(torch.ones_like(output)) or output.sum().backward() to reduce to a scalar first.
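
The sketch below shows the error and both workarounds side by side, using a small example tensor chosen for illustration. The two workarounds produce the same gradient.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Calling backward() on a non-scalar output without arguments fails.
output = x ** 2
error = None
try:
    output.backward()
except RuntimeError as err:
    error = err
    print(f"RuntimeError: {err}")

# Workaround 1: pass a gradient tensor with the same shape as the output.
output = x ** 2
output.backward(torch.ones_like(output))
grad_with_ones = x.grad.clone()

# Workaround 2: reduce to a scalar first.
x.grad = None
output = x ** 2
output.sum().backward()
grad_with_sum = x.grad.clone()

print(f"ones_like: {grad_with_ones}")
print(f"sum():     {grad_with_sum}")
```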

Further Reading