
Building Neural Networks with Modules

PyTorch's nn.Module is a container class that encapsulates neural network layers, parameters, and the forward computation logic. Every neural network model you build inherits from nn.Module, giving you automatic support for gradient tracking, parameter management, device movement, and serialization. This organizational pattern scales from simple linear classifiers to state-of-the-art vision transformers.
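
Serialization is the one capability in that list that the sections below do not exercise, so here is a minimal sketch of a state_dict round trip. It assumes an in-memory buffer (io.BytesIO) instead of a file path, purely to keep the example self-contained, and the layer sizes are arbitrary.

```python
import io

import torch
import torch.nn as nn

# A tiny model; the sizes are arbitrary and chosen only for illustration
source = nn.Linear(4, 2)

# state_dict() maps parameter names to tensors; save it to an in-memory buffer
buffer = io.BytesIO()
torch.save(source.state_dict(), buffer)

# Load the saved weights into a fresh model with the same architecture
buffer.seek(0)
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(buffer))

# Both models now hold identical weights, so they produce identical outputs
x = torch.randn(3, 4)
same_output = torch.equal(source(x), restored(x))
state_keys = list(source.state_dict().keys())
print(f"Saved keys: {state_keys}")
print(f"Restored model matches: {same_output}")
```

In practice the buffer would usually be replaced by a file path; the round trip itself is the same.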

Understanding nn.Module and its structure

nn.Module is PyTorch's abstraction for neural network components. It manages a collection of parameters (learnable weights and biases) and child modules (sub-layers), automatically registering them so that operations like .parameters(), .to(device), and .train() apply recursively. According to the PyTorch API documentation (2026), every module must implement a forward() method that defines the computation flow during the forward pass.
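
To make the recursive registration concrete, the sketch below nests one module inside another and checks that parameter names and the training flag follow the hierarchy. The class names Inner and Outer are hypothetical and exist only for this illustration.

```python
import torch
import torch.nn as nn

class Inner(nn.Module):  # hypothetical sub-module
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)

    def forward(self, x):
        return self.proj(x)

class Outer(nn.Module):  # hypothetical parent module
    def __init__(self):
        super().__init__()
        self.inner = Inner()         # registered as a child module
        self.head = nn.Linear(4, 2)  # registered as a child module

    def forward(self, x):
        return self.head(self.inner(x))

model = Outer()

# Parameter names mirror the attribute path through the hierarchy
param_names = [name for name, _ in model.named_parameters()]
print(param_names)

# Calling eval() on the parent flips the training flag on every descendant
model.eval()
all_eval = all(not m.training for m in model.modules())
print(f"All sub-modules in eval mode: {all_eval}")
```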

Creating your first module

import torch
import torch.nn as nn

# Define a simple linear classifier
class LinearClassifier(nn.Module):
    def __init__(self, input_size, num_classes):
        super().__init__()
        # Assigning a layer to an attribute registers it as a sub-module,
        # so its weight and bias appear in model.parameters()
        self.linear = nn.Linear(input_size, num_classes)

    def forward(self, x):
        # Forward pass: apply the linear layer
        return self.linear(x)

# Create an instance and inspect
model = LinearClassifier(input_size=10, num_classes=3)
print(f"Model:\n{model}")

# Perform inference
x = torch.randn(5, 10) # Batch of 5 samples, 10 features
output = model(x)
print(f"Output shape: {output.shape}") # [5, 3]

Stacking layers in sequential and custom modules

Build more complex networks by combining multiple layers, either using nn.Sequential for simple stacks or custom modules for flexible architectures.

Using nn.Sequential for simple stacks

import torch
import torch.nn as nn

# Stack layers sequentially
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 10),
)

# Forward pass
x = torch.randn(8, 20) # Batch of 8, 20 features
output = model(x)
print(f"Sequential output shape: {output.shape}") # [8, 10]

# Access layers by index
first_layer = model[0]
print(f"First layer: {first_layer}")
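
Index access works, but nn.Sequential also accepts an OrderedDict so that layers can be reached by name. The following is a small sketch of that option; the layer names hidden, act, and out are arbitrary labels chosen for this example.

```python
from collections import OrderedDict

import torch
import torch.nn as nn

# Passing an OrderedDict gives each layer a name instead of just an index
model = nn.Sequential(OrderedDict([
    ("hidden", nn.Linear(20, 64)),
    ("act", nn.ReLU()),
    ("out", nn.Linear(64, 10)),
]))

# Named layers are reachable as attributes, and indexing still works
same_layer = model[0] is model.hidden
names = [name for name, _ in model.named_children()]
print(f"Layer names: {names}")
print(f"model[0] is model.hidden: {same_layer}")

x = torch.randn(8, 20)
output = model(x)
print(f"Output shape: {output.shape}")
```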

Custom module with multiple layers

import torch
import torch.nn as nn

class MultiLayerNetwork(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, dropout_rate=0.2):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout_rate)
        self.fc2 = nn.Linear(hidden_size, hidden_size // 2)
        self.fc3 = nn.Linear(hidden_size // 2, output_size)

    def forward(self, x):
        # First hidden layer with dropout
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)

        # Second hidden layer
        x = self.fc2(x)
        x = self.relu(x)
        x = self.dropout(x)

        # Output layer
        x = self.fc3(x)
        return x

# Create and test the model
model = MultiLayerNetwork(input_size=100, hidden_size=64, output_size=5)
x = torch.randn(16, 100) # Batch of 16 samples
output = model(x)
print(f"Output shape: {output.shape}") # [16, 5]

Parameter management and optimization

Access and manipulate model parameters, which are automatically tracked for gradient computation and optimization.

Inspecting and modifying parameters

import torch
import torch.nn as nn

model = nn.Linear(10, 5)

# Get all parameters
print("Model parameters:")
for name, param in model.named_parameters():
    print(f" {name}: shape {param.shape}, requires_grad={param.requires_grad}")

# Access specific parameters
weight = model.weight # Shape: [5, 10]
bias = model.bias # Shape: [5]

print(f"\nWeight shape: {weight.shape}")
print(f"Bias shape: {bias.shape}")

# Freeze specific parameters (disable gradient tracking)
model.bias.requires_grad = False

# Count total parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nTotal parameters: {total_params}")
print(f"Trainable parameters: {trainable_params}")
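
The section title mentions optimization, so here is a hedged sketch of how the frozen bias above interacts with an optimizer. It assumes plain SGD, a mean-squared-error loss, and random placeholder data; none of these choices come from the original example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 5)
model.bias.requires_grad = False  # freeze the bias, as in the example above

# Hand only the trainable parameters to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)

# Snapshot the values so we can see what the update changes
weight_before = model.weight.detach().clone()
bias_before = model.bias.detach().clone()

# One training step on random placeholder inputs and targets
x = torch.randn(8, 10)
target = torch.randn(8, 5)
loss = nn.functional.mse_loss(model(x), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()

weight_changed = not torch.equal(model.weight, weight_before)
bias_unchanged = torch.equal(model.bias, bias_before)
print(f"Weight updated: {weight_changed}")
print(f"Bias unchanged: {bias_unchanged}")
print(f"Bias gradient: {model.bias.grad}")
```

Filtering on requires_grad keeps the optimizer from holding state for parameters it will never update.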

Device management and model movement

Move an entire model, including all of its parameters and buffers, to another device (CPU or a specific GPU) with a single call.

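| Method | Effect | Example |
|---|---|---|
| .to(device) | Move model to the given device | model.to('cuda') |
| .cuda() | Move to GPU (shorthand) | model.cuda() |
| .cpu() | Move to CPU | model.cpu() |
| param.device | Check which device a parameter is on (modules themselves have no .device attribute) | print(model.fc1.weight.device) |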

Moving models to GPU

import torch
import torch.nn as nn

model = nn.Linear(10, 5)

# Check device before moving
print(f"Initial device: {model.weight.device}") # cpu

# Move to GPU if available
if torch.cuda.is_available():
    model = model.to('cuda')
    print(f"After moving to GPU: {model.weight.device}") # cuda:0

    # Input data must be on the same device
    x = torch.randn(4, 10).to('cuda')
    output = model(x)
    print(f"Output device: {output.device}") # cuda:0
else:
    print("GPU not available, keeping on CPU")

# For flexibility, determine device dynamically
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)
x = torch.randn(4, 10).to(device)
output = model(x)

Training vs evaluation modes

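Switch models between training and evaluation modes to change how layers such as dropout and batch normalization behave. Dropout is active only in training mode; batch normalization normalizes with batch statistics in training mode and with its stored running statistics in evaluation mode.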

Setting model modes

import torch
import torch.nn as nn

class ModelWithDropout(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 64)
        self.dropout = nn.Dropout(0.5) # 50% dropout
        self.fc2 = nn.Linear(64, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = self.dropout(x) # Active in train mode, inactive in eval
        x = torch.sigmoid(self.fc2(x))
        return x

model = ModelWithDropout()

# Training mode (dropout active)
model.train()
x = torch.randn(8, 10)
output_train1 = model(x)
output_train2 = model(x)
print(f"Training outputs different (dropout active): {not torch.allclose(output_train1, output_train2)}")

# Evaluation mode (dropout inactive—deterministic)
model.eval()
output_eval1 = model(x)
output_eval2 = model(x)
print(f"Eval outputs identical (no dropout): {torch.allclose(output_eval1, output_eval2)}")
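
Batch normalization reacts to the mode switch differently from dropout. The sketch below, which assumes a standalone nn.BatchNorm1d layer and synthetic data with a non-zero mean, shows that the running statistics are updated in training mode and left untouched in evaluation mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)
x = torch.randn(16, 4) * 3 + 5  # synthetic data with a non-zero mean

# Train mode: normalize with batch statistics and update the running stats
bn.train()
bn(x)
mean_after_train = bn.running_mean.clone()

# Eval mode: normalize with the stored running stats, which stay fixed
bn.eval()
bn(x)
mean_after_eval = bn.running_mean.clone()

updated_in_train = not torch.equal(mean_after_train, torch.zeros(4))
frozen_in_eval = torch.equal(mean_after_train, mean_after_eval)
print(f"Running mean updated in train mode: {updated_in_train}")
print(f"Running mean unchanged in eval mode: {frozen_in_eval}")
```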

Nested modules and composition

Build complex architectures by nesting modules, creating a hierarchical structure where each module manages its sub-components.

Building modular architectures

import torch
import torch.nn as nn

# Reusable building block
class ConvBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        return x

# Use the block in a larger model
class SimpleConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = ConvBlock(3, 32)
        self.block2 = ConvBlock(32, 64)
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(64, 10)

    def forward(self, x):
        x = self.block1(x)
        x = self.block2(x)
        x = self.pool(x)
        x = x.view(x.size(0), -1) # Flatten
        x = self.fc(x)
        return x

model = SimpleConvNet()
x = torch.randn(4, 3, 32, 32) # Batch of 4, 3 channels, 32x32 images
output = model(x)
print(f"Output shape: {output.shape}") # [4, 10]

Key Takeaways

  • nn.Module is PyTorch's foundation for neural networks; all custom models inherit from it and implement a forward() method.
  • Define layers as instance attributes (e.g., self.linear = nn.Linear(...)) so PyTorch automatically registers them for parameter tracking and gradient computation.
  • Use nn.Sequential for simple linear stacks or custom forward() methods for flexible, reusable architectures.
  • Access model parameters via .parameters() or .named_parameters() to inspect, freeze, or count them.
  • Move entire models to devices with .to(device) or .cuda(); always ensure input data is on the same device as the model.
  • Call model.train() during training so dropout is active and batch norm uses batch statistics, and model.eval() during evaluation so dropout is disabled and batch norm uses its running statistics, giving deterministic predictions.

Frequently Asked Questions

Why do I need to define layers in __init__ instead of in forward()?

Layers defined in __init__ are registered as sub-modules, so PyTorch tracks their parameters and applies .to(device) and .train() recursively. Layers defined in forward() are created anew each iteration, aren't registered, and break parameter tracking and device movement.
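
The difference is easy to observe. In this sketch, the class names LayerInInit and LayerInForward are hypothetical; the second one deliberately shows the anti-pattern.

```python
import torch
import torch.nn as nn

class LayerInInit(nn.Module):  # hypothetical: the recommended pattern
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 3)

    def forward(self, x):
        return self.linear(x)

class LayerInForward(nn.Module):  # hypothetical: the anti-pattern
    def forward(self, x):
        # A new, randomly initialized layer is built on every call
        linear = nn.Linear(10, 3)
        return linear(x)

good = LayerInInit()
bad = LayerInForward()

# Only the registered layer contributes parameters (10 * 3 weights + 3 biases)
good_count = sum(p.numel() for p in good.parameters())
bad_count = sum(p.numel() for p in bad.parameters())
print(f"Parameters seen by PyTorch: good={good_count}, bad={bad_count}")

# The unregistered layer gets fresh random weights on each forward pass
x = torch.randn(2, 10)
good_stable = torch.equal(good(x), good(x))
bad_stable = torch.equal(bad(x), bad(x))
print(f"Repeated calls agree: good={good_stable}, bad={bad_stable}")
```

An optimizer built from bad.parameters() would receive an empty parameter list, so nothing could be trained.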

How do I apply the same operation to multiple inputs?

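Combine the inputs into one batch before passing them through the model: torch.stack() adds a batch dimension to individual samples, and torch.cat() joins tensors that already have one. Alternatively, loop over batches. For parallelism across GPUs, use nn.parallel.DistributedDataParallel, which the PyTorch documentation recommends over the older nn.DataParallel.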
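
As a rough illustration of batching, the sketch below stacks individual samples into one batch and checks that a single batched call agrees with a per-sample loop. The model and tensor sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 3)
model.eval()

# Three separate samples, each a 1-D tensor of 10 features
samples = [torch.randn(10) for _ in range(3)]

# torch.stack adds a new leading batch dimension: shape [3, 10]
batch = torch.stack(samples)

# torch.cat joins tensors that already have a batch dimension
batch_a = torch.randn(2, 10)
batch_b = torch.randn(4, 10)
combined = torch.cat([batch_a, batch_b], dim=0)  # shape [6, 10]

with torch.no_grad():
    batched_out = model(batch)                              # one call
    looped_out = torch.stack([model(s) for s in samples])   # one call per sample

matches = torch.allclose(batched_out, looped_out, atol=1e-5)
print(f"Batch shape: {batch.shape}, combined shape: {combined.shape}")
print(f"Batched and looped outputs agree: {matches}")
```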

What does inplace=True do in activation functions like ReLU?

inplace=True overwrites the input tensor instead of allocating a new one. This saves memory, but it can cause errors when the original values are still needed, for example in skip connections or when autograd needs them for the backward pass. Use inplace=True only when memory is critical and the architecture supports it.
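
A small sketch of the behavioral difference, using a hand-written three-element tensor as the assumed input:

```python
import torch
import torch.nn as nn

# Default ReLU: returns a new tensor and leaves the input untouched
x = torch.tensor([-1.0, 2.0, -3.0])
out = nn.ReLU()(x)
input_kept = torch.equal(x, torch.tensor([-1.0, 2.0, -3.0]))

# inplace=True: the input tensor's own storage is overwritten
y = torch.tensor([-1.0, 2.0, -3.0])
out_inplace = nn.ReLU(inplace=True)(y)
input_overwritten = torch.equal(y, torch.tensor([0.0, 2.0, 0.0]))
same_storage = out_inplace.data_ptr() == y.data_ptr()

print(f"Default ReLU keeps the input: {input_kept}")
print(f"In-place ReLU overwrites the input: {input_overwritten}")
print(f"In-place output shares storage with the input: {same_storage}")
```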

How can I initialize weights to specific values?

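Apply an nn.init function to the tensors you want to set, for example nn.init.xavier_uniform_(layer.weight) and nn.init.constant_(layer.bias, 0.0). Note that xavier_uniform_ requires a tensor with at least two dimensions, so applying it to every tensor in model.parameters() fails on 1-D biases; check param.dim() > 1 first, or use model.apply(fn) with a function that initializes each layer type. PyTorch provides nn.init functions like xavier_uniform_, normal_, and constant_ for common initialization schemes.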
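
One common way to do this is model.apply(), which visits every sub-module. In the sketch below, Xavier initialization is applied only to weight matrices, because xavier_uniform_ needs at least a 2-D tensor, and biases are set to zero. The helper name init_weights is hypothetical.

```python
import torch
import torch.nn as nn

def init_weights(module):  # hypothetical helper name
    # Only touch Linear layers; other module types keep their defaults
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)  # 2-D weight matrix
        nn.init.constant_(module.bias, 0.0)     # 1-D bias vector

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)
model.apply(init_weights)  # calls init_weights on every sub-module

biases_zero = all(
    torch.equal(m.bias, torch.zeros_like(m.bias))
    for m in model.modules()
    if isinstance(m, nn.Linear)
)
print(f"All Linear biases are zero: {biases_zero}")
```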

Can I conditionally include or exclude layers based on input?

Yes, use Python conditionals in forward(): if self.use_dropout: x = self.dropout(x). This is valid because PyTorch's dynamic graphs allow architecture variations per forward pass, unlike static graph frameworks.
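
A minimal sketch of that pattern follows. The class name OptionalDropoutNet and the layer sizes are assumptions made for this example; note that the dropout layer is still created in __init__ and only its use is conditional.

```python
import torch
import torch.nn as nn

class OptionalDropoutNet(nn.Module):  # hypothetical example class
    def __init__(self, use_dropout=True):
        super().__init__()
        self.use_dropout = use_dropout
        self.fc1 = nn.Linear(10, 32)
        self.dropout = nn.Dropout(0.5)  # always registered, conditionally used
        self.fc2 = nn.Linear(32, 2)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        if self.use_dropout:  # plain Python control flow
            x = self.dropout(x)
        return self.fc2(x)

model = OptionalDropoutNet(use_dropout=False)
model.train()  # dropout would be active here if it were applied

x = torch.randn(4, 10)
out = model(x)

# With the branch skipped, repeated calls agree even in training mode
deterministic = torch.equal(model(x), model(x))
print(f"Output shape: {out.shape}")
print(f"Deterministic without dropout branch: {deterministic}")
```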

Further Reading