Regularization In Neural Networks

Table of Contents
- Background
- What is Overfitting?
- Lasso (L1) and Ridge (L2) Regularisation
- Early Stopping
- Dropout
- Other Methods
- Summary
Background
So far in this neural networks 101 series we have discussed two ways to improve the performance of neural networks: hyperparameter tuning and faster gradient descent optimisers. You can check those posts below:
There is one other set of techniques that helps with performance: regularisation. Regularisation helps prevent the model from overfitting to the training dataset, leading to more accurate and consistent predictions.
In this article, we will cover a wide range of methods to regularise your neural network and how you can do it in PyTorch!
What is Overfitting?
Let's quickly recap what we mean by overfitting in machine learning and statistics.
Wikipedia describes overfitting as:
"The production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably"
In layman's terms, this is saying that the model is learning the data it is training on, but is failing to generalise. Therefore, it will have poor predictions on data it has not seen before.
Below is a visual example depicting a _null_ model (underfit), a _proposed_ model (good generalisation) and a _saturated_ model (overfit):

Notice that the overfit (saturated) model passes through every data point (it is essentially 'connecting the dots'), so it fits the training data exactly, whereas the proposed model clearly generalises much better even though its line does not pass through every point.
The code used to generate the above plot is available on my GitHub here:
Medium-Articles/Statistics/General/saturated_models.py at main · egorhowell/Medium-Articles
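If you want to experiment with the idea yourself, here is a minimal sketch in the same spirit (it is not the script from the repository): polynomials of increasing degree are fitted to noisy data, and the saturated model drives the training error towards zero while the error on unseen data explodes.

```python
import numpy as np

# Hypothetical sketch: fit polynomials of increasing degree to noisy samples of a sine wave
rng = np.random.default_rng(42)

x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=x_train.shape)

x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.2, size=x_test.shape)

# Degree 1 ~ null model (underfit), degree 3 ~ proposed model, degree 14 ~ saturated model (overfit)
for degree in (1, 3, 14):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```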
Lasso & Ridge Regularisation
**[Lasso](https://en.wikipedia.org/wiki/Lasso_%28statistics%29) and Ridge regularisation can be applied to neural networks in much the same way as they are applied to linear regression.** They add an extra penalty term to the loss function that keeps the model's weights small or sparse, encouraging simpler models and reducing the chance of overfitting.
Lasso
For Lasso regularisation (L1), the penalty term is the sum of the absolute values of the weights used in the model:

Loss = Original Loss + λ * Σ |w_i|
Where:
- λ: The regularisation parameter
- Original Loss: The initial loss without taking into account the regularisation terms.
- _w_i_: The model's weights
Lasso can cause some weights to become zero, creating a sparser neural network. This curtails the complexity of the network.
Lasso is not available directly in PyTorch, but we can add it by editing the loss function inside the code:
import torch
# Define L1 regularisation
l1_lambda = 0.01
# Training loop for the model
for input, target in data_loader:
optimizer.zero_grad()
output = model(input)
loss = loss_function(output, target)
# Calculate L1 penalty
l1_penalty = torch.tensor(0.).to(input.device)
for param in model.parameters():
l1_penalty += torch.sum(torch.abs(param))
# Add L1 penalty to the loss
loss += l1_lambda * l1_penalty
# Backward pass and optimize
loss.backward()
optimizer.step()
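One refinement worth knowing about: the loop above penalises every parameter, including bias terms. A common variation is to restrict the penalty to the weight matrices, which could be sketched like this (the `param.dim() > 1` check is just one simple heuristic for skipping biases):

```python
# Sketch: apply the L1 penalty only to weight matrices, leaving bias vectors unpenalised
l1_penalty = sum(param.abs().sum() for param in model.parameters() if param.dim() > 1)
loss = loss + l1_lambda * l1_penalty
```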
Ridge
Ridge regularisation (L2) adds the sum of the squared weights as the penalty term:

Loss = Original Loss + λ * Σ w_i²
The terms in this equation are the same as above for Lasso regularisation.
The difference from Lasso regularisation is that the weights are squared rather than taken in absolute value. This shrinks the weights towards zero without making them exactly zero, which still helps reduce overfitting.
Ridge regularisation is much easier to implement than Lasso inside PyTorch. It is done by specifying the weight_decay argument, which sets the regularisation strength:
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)
If weight_decay is too small, there will be minimal regularisation; if it is too large, the model may underfit. Therefore, it must be set correctly, which can be achieved through trial and error or hyperparameter tuning.
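As a rough sketch of that trial-and-error process, you could train one model per candidate value and keep the one with the lowest validation loss. The `model_factory`, `train` and `validate` helpers below are hypothetical stand-ins for your own build, fit and evaluation code:

```python
import torch.optim as optim

# Hypothetical sweep over weight_decay values
candidate_decays = [0.0, 1e-6, 1e-5, 1e-4, 1e-3]
results = {}

for wd in candidate_decays:
    model = model_factory()                  # build a fresh model for each run (hypothetical helper)
    optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=wd)
    train(model, optimizer)                  # fit on the training set (hypothetical helper)
    results[wd] = validate(model)            # score on the validation set (hypothetical helper)

best_wd = min(results, key=results.get)
print(f"Best weight_decay: {best_wd} (validation loss {results[best_wd]:.4f})")
```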
Early Stopping
Early stopping is probably the best regularisation method for neural networks and machine learning in general.
Early stopping measures the performance on an external validation set while the model is "learning." If the performance on the validation set improves each epoch, then the neural network continues learning on the training data.
However, if the performance on the validation set doesn't improve for a certain number of epochs, typically referred to as patience, then training is terminated early.
The validation set allows us to evaluate the model on a hold-out dataset that is not used to train the model. This is how early stopping helps with any potential overfitting problem.
Some research shows that a neural network can eventually generalise even after the performance on the validation set starts to degrade. This is known as double descent or grokking, and I highly recommend reading up on it, as it is such a fascinating result.
Below is an example of how you can implement early stopping on the famous _iris_ dataset (MIT License) using PyTorch:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import plotly.graph_objects as go

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.input_layer = nn.Linear(4, 10)
        self.output_layer = nn.Linear(10, 3)

    def forward(self, x):
        x = torch.relu(self.input_layer(x))
        x = self.output_layer(x)
        return x

# Training function
def train_one_epoch(model, data_loader, optimiser, criterion):
    model.train()
    for inputs, targets in data_loader:
        optimiser.zero_grad()
        preds = model(inputs)
        loss = criterion(preds, targets)
        loss.backward()
        optimiser.step()

# Validation function
def validate(model, data_loader, criterion):
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for inputs, targets in data_loader:
            preds = model(inputs)
            loss = criterion(preds, targets)
            total_loss += loss.item()
    return total_loss / len(data_loader)

# Main training function with early stopping
def train_model(model, train_loader, val_loader, optimiser, criterion, epochs, patience):
    best_val_loss = float('inf')
    epochs_no_improve = 0
    train_losses = []
    val_losses = []
    early_stop = 0

    for epoch in range(epochs):
        model.train()  # switch back to training mode after validation
        train_loss = 0
        for inputs, targets in train_loader:
            optimiser.zero_grad()
            preds = model(inputs)
            loss = criterion(preds, targets)
            loss.backward()
            optimiser.step()
            train_loss += loss.item()
        train_loss /= len(train_loader)
        train_losses.append(train_loss)

        # Get the validation dataset loss
        val_loss = validate(model, val_loader, criterion)
        val_losses.append(val_loss)

        # Early stopping check
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            epochs_no_improve = 0
        else:
            epochs_no_improve += 1
            if epochs_no_improve == patience:
                early_stop = epoch + 1
                break

    # Plot the early stopping
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=list(range(1, len(train_losses) + 1)), y=train_losses, mode='lines', name='Training Loss'))
    fig.add_trace(go.Scatter(x=list(range(1, len(val_losses) + 1)), y=val_losses, mode='lines', name='Validation Loss'))
    if early_stop:
        fig.add_vline(x=early_stop, line_width=3, line_dash="dash", line_color="red")
        fig.add_annotation(x=early_stop, y=max(max(train_losses), max(val_losses)),
                           text="Early Stopping", showarrow=True, arrowhead=1, ax=-50, ay=-100)
    fig.update_layout(title='Early Stopping Example', xaxis_title='Epoch', yaxis_title='Loss', template='plotly_white',
                      width=900, height=600, font=dict(size=18), xaxis=dict(tickfont=dict(size=16)),
                      yaxis=dict(tickfont=dict(size=16)), title_font_size=24)
    fig.show()
    return train_losses, val_losses

# Load and split the data
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Normalise the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)

# Convert the data into PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.long)
X_val_tensor = torch.tensor(X_val, dtype=torch.float32)
y_val_tensor = torch.tensor(y_val, dtype=torch.long)

# Load the data into PyTorch DataLoaders to allow mini-batch training
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
val_dataset = TensorDataset(X_val_tensor, y_val_tensor)
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=8)

# Model initialisation
model = Model()
optimiser = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Train and visualise results
train_losses, val_losses = train_model(model, train_loader, val_loader, optimiser, criterion, epochs=800, patience=10)

As you can see, training terminated at around 260 epochs, despite being set to run for 800, because the performance on the validation set didn't improve for 10 consecutive epochs.
The code for the above plot is available on my GitHub here:
Medium-Articles/Neural Networks/regularisation.py at main · egorhowell/Medium-Articles
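One practical refinement not shown above: when training stops, the model holds the weights from the final epoch rather than the best one. A common pattern (sketched below against the same training loop) is to snapshot the weights whenever the validation loss improves and restore them once training ends:

```python
import copy

best_state = None

# Inside the early stopping check of the training loop
if val_loss < best_val_loss:
    best_val_loss = val_loss
    epochs_no_improve = 0
    best_state = copy.deepcopy(model.state_dict())  # snapshot the best weights so far
else:
    epochs_no_improve += 1

# After training finishes, restore the best weights
if best_state is not None:
    model.load_state_dict(best_state)
```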
Dropout
**[Dropout](https://en.wikipedia.org/wiki/Dilution_%28neural_networks%29)** is one of the most famous regularisation techniques, introduced by the 'godfather' of deep learning, _Geoffrey Hinton_. It has been shown to improve the performance of state-of-the-art neural networks by a couple of percentage points.
The idea behind dropout is quite simple. At every epoch, each neuron has some probability p of being "dropped" from the learning process and ignored. However, at the next epoch, it may be "active" again and continue learning its optimal weights and biases.
Note: the output neurons are not considered for dropout.
The dropout probability, p, is a hyperparameter that should be tuned for the network you are considering. In general, it ranges from 10–50%, depending on the type of neural network you are building, such as recurrent or convolutional networks.
The diagram below illustrates the dropout technique for a three-layer network:

The reason dropout is so effective is that it teaches neurons to be useful on their own and not to co-adapt with neighbouring neurons. This makes them generalise better, as they cannot rely too heavily on any particular set of inputs.
Another nice way of thinking about it is that dropout effectively trains many different neural networks. If our network has n neurons, there are 2^n possible sub-networks, as each neuron has two states: "active" or "dropped." So after 1,000 epochs we have trained 1,000 different sub-networks, and the final model is roughly an average of all of them.
Dropout is easily added in PyTorch when declaring the architecture of the network:
import torch
import torch.nn as nn
from torch.nn.functional import relu

class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, dropout_rate):
        super(NeuralNet, self).__init__()
        self.input_layer = nn.Linear(input_size, hidden_size)
        self.dropout = nn.Dropout(dropout_rate)
        self.output_layer = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = relu(self.input_layer(x))
        x = self.dropout(x)   # randomly zero hidden activations during training
        x = self.output_layer(x)
        return x

# Example: network with 100 input features, 10 hidden units,
# 2 output classes, and a 20% dropout rate
model = NeuralNet(100, 10, 2, 0.2)
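Note that nn.Dropout is only active in training mode. PyTorch uses "inverted" dropout, scaling the surviving activations by 1/(1-p) during training, so at evaluation time the layer simply acts as the identity and no extra rescaling is needed. You just have to switch modes, as in this small usage sketch of the class above:

```python
import torch

model = NeuralNet(100, 10, 2, 0.2)
x = torch.randn(1, 100)

model.train()             # dropout active: hidden units are zeroed with probability 0.2
train_mode_output = model(x)

model.eval()              # dropout disabled: activations pass through unchanged
with torch.no_grad():
    eval_mode_output = model(x)
```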
Other Methods
Architecture
You can reduce the number of hidden layers and the number of neurons in these layers to reduce complexity, hence curtailing the chance of overfitting.
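As a minimal sketch (the layer sizes here are arbitrary), shrinking the architecture might look like this:

```python
import torch.nn as nn

# A wider, deeper network that is more prone to overfitting on small datasets
large_model = nn.Sequential(
    nn.Linear(4, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 3),
)

# A smaller alternative with fewer layers and fewer neurons per layer
small_model = nn.Sequential(
    nn.Linear(4, 16), nn.ReLU(),
    nn.Linear(16, 3),
)
```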
More Data
As always, the more data you have, the better. Having more training rows for your model to learn from makes it much more likely that the neural network will find good weights and biases.
Augment Data
Particularly for computer vision tasks, you can augment the data using random transformations (flip, rotate, shear, etc.) to increase the pool of training data.
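In PyTorch this is typically done with torchvision transforms. A minimal sketch for an image dataset might look like the following (the specific transforms and parameters are only examples):

```python
from torchvision import transforms

# Random augmentations applied on the fly each time an image is loaded,
# effectively enlarging the pool of training data
train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])
```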
Summary & Further Thoughts
Regularisation is an important concept to get right to prevent your neural network model from overfitting on the training data. The main methods I recommend adding to your neural network are early stopping and dropout; coupling these two is very effective at reducing the chance of overfitting.
Another Thing!
I have a free newsletter, Dishing the Data, where I share weekly tips for becoming a better Data Scientist. There is no "fluff" or "clickbait," just pure actionable insights from a practicing Data Scientist.