The Math Behind Gated Recurrent Units

Gated Recurrent Units (GRUs) are a powerful type of recurrent neural network (RNN) designed to handle sequential data efficiently. In this article, we'll explore what GRUs are, walk through the math behind them, and build an implementation from scratch in Python.
Index
1: Understanding the Basics
∘ 1.1: What are GRUs?
∘ 1.2: Comparison with LSTMs and Vanilla RNNs
2: The Architecture of GRUs
∘ 2.1: GRU Cell Structure
3: The Mathematics of GRUs
∘ 3.1: Update Gate
∘ 3.2: Reset Gate
∘ 3.3: Candidate Activation
∘ 3.4: Step-by-Step Example
4: Building GRU From Scratch in Python
∘ 4.1: Imports and Custom Classes
∘ 4.2: GRU Class
∘ 4.3: GRUTrainer Class
∘ 4.4: TimeSeriesDataset Class
∘ 4.5: Training the GRU Model
5: Conclusion
1: Understanding the Basics
1.1: What are GRUs?
GRUs, or Gated Recurrent Units, are a type of recurrent neural network (RNN) introduced by Kyunghyun Cho and his colleagues in 2014. Think of GRUs as a smarter version of traditional RNNs, designed to handle sequences of data more effectively.
Imagine you're trying to learn a song by listening to it repeatedly. A basic RNN might forget the beginning of the song by the time it gets to the end. GRUs solve this problem by using gates that control what information is remembered and what is forgotten.
GRUs simplify the structure of Long Short-Term Memory (LSTM) networks by merging the input and forget gates into a single update gate and adding a reset gate. This makes them faster to train and easier to work with, while still keeping the ability to remember important information for a long time.
Update Gate: This gate decides how much of the past information should be carried forward to the future.
Reset Gate: This gate determines how much of the past information to forget.
These gates help GRUs maintain a balance between remembering important details and forgetting unimportant ones, similar to how you might focus on remembering the melody of a song while ignoring the background noise.
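As a toy illustration of how a gate works (the numbers below are made up purely for this example), a gate is just a value between 0 and 1 that blends old and new information:

```python
# Toy numbers, purely for illustration.
old_memory = 0.8        # what the network remembered so far
new_information = 0.3   # what the current input suggests
update_gate = 0.9       # close to 1 -> keep mostly the old memory

new_memory = update_gate * old_memory + (1 - update_gate) * new_information
print(new_memory)  # 0.75 -> the old memory dominates because the gate is near 1
```

A gate near 1 lets the old memory pass through almost untouched, while a gate near 0 lets the new information take over. The GRU learns to set these values automatically for every element of its hidden state.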
GRUs are great for tasks where data comes in sequences, like predicting the stock market, understanding language, or even generating music. They can learn patterns in data by keeping track of past information and using it to make better predictions. This makes them incredibly useful for any application where understanding the context from previous data points is crucial.
1.2: Comparison with LSTMs and Vanilla RNNs
To understand where GRUs fit in, let's compare them with LSTMs and Vanilla RNNs.
Vanilla RNNs
Think of Vanilla RNNs as the basic version of recurrent neural networks. They work by passing information from one time step to the next, like a relay race where each runner passes the baton to the next. However, they have a big flaw: they tend to forget things over long sequences. This is due to the vanishing gradient problem, which makes it hard for them to learn long-term dependencies in data.
I've covered Recurrent Neural Networks in a previous article, where we build them from scratch.
LSTMs
Long Short-Term Memory networks were designed to fix this problem. They use a more complex structure with three types of gates: input, forget, and output gates. These gates act like a sophisticated filing system, deciding what information to keep, what to update, and what to discard. This allows LSTMs to remember important information for long periods, making them great for tasks where context over many time steps is crucial, like understanding paragraphs of text or recognizing patterns in long time series.
I've also covered LSTMs in a previous article.
GRUs
Gated Recurrent Units are a streamlined version of LSTMs. They simplify things by combining the input and forget gates into a single update gate, and they also have a reset gate. This makes GRUs less computationally intensive and faster to train than LSTMs, while still being able to handle long-term dependencies effectively.
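To make the "less computationally intensive" claim concrete, here is a rough back-of-the-envelope parameter count for a single recurrent cell. The layer sizes are arbitrary examples, and the count ignores implementation details such as separate input and recurrent biases:

```python
def rnn_cell_params(input_size, hidden_size, num_weight_sets):
    # Each weight set has input-to-hidden weights, hidden-to-hidden weights, and a bias vector.
    per_set = hidden_size * input_size + hidden_size * hidden_size + hidden_size
    return num_weight_sets * per_set

input_size, hidden_size = 64, 128

vanilla = rnn_cell_params(input_size, hidden_size, 1)  # one transformation of input + hidden state
gru = rnn_cell_params(input_size, hidden_size, 3)      # update gate, reset gate, candidate
lstm = rnn_cell_params(input_size, hidden_size, 4)     # input, forget, output gates + candidate

print(f"Vanilla RNN: {vanilla:,}")  # 24,704
print(f"GRU:         {gru:,}")      # 74,112
print(f"LSTM:        {lstm:,}")     # 98,816
```

With three weight sets instead of four, a GRU cell carries roughly three quarters of the parameters of an LSTM cell with the same dimensions, which is where the speed advantage comes from.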
2: The Architecture of GRUs

2.1: GRU Cell Structure
Now, let's take a look inside a single GRU cell. A GRU cell can be thought of as a small control unit in a factory that decides how much of the old information to keep and how much new information to add. It does this through two main components: the update gate and the reset gate. Let's explore these two concepts in more detail.
Update Gate: This gate acts like a smart filter. It determines how much of the past information should be carried forward. If the update gate decides a piece of information is important, it will keep it; otherwise, it will discard it.
Reset Gate: This gate decides how much of the past information to forget. When the reset gate is activated, it allows the cell to ignore parts of the past data, which helps in focusing on new, relevant information.
Each GRU cell takes in the current input and the previous hidden state (like a memory of what happened before). The update gate is calculated using the current input and the previous hidden state. This helps the cell decide how much of the past information to keep.
The reset gate is calculated similarly, using the current input and the previous hidden state. This gate helps decide how much of the past information to forget.
With the reset gate applied, a candidate for the new hidden state is created by combining the current input and the previous state. This candidate represents the potential new information that could be added to the hidden state.
The update gate then decides the final hidden state by blending the previous hidden state and the candidate activation. This blend ensures that important past information is retained while incorporating relevant new information.
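Putting these four steps together, here is a minimal NumPy sketch of a single GRU cell step. The weight names (W_z, U_z, and so on) and random initialization are assumptions for illustration only, not the implementation we build later in the article, and conventions differ on whether the update gate weights the old state or the candidate:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(42)
input_size, hidden_size = 4, 3

def init(shape):
    # Small random weights, purely for illustration.
    return rng.normal(scale=0.1, size=shape)

W_z, U_z, b_z = init((hidden_size, input_size)), init((hidden_size, hidden_size)), np.zeros(hidden_size)
W_r, U_r, b_r = init((hidden_size, input_size)), init((hidden_size, hidden_size)), np.zeros(hidden_size)
W_h, U_h, b_h = init((hidden_size, input_size)), init((hidden_size, hidden_size)), np.zeros(hidden_size)

x_t = rng.normal(size=input_size)   # current input
h_prev = np.zeros(hidden_size)      # previous hidden state

z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)                   # update gate: how much past to keep
r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)                   # reset gate: how much past to forget
h_candidate = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)   # candidate activation
h_t = z_t * h_prev + (1 - z_t) * h_candidate                    # blend past state and candidate

print(h_t)
```

Each of these lines corresponds to one of the equations we derive in the next section.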
3: The Mathematics of GRUs
In this section, we will go over the math behind GRUs. I suggest keeping the GRU architecture diagram from the top of the article in view, as it makes it easier to see how the data flows and gets transformed.
3.1: Update Gate
As we said before, the update gate determines how much of the past information should be carried forward to influence the current state. This gate helps the GRU retain important information over long sequences, making it effective for handling sequential data.
The update gate z_t is calculated using the current input x_t and the previous hidden state h_{t-1}. The formula for the update gate is:

z_t = σ(W_z x_t + U_z h_{t-1} + b_z)
Here: