Paper Walkthrough: Attention Is All You Need
Introduction
As the title suggests, in this article I am going to implement the Transformer architecture from scratch with PyTorch – yes, literally from scratch. Before we get into it, let me provide a brief overview of the architecture. The Transformer was first introduced in the paper "Attention Is All You Need" by Vaswani et al. back in 2017 [1]. This neural network model is designed to perform seq2seq (Sequence-to-Sequence) tasks such as machine translation and question answering, where it accepts a sequence as the input and is expected to return another sequence as the output.
Before the Transformer was introduced, we usually used RNN-based models like LSTM or GRU to accomplish seq2seq tasks. These models are indeed capable of capturing context, yet they do so in a sequential manner. This approach makes it challenging to capture long-range dependencies, especially when the important context lies far behind the current timestep. In contrast, the Transformer can freely attend to any part of the sequence it considers important without being constrained by sequential processing.
Transformer Components
The main Transformer architecture can be seen in Figure 1 below. It might look a bit intimidating at first, but don't worry – I am going to explain the entire implementation as thoroughly as possible.

You can see in the figure that the Transformer comprises many components. The large block on the left is called the Encoder, while the one on the right is called the Decoder. In the case of machine translation, for example, the Encoder is responsible for capturing the patterns of the original sentence, whereas the Decoder is employed to generate the corresponding translation.
The ability of the Transformer to freely attend to specific words comes from the Multihead Attention block, which works by comparing each word with every other word within the sequence. It is important to note that the three Multihead Attention blocks (highlighted in orange) are not exactly the same despite their similar purpose. However, while the attention mechanism captures the relationships between words, it does not account for the order of the words itself, which is actually very crucial in NLP. Thus, to retain sequence information, we employ the so-called Positional Encoding.
I think the remaining components of the network are pretty straightforward: the Add & Norm block (colored in yellow) is basically an addition followed by a normalization operation, Feed Forward (blue) is just a small stack of linear layers, Input & Output Embedding (red) are used to convert input tokens into vectors, the Linear block after the Decoder (purple) is another standard linear layer, and Softmax (green) is the layer responsible for producing a probability distribution over the vocabulary to predict the next word.
Imports and Configurations
Now let's actually start coding by importing the required modules: the base torch module for basic functionalities, the nn submodule for initializing neural network layers, and the summary() function from torchinfo, which I will use to display the details of the entire deep learning model.
# Codeblock 1
import torch
import torch.nn as nn
from torchinfo import summary
Afterwards, I am going to initialize the parameters for the Transformer model. The first parameter is SEQ_LENGTH, whose value is set to 200 (marked with #(1) in Codeblock 2 below). This is done because we want the model to process sequences of exactly 200 tokens: longer sequences will be truncated, while sequences with fewer than 200 tokens will be padded. By the way, the term token itself does not necessarily correspond to a single word, as each word can actually be broken down into several tokens. However, we will not talk about these kinds of preprocessing details here, as the main goal of this article is to implement the architectural design. In this particular case we assume that the sequence has already been preprocessed and is ready to be fed into the network. The subsequent parameters are VOCAB_SIZE_SRC (#(2)) and VOCAB_SIZE_DST (#(3)), where the former denotes the number of unique tokens that can appear in the original sequence, while the latter is the same thing but for the translated sequence. It is worth noting that the numbers for these parameters are chosen arbitrarily. In practice, sequence lengths can range from a few hundred to several thousand tokens, while vocabulary sizes typically range from tens of thousands to a few hundred thousand tokens.
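Just to give a rough picture of what that preprocessing might look like, below is a minimal sketch (not part of the original codeblocks) of a hypothetical helper that clips or right-pads a list of token ids to SEQ_LENGTH, assuming a pad token id of 0.
def pad_or_truncate(token_ids, seq_length=200, pad_id=0):
    # Hypothetical helper: clip sequences longer than seq_length and
    # right-pad shorter ones with pad_id so every sample has equal length.
    token_ids = token_ids[:seq_length]
    return token_ids + [pad_id] * (seq_length - len(token_ids))

print(len(pad_or_truncate(list(range(350)))))   # 200 (truncated)
print(len(pad_or_truncate([5, 8, 13])))         # 200 (padded)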
# Codeblock 2
SEQ_LENGTH = 200 #(1)
VOCAB_SIZE_SRC = 100 #(2)
VOCAB_SIZE_DST = 120 #(3)
BATCH_SIZE = 1 #(4)
D_MODEL = 512 #(5)
NUM_HEADS = 8 #(6)
HEAD_DIM = D_MODEL//NUM_HEADS # 512 // 8 = 64 #(7)
HIDDEN_DIM = 2048 #(8)
N = 6 #(9)
DROP_PROB = 0.1 #(10)
Still in Codeblock 2, here I set BATCH_SIZE to 1 (#(4)). You can actually use any number for the batch size since it does not affect the model architecture at all. The D_MODEL and NUM_HEADS parameters, on the other hand, are something you cannot choose arbitrarily, in the sense that D_MODEL (#(5)) needs to be divisible by NUM_HEADS (#(6)). D_MODEL itself corresponds to the model dimension, which is also equivalent to the embedding dimension. This implies that every single token is going to be represented as a vector of size 512. Meanwhile, NUM_HEADS=8 means that there will be 8 heads inside a Multihead Attention layer. Later on, the 512 features of each token will be spread evenly across these 8 attention heads, so every single head will be responsible for handling 64 features (HEAD_DIM), as computed at line #(7). The HIDDEN_DIM parameter, whose value is set to 2048 (#(8)), denotes the number of neurons in the hidden layer of the Feed Forward blocks. Next, if you go back to Figure 1, you will notice that there is a symbol N× next to the Encoder and the Decoder, which essentially means that we can stack them N times. In this case, we set it to 6 as marked at line #(9). Lastly, we can also control the rate of the dropout layers through the DROP_PROB parameter (#(10)).
In fact, all the parameter values I set above are taken from the base configuration of the Transformer model shown in the figure below.

Input & Output Embedding
Now that all parameters have been initialized, we will jump into the first components: the Input and Output Embedding. The purpose of the two is basically the same, namely to convert each token in the sequence into its corresponding 512 (D_MODEL)-dimensional vector representation. What makes them different is that the Input Embedding processes the tokens from the original sentence, whereas the Output Embedding does the same thing for the translated sentence. Codeblock 3 below shows how I implement them.
# Codeblock 3
class InputEmbedding(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings=VOCAB_SIZE_SRC,  #(1)
                                      embedding_dim=D_MODEL)

    def forward(self, x):
        print(f"original\t: {x.shape}")
        x = self.embedding(x)  #(2)
        print(f"after embedding\t: {x.shape}")
        return x

class OutputEmbedding(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings=VOCAB_SIZE_DST,  #(3)
                                      embedding_dim=D_MODEL)

    def forward(self, x):
        print(f"original\t: {x.shape}")
        x = self.embedding(x)  #(4)
        print(f"after embedding\t: {x.shape}")
        return x
The InputEmbedding() and OutputEmbedding() classes above appear to be identical. However, if you take a closer look at the nn.Embedding() layer in the two classes (at the lines marked with #(1) and #(3)), you will see that in InputEmbedding() I set the num_embeddings parameter to VOCAB_SIZE_SRC (100), while in OutputEmbedding() we set it to VOCAB_SIZE_DST (120). This approach allows us to handle two languages that have different vocabulary sizes, where in this case we assume that the source language and the destination language have 100 and 120 unique tokens, respectively. Next, the forward() method of the two classes is completely the same: it accepts a sequence of tokens and returns the result produced by the self.embedding() layer (#(2) and #(4)). Here I also print out the dimension of the tensor before and after processing so you can better understand how the tensors are actually transformed.
To check whether our code is working properly, we can test it by passing a dummy tensor through the network. In Codeblock 4 below, we first initialize the InputEmbedding() layer (#(1)), followed by a batch containing a single one-dimensional array (#(2)). This array is generated using torch.randint(), which I configure to produce a sequence of random integers between 0 and VOCAB_SIZE_SRC (100) with a length of SEQ_LENGTH (200). Afterwards, we can just pass the x_src tensor through the input_embedding layer (#(3)).
# Codeblock 4
input_embedding = InputEmbedding() #(1)
x_src = torch.randint(0, VOCAB_SIZE_SRC, (BATCH_SIZE, SEQ_LENGTH)) #(2)
x_src = input_embedding(x_src) #(3)
You can see in the output that the sequence, which initially has a length of 200, now becomes 200×512. This indicates that our InputEmbedding() class successfully converted a sequence of 200 tokens into a sequence of 200 vectors with 512 dimensions each.
# Codeblock 4 output
original : torch.Size([1, 200])
after embedding : torch.Size([1, 200, 512])
We can also test our OutputEmbedding() class in the exact same way, as shown in Codeblock 5.
# Codeblock 5
output_embedding = OutputEmbedding()
x_dst = torch.randint(0, VOCAB_SIZE_DST, (BATCH_SIZE, SEQ_LENGTH))
x_dst = output_embedding(x_dst)
# Codeblock 5 output
original : torch.Size([1, 200])
after embedding : torch.Size([1, 200, 512])
In addition to the Output Embedding layer itself, you can see back in Figure 1 that it accepts the shifted right outputs as its input. This basically means that the token fed into the Decoder at the current timestep corresponds to the token the model is expected to predict at the next timestep. This shifting is necessary because the first position in the translated sentence is reserved for the so-called start token, which signals to the network that it is the beginning of the sentence to generate. However, as I have mentioned earlier, we are not going to get deeper into such a preprocessing step. Here we assume that the x_dst tensor passed through the output_embedding layer in Codeblock 5 above already includes the start token. See Figure 3 below to better understand this idea. In this example, the sequence on the left is a sentence in English, and the sequence on the right is the corresponding shifted-right output in Indonesian.

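To make the idea of shifting more concrete, here is a tiny sketch with made-up token ids; the SOS and EOS ids as well as the target sentence below are purely hypothetical and not part of the original codeblocks.
SOS, EOS = 1, 2                  # hypothetical special-token ids
target = [37, 52, 8, 15]         # a tokenized translated sentence (made-up ids)
decoder_input = [SOS] + target   # the shifted-right sequence fed to the Output Embedding
labels = target + [EOS]          # what the model is trained to predict at each timestep
print(decoder_input, labels)     # [1, 37, 52, 8, 15] [37, 52, 8, 15, 2]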
Positional Encoding
Now that the raw token sequences have been processed by the Input and Output Embedding layers, we are going to inject positional encoding into them. The plus symbol in the architecture indicates that this is done by performing element-wise addition between the positional encoding values and the tensors produced by the Input and Output Embedding. See the zoomed-in version of the Transformer model in Figure 4 below.

According to the original paper, positional encoding is defined by the following equation, where pos is the current position along the sequence axis and i is the index of the element in the 512 (D_MODEL)-dimensional token vector:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
The above equation looks scary at a glance, but the idea is actually pretty simple. For each embedding dimension of the D_MODEL-dimensional vector, we create a sequence of numbers ranging from -1 to 1, following sine and cosine wave patterns along the sequence axis. The illustration for this is shown in Figure 6 below.

The lines drawn in orange indicate sine waves, while the ones in green are cosine waves. The wave value located at a given token position in a specific embedding dimension is taken and summed with the corresponding embedding tensor value. Furthermore, notice that the even embedding dimensions (0, 2, 4, …) and the odd embedding dimensions (1, 3, 5, …) use sine and cosine patterns alternately, with the frequency decreasing as we move from left to right across the embedding dimensions. By doing all these things, we allow the model to preserve information about the position of every token.
The implementation of this concept is done in the PositionalEncoding() class, which you can see in Codeblock 6.
# Codeblock 6
class PositionalEncoding(nn.Module):
    def forward(self):
        pos = torch.arange(SEQ_LENGTH).reshape(SEQ_LENGTH, 1)  #(1)
        print(f"pos\t\t: {pos.shape}")

        i = torch.arange(0, D_MODEL, 2)  #(2)
        denominator = torch.pow(10000, i/D_MODEL)  #(3)
        print(f"denominator\t: {denominator.shape}")

        even_pos_embed = torch.sin(pos/denominator)  #(4)
        odd_pos_embed = torch.cos(pos/denominator)  #(5)
        print(f"even_pos_embed\t: {even_pos_embed.shape}")

        stacked = torch.stack([even_pos_embed, odd_pos_embed], dim=2)  #(6)
        print(f"stacked\t\t: {stacked.shape}")

        pos_embed = torch.flatten(stacked, start_dim=1, end_dim=2)  #(7)
        print(f"pos_embed\t: {pos_embed.shape}")
        return pos_embed
The above code might seem somewhat unusual since I go directly to the forward() method, omitting the __init__() method that is typically included when working with Python classes. This is essentially because there are no neural network layers that need to be instantiated when a PositionalEncoding() object is created. The configuration parameters to be used are defined as global variables (i.e., SEQ_LENGTH and D_MODEL), and thus can be used directly inside the forward() method.
All the processes done in this forward pass encapsulate the equation shown in Figure 5. The pos variable I create at line #(1) corresponds to the same thing in the equation: it is essentially just a sequence of numbers from 0 to SEQ_LENGTH-1 (199). I want these numbers to span along the SEQ_LENGTH axis in the embedding tensor, so I add a new axis using the reshape() method. Next, at line #(2) I initialize the i array with values ranging from 0 to D_MODEL (512) with a step of 2, hence only 256 numbers are generated. The reason I do this is that in the subsequent step I want to use the i array twice: once for the even embedding dimensions and once for the odd ones. However, the i array itself is not used for the two directly; rather, we employ it to compute the entire denominator in the equation (#(3)) before eventually using it to create the sine (#(4)) and cosine waves (#(5)). At this point we already have two positional embedding tensors: even_pos_embed and odd_pos_embed. What we are going to do next is combine them such that the resulting tensor has the alternating sine and cosine pattern shown back in Figure 6. This can be achieved using a little trick that I apply at lines #(6) and #(7).
Next, we will run the following code to test whether our PositionalEncoding() class works properly.
# Codeblock 7
positional_encoding = PositionalEncoding()
positional_embedding = positional_encoding()
# Codeblock 7 output
pos : torch.Size([200, 1])
denominator : torch.Size([256])
even_pos_embed : torch.Size([200, 256]) #(1)
odd_pos_embed : torch.Size([200, 256]) #(2)
stacked : torch.Size([200, 256, 2])
pos_embed : torch.Size([200, 512]) #(3)
Here I print out every single step in the forward() function so that you can see what is actually going on under the hood. The main idea of this process is that once you get the even_pos_embed (#(1)) and odd_pos_embed (#(2)) tensors, what you need to do afterwards is merge them such that the resulting dimension becomes 200×512, as shown at line #(3) in the Codeblock 7 output. This dimension exactly matches the size of the embedding tensor we discussed in the previous section (SEQ_LENGTH×D_MODEL), allowing element-wise addition to be performed.
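As a quick check (this snippet is not part of the original codeblocks and assumes x_src from Codeblock 4 and positional_embedding from Codeblock 7 are still in memory), the element-wise addition simply broadcasts the 200×512 positional embedding over the batch axis.
x_src_with_pos = x_src + positional_embedding   # (1, 200, 512) + (200, 512) -> (1, 200, 512)
print(x_src_with_pos.shape)                     # torch.Size([1, 200, 512])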
The Attention Mechanism
The three Multihead Attention blocks highlighted in orange in Figure 1 share the same basic concept, hence they all have the same structure. The following figure shows what the components inside a single Multihead Attention block look like.

The Scaled Dot-Product Attention block (purple) in the above figure itself comprises several other components, which you can see in the illustration below.


Let's now break all these things down one by one.
Scaled Dot-Product Attention
I am going to start with the Scaled Dot-Product Attention block first. In Codeblock 8, I implement it inside the Attention() class.
# Codeblock 8
class Attention(nn.Module):
    def create_mask(self):  #(1)
        mask = torch.tril(torch.ones((SEQ_LENGTH, SEQ_LENGTH)))  #(2)
        mask[mask == 0] = -float('inf')
        mask[mask == 1] = 0
        return mask.clone().detach()

    def forward(self, q, k, v, look_ahead_mask=False):  #(3)
        print(f"q\t\t\t: {q.shape}")
        print(f"k\t\t\t: {k.shape}")
        print(f"v\t\t\t: {v.shape}")

        multiplied = torch.matmul(q, k.transpose(-1,-2))  #(4)
        print(f"multiplied\t\t: {multiplied.shape}")

        scaled = multiplied / torch.sqrt(torch.tensor(HEAD_DIM))  #(5)
        print(f"scaled\t\t\t: {scaled.shape}")

        if look_ahead_mask == True:  #(6)
            mask = self.create_mask()
            print(f"mask\t\t\t: {mask.shape}")
            scaled += mask  #(7)

        attn_output_weights = torch.softmax(scaled, dim=-1)  #(8)
        print(f"attn_output_weights\t: {attn_output_weights.shape}")

        attn_output = torch.matmul(attn_output_weights, v)  #(9)
        print(f"attn_output\t\t: {attn_output.shape}")

        return attn_output, attn_output_weights  #(10)
Similar to the PositionalEncoding() class in the previous section, here I also omit the __init__() method since there are no neural network layers to be instantiated. If you take a look at Figure 8, you will see that this block only comprises standard mathematical operations.
The Attention() class works by capturing four inputs: query (q), key (k), value (v), and a boolean parameter look_ahead_mask, as written at line #(3). The query, key and value are three different tensors that share the exact same shape. In this case, their dimensions are all 200×64, where 200 is the sequence length and 64 is the head dimension. Remember that the value of 64 for HEAD_DIM is obtained by dividing D_MODEL (512) by NUM_HEADS (8). Based on this, you can see that the Attention() class implemented here contains the operations performed within every single one of the 8 attention heads.
The first process done inside the Scaled Dot-Product Attention block is a matrix multiplication between query and key (#(4)). Remember that we need to transpose the key matrix so that its dimension becomes 64×200, allowing it to be multiplied with the query, whose dimension is 200×64. The idea behind this multiplication is to compute the relationship between each token and all other tokens. The output of this operation is commonly known as the unnormalized attention scores or attention logits, where the variance of the elements is still high. To scale these values down, we divide the tensor by the square root of the head dimension (√64), resulting in the scaled attention scores (#(5)). The actual attention weights tensor is then obtained after we pass it through a softmax function (#(8)). Lastly, this attention weights tensor is multiplied with the value (#(9)) – and here is where the magic happens: the v tensor, which is initially just a sequence of 64-dimensional token vectors, now becomes context-aware. This essentially means that each token vector is now enriched with information about its relationships with the other tokens, leading to a better understanding of the entire sentence. Finally, the forward() method returns both the context-aware token sequence (attn_output) and the attention weights (attn_output_weights), as written at line #(10).
Look-Ahead Mask
One thing that I haven't explained regarding the Codeblock 8 above is the create_mask()
function (#(1)
). The purpose of this function is to generate the so-called look-ahead mask, which is used such that the model won't be able to attend the subsequent words – hence the name look-ahead. This mask will later be implemented inside the first Multihead Attention block in the Decoder (the Masked Multi-Head Attention block, see Figure 1). The look-ahead mask itself is basically a square matrix with the height and width of SEQ_LENGTH
(200) as written at line #(2)
. Since it is not feasible to draw a 200×200 matrix, here I give you an illustration of the same thing for a sequence of 7 tokens only.

As you can see the above figure, the look-ahead mask is essentially a triangular matrix, in which its lower part is filled with zeros, while the upper part is filled with -inf (negative infinity). At this point you need to remember the property of a softmax function: a very small value passed through it will be mapped to 0. Based on this fact, we can think of these -inf values as a mask which won't allow any information to get passed through since it will eventually cause the weight matrix to pay zero attention to the corresponding token. By using this matrix, we essentially force a token to only pay attention to itself and to the previous tokens. For example, token 3 (from the Query axis) can only pay attention to token 3, 2, 1 and 0 (from the Key axis). This technique is very effective to be used during the training phase to ensure that the Decoder doesn't rely on future tokens as they are unavailable during the inference phase (since tokens will be generated one by one).
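If you want to print this pattern yourself, the snippet below (not part of the original codeblocks) builds the same kind of mask as create_mask(), just for a hypothetical 5-token sequence.
small_mask = torch.tril(torch.ones((5, 5)))    # same recipe as create_mask(), but 5x5
small_mask[small_mask == 0] = -float('inf')
small_mask[small_mask == 1] = 0
print(small_mask)
# tensor([[0., -inf, -inf, -inf, -inf],
#         [0., 0., -inf, -inf, -inf],
#         [0., 0., 0., -inf, -inf],
#         [0., 0., 0., 0., -inf],
#         [0., 0., 0., 0., 0.]])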
Regarding the implementation, the create_mask() function is only called whenever the look_ahead_mask parameter is set to True (#(6) in Codeblock 8). The resulting mask is then applied to the scaled attention scores tensor (scaled) by element-wise addition (#(7)). With this operation, any number in the scaled tensor summed with 0 remains unchanged, whereas the numbers masked with -inf also become -inf, causing their outputs after the softmax to be 0.
As always, to check whether our Scaled Dot-Product Attention mechanism and the masking process are working properly, we can run the following codeblock.
# Codeblock 9
attention = Attention()
q = torch.randn(BATCH_SIZE, SEQ_LENGTH, HEAD_DIM)
k = torch.randn(BATCH_SIZE, SEQ_LENGTH, HEAD_DIM)
v = torch.randn(BATCH_SIZE, SEQ_LENGTH, HEAD_DIM)
attn_output, attn_output_weights = attention(q, k, v, look_ahead_mask=True)
# Codeblock 9 output
q : torch.Size([1, 200, 64])
k : torch.Size([1, 200, 64])
v : torch.Size([1, 200, 64])
multiplied : torch.Size([1, 200, 200]) #(1)
scaled : torch.Size([1, 200, 200])
mask : torch.Size([200, 200])
attn_output_weights : torch.Size([1, 200, 200])
attn_output : torch.Size([1, 200, 64]) #(2)
In the Codeblock 9 output above, we can see that the multiplication between q (200×64) and the transposed k (64×200) results in a tensor of size 200×200 (#(1)). The scaling operation, mask application, and softmax do not alter this dimension. The tensor eventually changes back to the original q, k, and v size (200×64) after we perform matrix multiplication between attn_output_weights (200×200) and v (200×64), with the result stored in the attn_output variable (#(2)).
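As an additional sanity check (not part of the original codeblocks, reusing the tensors from Codeblock 9), we can verify that the mask really removes attention to future tokens while each row of weights still sums to 1.
upper = torch.triu(attn_output_weights[0], diagonal=1)    # weights above the diagonal
print(torch.allclose(upper, torch.zeros_like(upper)))     # True: no attention to future tokens
print(torch.allclose(attn_output_weights.sum(dim=-1),
                     torch.ones(BATCH_SIZE, SEQ_LENGTH))) # True: softmax rows sum to 1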
Multihead Self-Attention
The Scaled Dot-Product Attention mechanism we just discussed is the core of a Multihead Attention layer. In this section, we are going to discuss how to implement it inside the so-called Multihead Self-Attention layer. The reason it is named self is that the query, key and value fed into it are all derived from the same sequence. The two attention blocks in the Transformer architecture that implement the Self-Attention mechanism can be seen in Figure 11. You can see there that the three arrows coming into both blocks come from the same source.

Generally speaking, the objective of a Self-Attention layer is to capture the context (the relationships between words) within the same sequence. In the case of machine translation, the Self-Attention block in the Encoder (left) is responsible for doing so for the sentence in the original language, whereas the one in the Decoder (right) handles the sentence in the destination language. I mentioned previously that we need to apply the look-ahead mask to the first attention block in the Decoder. This is because later, in the inference phase, the Decoder works by returning a single token at a time; hence, during the training phase, the mask prevents the model from attending to subsequent tokens. In contrast, the Encoder accepts the entire sequence at once in both the training and inference phases. Thus, we should not apply the look-ahead mask there, since we want the model to capture the context based on the entire sentence, not only based on the previous and current tokens.
Look at Codeblock 10 below to see how I implement the Self-Attention block. Remember that it is created based on the diagram in Figure 7.
# Codeblock 10
class SelfAttention(nn.Module):
    def __init__(self, look_ahead_mask=False):  #(1)
        super().__init__()
        self.look_ahead_mask = look_ahead_mask
        self.qkv_linear = nn.Linear(D_MODEL, 3*D_MODEL)  #(2)
        self.attention = Attention()  #(3)
        self.linear = nn.Linear(D_MODEL, D_MODEL)  #(4)
I want the SelfAttention() class above to be flexible, so that we can use it either with or without a mask. To do so, I define the look_ahead_mask parameter, which by default is set to False (#(1)). Next, there are two linear layers in this class. The first one is placed before the Scaled Dot-Product Attention operation (#(2)), and the second one is placed after it (#(4)). Notice that the first linear layer (self.qkv_linear) is set to accept an input tensor of size D_MODEL (512) and return another tensor three times larger (3 × 512 = 1536). This essentially means that every single token, which is initially represented as a 512-dimensional vector, now becomes 1536-dimensional. The idea behind this operation is that we want to allocate a 512-dimensional vector for each of the query, key and value used later in the Scaled Dot-Product Attention operation (#(3)). Meanwhile, the second linear layer (self.linear) is configured to accept a token sequence where the dimensionality of each token is 512 (D_MODEL) and return another sequence of the exact same size. This layer will later be employed to combine the information from all attention heads.
Now let's move on to the forward() method of the SelfAttention() class. Below is what it looks like.
# Codeblock 11
    def forward(self, x):
        print(f"original\t\t: {x.shape}")

        x = self.qkv_linear(x)  #(1)
        print(f"after qkv_linear\t: {x.shape}")

        x = x.reshape(BATCH_SIZE, SEQ_LENGTH, NUM_HEADS, 3*HEAD_DIM)  #(2)
        print(f"after reshape\t\t: {x.shape}")

        x = x.permute(0, 2, 1, 3)  #(3)
        print(f"after permute\t\t: {x.shape}")

        q, k, v = x.chunk(3, dim=-1)  #(4)
        print(f"q\t\t\t: {q.shape}")
        print(f"k\t\t\t: {k.shape}")
        print(f"v\t\t\t: {v.shape}")

        attn_output, attn_output_weights = self.attention(q, k, v,
                                                          look_ahead_mask=self.look_ahead_mask)  #(5)
        print(f"attn_output\t\t: {attn_output.shape}")
        print(f"attn_output_weights\t: {attn_output_weights.shape}")

        x = attn_output.permute(0, 2, 1, 3)  #(6)
        print(f"after permute\t\t: {x.shape}")

        x = x.reshape(BATCH_SIZE, SEQ_LENGTH, NUM_HEADS*HEAD_DIM)  #(7)
        print(f"after reshape\t\t: {x.shape}")

        x = self.linear(x)  #(8)
        print(f"after linear\t\t: {x.shape}")

        return x
Here we can see that the input tensor x is directly processed by the self.qkv_linear layer (#(1)). The resulting tensor is then reshaped to BATCH_SIZE × SEQ_LENGTH × NUM_HEADS × 3*HEAD_DIM, as demonstrated at line #(2). Next, the permute() method is used to swap the SEQ_LENGTH and NUM_HEADS axes (#(3)). This reshaping and permutation is a trick to distribute the 1536-dimensional token vectors across the 8 attention heads, allowing them to be processed in parallel without needing to be separated into different tensors. Next, we use the chunk() method to divide the tensor into 3 parts, which correspond to q, k and v (#(4)). One thing to keep in mind is that the division operates on the last (token embedding) dimension, leaving the sequence length axis unchanged.
With the query, key, and value ready, we can now pass them all together through the Scaled Dot-Product Attention block (#(5)). Although the attention mechanism returns two tensors, in this case we only carry attn_output forward, since it is the one that actually contains the context-aware token sequence (recall that attn_output_weights is just a matrix containing the relationships between tokens). The next step is to swap the HEAD_DIM and SEQ_LENGTH axes back (#(6)) before eventually reshaping the tensor to its original dimension (#(7)). If you take a closer look at this line, you will see that NUM_HEADS is directly multiplied by HEAD_DIM. This operation effectively flattens the embeddings from the 8 attention heads back into a single dimension, which is conceptually similar to concatenating the output of each head, as illustrated in Figure 7. Lastly, to actually combine the information from these 8 heads, we pass the tensor through the second linear layer discussed earlier (#(8)). We can think of the operation done in this linear layer as a way to let the attention heads interact with each other, which results in a better understanding of the context.
Let's test the code by running the codeblock below. By the way, here I re-run all previous codeblocks with the print() functions commented out, since I only want to focus on the flow of the SelfAttention() class we just created.
# Codeblock 12
self_attention = SelfAttention()
x = torch.randn(BATCH_SIZE, SEQ_LENGTH, D_MODEL)
x = self_attention(x)
# Codeblock 12 output
original : torch.Size([1, 200, 512]) #(1)
after qkv_linear : torch.Size([1, 200, 1536]) #(2)
after reshape : torch.Size([1, 200, 8, 192]) #(3)
after permute : torch.Size([1, 8, 200, 192]) #(4)
q : torch.Size([1, 8, 200, 64]) #(5)
k : torch.Size([1, 8, 200, 64]) #(6)
v : torch.Size([1, 8, 200, 64]) #(7)
attn_output : torch.Size([1, 8, 200, 64]) #(8)
attn_output_weights : torch.Size([1, 8, 200, 200])
after permute : torch.Size([1, 200, 8, 64]) #(9)
after reshape : torch.Size([1, 200, 512]) #(10)
after linear : torch.Size([1, 200, 512]) #(11)
Based on the output above, we can see that our tensor successfully flows through all the layers. The input tensor, which initially has a size of 200×512 (#(1)), becomes 200×1536 thanks to the expansion done by the first linear layer (#(2)). The last dimension of the tensor is then distributed evenly across the 8 attention heads, resulting in each head processing 192-dimensional token vectors (#(3)). The permutation that swaps the 8-attention-head axis with the 200-sequence-length axis is essentially just a way to let PyTorch compute each head in parallel (#(4)). Next, at lines #(5) to #(7) you can see that each of the q, k, and v tensors has a dimension of 200×64 per head, which matches our discussion of Codeblock 9. After being processed by the Attention() layer, we get the attn_output tensor, which is then permuted (#(9)) and reshaped (#(10)) back to the original input tensor dimension. It is important to note that the permutation and reshaping operations need to be performed in this exact order, because we initially changed the dimension by reshaping followed by a permutation. Technically, you could revert to the original dimension without permuting, but that would mess up your tensor elements, so you really need to keep this in mind. Finally, the last step is to pass the tensor through the second linear layer in the SelfAttention() block, which does not change the tensor dimension at all (#(11)).
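To see why the order matters, here is a tiny demonstration (not part of the original codeblocks) on a made-up tensor with 2 heads, 3 tokens, and a head dimension of 4.
t = torch.arange(2 * 3 * 4).reshape(1, 2, 3, 4)        # (batch, heads, seq_len, head_dim)
correct = t.permute(0, 2, 1, 3).reshape(1, 3, 2 * 4)   # permute first, then flatten the heads per token
wrong = t.reshape(1, 3, 2 * 4)                         # flattening directly mixes tokens and heads
print(torch.equal(correct, wrong))                     # False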
Multihead Cross-Attention
If the Self-Attention layer is used to capture the relationships between all tokens within the same sequence, the Cross-Attention layer captures relationships between tokens in two different sequences, i.e., between the translated sentence and the original sentence. By doing so, the model can obtain context from the original language for each token in the translated language. You can find this mechanism in the second Multihead Attention layer in the Decoder. Below is what it actually looks like.

You can see in Figure 12 above that the arrows coming into the attention layer come from different sources. The arrows on the left and in the middle are the key and value coming from the Encoder, while the arrow on the right is the query from the Decoder itself. We can think of this mechanism as the Decoder querying information from the Encoder. Furthermore, since the Encoder accepts and reads the entire sequence at once, we don't need to apply the look-ahead mask to this attention block, so that it can access the full context of the original sequence even during the inference phase.
The implementation of a Cross-Attention layer is a little bit different from Self-Attention. As you can see in Codeblock 13 below, there are three linear layers to be implemented. The first one is self.kv_linear, which is responsible for doubling the token embedding dimension of the tensor coming from the Encoder (#(1)). As you have probably guessed, the resulting tensor will later be divided into two parts, representing the key and value. The second linear layer is named self.q_linear, whose output tensor will act as the query (#(2)). Lastly, the role of the self.linear layer is the same as in Self-Attention, i.e., to combine the information from all attention heads without changing the dimension (#(3)).
# Codeblock 13
class CrossAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.kv_linear = nn.Linear(D_MODEL, 2*D_MODEL)  #(1)
        self.q_linear = nn.Linear(D_MODEL, D_MODEL)  #(2)
        self.attention = Attention()
        self.linear = nn.Linear(D_MODEL, D_MODEL)  #(3)
The forward() method of the CrossAttention() class accepts two inputs, x_enc and x_dec, as shown at line #(1) in Codeblock 14, where the former denotes the tensor coming from the Encoder, while the latter represents the one from the Decoder.
# Codeblock 14
    def forward(self, x_enc, x_dec):  #(1)
        print(f"x_enc original\t\t: {x_enc.shape}")
        print(f"x_dec original\t\t: {x_dec.shape}")

        x_enc = self.kv_linear(x_enc)  #(2)
        print(f"\nafter kv_linear\t\t: {x_enc.shape}")

        x_enc = x_enc.reshape(BATCH_SIZE, SEQ_LENGTH, NUM_HEADS, 2*HEAD_DIM)  #(3)
        print(f"after reshape\t\t: {x_enc.shape}")

        x_enc = x_enc.permute(0, 2, 1, 3)  #(4)
        print(f"after permute\t\t: {x_enc.shape}")

        k, v = x_enc.chunk(2, dim=-1)  #(5)
        print(f"k\t\t\t: {k.shape}")
        print(f"v\t\t\t: {v.shape}")

        x_dec = self.q_linear(x_dec)  #(6)
        print(f"\nafter q_linear\t\t: {x_dec.shape}")

        x_dec = x_dec.reshape(BATCH_SIZE, SEQ_LENGTH, NUM_HEADS, HEAD_DIM)  #(7)
        print(f"after reshape\t\t: {x_dec.shape}")

        q = x_dec.permute(0, 2, 1, 3)  #(8)
        print(f"after permute (q)\t: {q.shape}")

        attn_output, attn_output_weights = self.attention(q, k, v)  #(9)
        print(f"\nattn_output\t\t: {attn_output.shape}")
        print(f"attn_output_weights\t: {attn_output_weights.shape}")

        x = attn_output.permute(0, 2, 1, 3)
        print(f"after permute\t\t: {x.shape}")

        x = x.reshape(BATCH_SIZE, SEQ_LENGTH, NUM_HEADS*HEAD_DIM)
        print(f"after reshape\t\t: {x.shape}")

        x = self.linear(x)
        print(f"after linear\t\t: {x.shape}")

        return x
The x_enc and x_dec tensors are processed separately using similar steps to those in the Self-Attention layer, i.e., processing with a linear layer, reshaping, and permuting. Notice that the processes done for these two input tensors are essentially the same. For example, line #(2) is equivalent to line #(6), line #(3) corresponds to line #(7), and line #(4) matches line #(8). We apply the chunk() method to split the x_enc tensor into key and value (#(5)), whereas in the case of x_dec, we don't need to do so, as it will directly serve as the query tensor. Next, we feed q, k, and v into the Scaled Dot-Product Attention layer (#(9)). This is the step where the information from the Encoder is queried by the Decoder. Additionally, keep in mind that here we should not enable the look-ahead mask parameter, since we want to leave the attention weights unmasked. I don't think I need to explain the remaining steps, as they are exactly the same as in the Multihead Self-Attention mechanism, which we already discussed in the previous section.
Now let's test our CrossAttention() class by passing dummy x_enc and x_dec tensors. See Codeblock 15 and its output below for the details.
# Codeblock 15
cross_attention = CrossAttention()
x_enc = torch.randn(BATCH_SIZE, SEQ_LENGTH, D_MODEL)
x_dec = torch.randn(BATCH_SIZE, SEQ_LENGTH, D_MODEL)
x = cross_attention(x_enc, x_dec)
# Codeblock 15 output
x_enc original : torch.Size([1, 200, 512]) #(1)
x_dec original : torch.Size([1, 200, 512]) #(2)
after kv_linear : torch.Size([1, 200, 1024]) #(3)
after reshape : torch.Size([1, 200, 8, 128])
after permute : torch.Size([1, 8, 200, 128])
k : torch.Size([1, 8, 200, 64]) #(4)
v : torch.Size([1, 8, 200, 64]) #(5)
after q_linear : torch.Size([1, 200, 512])
after reshape : torch.Size([1, 200, 8, 64])
after permute (q) : torch.Size([1, 8, 200, 64]) #(6)
attn_output : torch.Size([1, 8, 200, 64]) #(7)
attn_output_weights : torch.Size([1, 8, 200, 200])
after permute : torch.Size([1, 200, 8, 64])
after reshape : torch.Size([1, 200, 512]) #(8)
after linear : torch.Size([1, 200, 512]) #(9)
Initially, both x_enc and x_dec tensors have the exact same dimensions, as shown at lines #(1) and #(2) in the above output. After being passed through the self.kv_linear layer, the embedding dimension of x_enc expands from 512 to 1024 (#(3)), which means that each token is now represented by a 1024-dimensional vector. This tensor is then reshaped, permuted, and chunked, so that it becomes k and v. At this point the embedding dimensions of these two tensors have already been split across the 8 attention heads, ready to be used as the input for the Scaled Dot-Product Attention layer (#(4) and #(5)). We also apply the reshaping and permuting steps to x_dec, yet we omit the chunking process since this entire tensor will act as q (#(6)). Once this is done, the q, k, and v tensors all have the exact same dimensions, matching what we had earlier in the SelfAttention() block. Processing with the self.attention layer results in the attn_output tensor (#(7)), which is later permuted and reshaped back to the initial tensor dimension (#(8)). Finally, after being processed by the self.linear layer (#(9)), our tensor now contains a representation of the translated sentence that already carries contextual information from the original one.
Feed Forward Blocks
Our previous discussion about the attention mechanism was quite intense, especially for those who have never heard about it before – well, at least that was the case for me when I first tried to understand this idea. To give our brains a little bit of rest, let's shift our focus to the simplest component of the Transformer: the Feed Forward block.
In the Transformer architecture, you will find two identical Feed Forward blocks – one in the Encoder and another in the Decoder. Take a look at Figure 13 below to see where they are located. By implementing Feed Forward blocks like this, the depth of the network increases, as does the number of learnable parameters. This allows the network to capture more complex patterns in the data, so that it does not rely solely on the information extracted by the attention blocks.

Each of the two Feed Forward blocks above consists of a stack of two linear layers with a ReLU activation function and a dropout layer in between. The implementation of this structure is very easy, as you can just stack these layers one after another like what I do in Codeblock 16 below.
# Codeblock 16
class FeedForward(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear_0 = nn.Linear(D_MODEL, HIDDEN_DIM)  #(1)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=DROP_PROB)  #(2)
        self.linear_1 = nn.Linear(HIDDEN_DIM, D_MODEL)  #(3)

    def forward(self, x):
        print(f"original\t: {x.shape}")

        x = self.linear_0(x)
        print(f"after linear_0\t: {x.shape}")

        x = self.relu(x)
        print(f"after relu\t: {x.shape}")

        x = self.dropout(x)
        print(f"after dropout\t: {x.shape}")

        x = self.linear_1(x)
        print(f"after linear_1\t: {x.shape}")

        return x
There are several things I want to emphasize in the above codeblock. First, the self.linear_0 layer is configured to accept a tensor of size D_MODEL (512) and expand it to HIDDEN_DIM (2048), as shown at line #(1). As I've mentioned earlier, we do this so that the model can extract more information from the dataset. This tensor dimension is eventually shrunk back down to 512 by the self.linear_1 layer (#(3)), which keeps the input and output dimensions consistent throughout the network. Next, we set the rate of our dropout layer to DROP_PROB (0.1) (#(2)) according to the configuration table provided in Figure 2. As for the forward() method, I won't go into the details, as it simply connects the layers we initialized in the __init__() method.
As usual, I test the FeedForward() class by passing through a tensor of size BATCH_SIZE×SEQ_LENGTH×D_MODEL, as shown in Codeblock 17.
# Codeblock 17
feed_forward = FeedForward()
x = torch.randn(BATCH_SIZE, SEQ_LENGTH, D_MODEL)
x = feed_forward(x)
# Codeblock 17 output
original : torch.Size([1, 200, 512])
after linear_0 : torch.Size([1, 200, 2048]) #(1)
after relu : torch.Size([1, 200, 2048])
after dropout : torch.Size([1, 200, 2048])
after linear_1 : torch.Size([1, 200, 512]) #(2)
We can see in the output above that the dimensionality of each token, which is initially 512, becomes 2048 thanks to the self.linear_0 layer (#(1)). This tensor size remains unchanged through the ReLU and dropout layers before eventually being squeezed back to 512 by the self.linear_1 layer (#(2)).
Layer Normalization
The last Transformer component I want to talk about is the one that you can see throughout the entire Encoder and Decoder, namely the Add & Norm block (colored in yellow in Figure 14).

As the name suggests, this block essentially comprises an element-wise addition and a layer normalization operation. However, for the sake of simplicity, in this section I will only focus on the normalization process. The element-wise addition will later be discussed when we assemble the entire Transformer architecture.
The purpose of implementing layer normalization here is to normalize the tensor right after it is processed by the preceding block. Keep in mind that what we use is layer normalization, not batch normalization. In case you're not yet familiar with Layer Norm, it performs normalization where the statistics (i.e., mean and variance) are computed across the features (embedding dimensions) of each individual token. This is the reason that in Figure 15 all embedding dimensions of a token share the same color. In Batch Norm, on the other hand, the cells that share the same color span the batch and sequence dimensions, indicating that the mean and variance are computed across those axes.

You can see the implementation of the layer normalization mechanism in Codeblock 18. There are several variables I need to initialize within the __init__() method of the LayerNorm() class. First, there is a small number called epsilon (#(1)), which we define to prevent the division-by-zero error that could potentially occur at line #(8). Next, we also need to initialize gamma (#(2)) and beta (#(3)). These two variables can be thought of as the weight and bias in linear regression, where gamma is responsible for scaling the normalized output, whereas beta shifts it. Given this property, if gamma were fixed to 1 and beta to 0, the normalized output values would not change. However, although we do use these two numbers to initialize gamma and beta, I set the requires_grad parameter to True so that they get updated as training goes on.
# Codeblock 18
class LayerNorm(nn.Module):
    def __init__(self, eps=1e-5):
        super().__init__()
        self.eps = eps  #(1)
        self.gamma = nn.Parameter(torch.ones(D_MODEL), requires_grad=True)  #(2)
        self.beta = nn.Parameter(torch.zeros(D_MODEL), requires_grad=True)  #(3)

    def forward(self, x):  #(4)
        print(f"original\t: {x.shape}")

        mean = x.mean(dim=[-1], keepdim=True)  #(5)
        print(f"mean\t\t: {mean.shape}")

        var = ((x - mean) ** 2).mean(dim=[-1], keepdim=True)  #(6)
        print(f"var\t\t: {var.shape}")

        stddev = (var + self.eps).sqrt()  #(7)
        print(f"stddev\t\t: {stddev.shape}")

        x = (x - mean) / stddev  #(8)
        print(f"normalized\t: {x.shape}")

        x = (self.gamma * x) + self.beta  #(9)
        print(f"after scaling and shifting\t: {x.shape}")

        return x
As for the forward() method, it works by first accepting a tensor x (#(4)). Afterwards, we calculate its mean (#(5)) and variance (#(6)). Remember that because we want to compute these statistics for each row, we need to use dim=[-1] (since the embedding dimension is the last axis of the tensor). Next, we calculate the standard deviation (#(7)) so that the normalized tensor can be obtained (#(8)). Lastly, this normalized tensor is rescaled using self.gamma and self.beta, as shown at line #(9).
Now that the LayerNorm() class has been constructed, we can run the following codeblock to check whether our implementation is correct.
# Codeblock 19
layer_norm = LayerNorm()
x = torch.randn(BATCH_SIZE, SEQ_LENGTH, D_MODEL)
x = layer_norm(x)
# Codeblock 19 output
original : torch.Size([1, 200, 512])
mean : torch.Size([1, 200, 1])
var : torch.Size([1, 200, 1])
stddev : torch.Size([1, 200, 1])
normalized : torch.Size([1, 200, 512])
after scaling and shifting : torch.Size([1, 200, 512])
We can see in the output above that none of the processes done inside the LayerNorm() class alter the tensor dimension. The mean, var, and stddev tensors are just the statistics we compute for each row (token), hence the embedding dimension collapses to 1 for these tensors. By the way, in case you're wondering why we use keepdim=True, it is because setting it to False would result in mean, var, and stddev having a dimension of 1×200 rather than 1×200×1, which would make these tensors incompatible with the subsequent operations.
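As an extra check (not part of the original codeblocks), our freshly initialized LayerNorm() – with gamma still at 1 and beta at 0 – should produce practically the same output as PyTorch's built-in nn.LayerNorm over the last axis.
builtin_layer_norm = nn.LayerNorm(D_MODEL, eps=1e-5)   # built-in version for comparison
x = torch.randn(BATCH_SIZE, SEQ_LENGTH, D_MODEL)
print(torch.allclose(layer_norm(x), builtin_layer_norm(x), atol=1e-5))   # True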
The Entire Transformer Architecture
At this point we have successfully created all the components of the Transformer architecture, so they are now ready to be assembled. We will start by assembling the Encoder, followed by the Decoder, and finally I will connect the two along with the remaining components.
Encoder
There are four blocks that need to be placed sequentially in the Encoder, namely the Multihead Self-Attention, Layer Norm, Feed Forward, and another Layer Norm block. Additionally, there are also two residual connections: one skipping over the Multihead Self-Attention block and another over the Feed Forward block. See the detailed structure in Figure 16 below.

Now let's discuss the implementation in the following codeblock.
# Codeblock 20
class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.self_attention = SelfAttention(look_ahead_mask=False)  #(1)
        self.dropout_0 = nn.Dropout(DROP_PROB)  #(2)
        self.layer_norm_0 = LayerNorm()  #(3)

        self.feed_forward = FeedForward()
        self.dropout_1 = nn.Dropout(DROP_PROB)  #(4)
        self.layer_norm_1 = LayerNorm()  #(5)

    def forward(self, x):
        residual = x
        print(f"original & residual\t: {x.shape}")

        x = self.self_attention(x)  #(6)
        print(f"after self attention\t: {x.shape}")

        x = self.dropout_0(x)  #(7)
        print(f"after dropout\t\t: {x.shape}")

        x = self.layer_norm_0(x + residual)  #(8)
        print(f"after layer norm\t: {x.shape}")

        residual = x
        print(f"\nx & residual\t\t: {x.shape}")

        x = self.feed_forward(x)  #(9)
        print(f"after feed forward\t: {x.shape}")

        x = self.dropout_1(x)
        print(f"after dropout\t\t: {x.shape}")

        x = self.layer_norm_1(x + residual)
        print(f"after layer norm\t: {x.shape}")

        return x
I initialize the four blocks mentioned earlier in the __init__() method of the Encoder() class. Remember that since the Encoder reads the entire input sequence at once, we need to set the look_ahead_mask parameter to False so that every single token can attend to all other tokens (#(1)). Next, the two Layer Norm blocks are initialized separately, which I name self.layer_norm_0 and self.layer_norm_1, as shown at lines #(3) and #(5). Here I also initialize two dropout layers at lines #(2) and #(4), which will later be placed right before each normalization block.
In the forward() method, we first copy the x tensor into the residual variable, so that we can process x with the Multihead Self-Attention layer (#(6)) without affecting the original tensor. Next, we pass the resulting output through the first dropout layer (#(7)). Note that the Layer Norm block doesn't just take the output tensor of the dropout layer. Instead, we also need to inject the residual tensor into x by element-wise addition before applying the normalization step (#(8)). Afterwards, we repeat the same processes, except that this time we replace the Multihead Self-Attention block with the Feed Forward network (#(9)).
If you look back at the classes I created earlier, you will notice that all of them – specifically the ones intended to be placed inside the Encoder and Decoder – have the exact same input and output dimensions. We can check this by passing a tensor through the entire Encoder architecture, as shown in Codeblock 21 below. You will see in the output that the tensor size at each step is exactly the same.
# Codeblock 21
encoder = Encoder()
x = torch.randn(BATCH_SIZE, SEQ_LENGTH, D_MODEL)
x = encoder(x)
# Codeblock 21 output
original & residual : torch.Size([1, 200, 512])
after self attention : torch.Size([1, 200, 512])
after dropout : torch.Size([1, 200, 512])
after layer norm : torch.Size([1, 200, 512])
x & residual : torch.Size([1, 200, 512])
after feed forward : torch.Size([1, 200, 512])
after dropout : torch.Size([1, 200, 512])
after layer norm : torch.Size([1, 200, 512])
Decoder
The Decoder architecture, which you can see in Figure 17, is a little bit longer than the Encoder. Initially, the tensor passed into it is processed by a Masked Multihead Self-Attention layer. Next, we send the resulting tensor as the query input of the subsequent Multihead Cross-Attention layer, whose key and value inputs are obtained from the Encoder output. Lastly, we propagate the tensor through the Feed Forward block. Remember that here we also implement the layer normalization operations as well as the residual connections.

Regarding the implementation in Codeblock 22, we need to initialize two attention blocks inside the __init__() method. The first one is SelfAttention() with look_ahead_mask=True (#(1)), and the second one is CrossAttention() (#(3)). Here I also apply the dropout layers, which I initialize at lines #(2), #(4) and #(5)).
# Codeblock 22
class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.self_attention = SelfAttention(look_ahead_mask=True)  #(1)
        self.dropout_0 = nn.Dropout(DROP_PROB)  #(2)
        self.layer_norm_0 = LayerNorm()

        self.cross_attention = CrossAttention()  #(3)
        self.dropout_1 = nn.Dropout(DROP_PROB)  #(4)
        self.layer_norm_1 = LayerNorm()

        self.feed_forward = FeedForward()
        self.dropout_2 = nn.Dropout(DROP_PROB)  #(5)
        self.layer_norm_2 = LayerNorm()

    def forward(self, x_enc, x_dec):  #(6)
        residual = x_dec
        print(f"x_dec & residual\t: {x_dec.shape}")

        x_dec = self.self_attention(x_dec)  #(7)
        print(f"after self attention\t: {x_dec.shape}")

        x_dec = self.dropout_0(x_dec)
        print(f"after dropout\t\t: {x_dec.shape}")

        x_dec = self.layer_norm_0(x_dec + residual)  #(8)
        print(f"after layer norm\t: {x_dec.shape}")

        residual = x_dec
        print(f"\nx_dec & residual\t: {x_dec.shape}")

        x_dec = self.cross_attention(x_enc, x_dec)  #(9)
        print(f"after cross attention\t: {x_dec.shape}")

        x_dec = self.dropout_1(x_dec)
        print(f"after dropout\t\t: {x_dec.shape}")

        x_dec = self.layer_norm_1(x_dec + residual)
        print(f"after layer norm\t: {x_dec.shape}")

        residual = x_dec
        print(f"\nx_dec & residual\t: {x_dec.shape}")

        x_dec = self.feed_forward(x_dec)  #(10)
        print(f"after feed forward\t: {x_dec.shape}")

        x_dec = self.dropout_2(x_dec)
        print(f"after dropout\t\t: {x_dec.shape}")

        x_dec = self.layer_norm_2(x_dec + residual)
        print(f"after layer norm\t: {x_dec.shape}")

        return x_dec
As for the forward() method, even though it is basically just a stack of layers placed one after another, there are several things I need to highlight. First, this method accepts two input parameters: x_enc and x_dec (#(6)). As the names suggest, the former is the tensor coming from the Encoder, while the latter is the one we obtain from the previous layer in the Decoder. We initially only work with the x_dec tensor, which is processed using the first attention (#(7)) and layer normalization (#(8)) blocks. Once this is done, we use x_enc alongside the processed x_dec as the input for the cross_attention layer (#(9)), which is where our model fuses information from the Encoder and the Decoder. Lastly, the resulting output is fed into the Feed Forward block (#(10)).
We do the testing by passing two tensors of the same dimensions to simulate the actual x_enc and x_dec. Based on the output of the following codeblock, we can see that these two tensors successfully pass through the entire process, indicating that we have constructed the Decoder correctly.
# Codeblock 23
decoder = Decoder()
x_enc = torch.randn(BATCH_SIZE, SEQ_LENGTH, D_MODEL)
x_dec = torch.randn(BATCH_SIZE, SEQ_LENGTH, D_MODEL)
x = decoder(x_enc, x_dec)
# Codeblock 23 output
x_dec & residual : torch.Size([1, 200, 512])
after self attention : torch.Size([1, 200, 512])
after dropout : torch.Size([1, 200, 512])
after layer norm : torch.Size([1, 200, 512])
x_dec & residual : torch.Size([1, 200, 512])
after cross attention : torch.Size([1, 200, 512])
after dropout : torch.Size([1, 200, 512])
after layer norm : torch.Size([1, 200, 512])
x_dec & residual : torch.Size([1, 200, 512])
after feed forward : torch.Size([1, 200, 512])
after dropout : torch.Size([1, 200, 512])
after layer norm : torch.Size([1, 200, 512])
Combining Encoder and Decoder
Now that we have successfully created the Encoder() and Decoder() classes, we can get into the very last part of this writing: connecting the Encoder to the Decoder along with the other components that interact with them. Here I provide a figure showing the entire Transformer architecture for reference, so you don't need to scroll all the way back to Figure 1 just to verify our implementation in Codeblocks 24 and 25.

In the codeblock below, I implement the architecture inside the Transformer() class. In the __init__() method, we initialize the input and output embedding layers (#(1) and #(2)). These two layers are responsible for converting tokens into their corresponding 512-dimensional vector representations. Next, we initialize a single positional_encoding layer (#(3)), which will be used twice: once for the embedded input tokens, and once for the embedded output tokens. Meanwhile, the initialization of the Encoder (#(4)) and the Decoder (#(5)) blocks is a little bit different, since in this case we utilize nn.ModuleList(). We can think of this as a list of modules which we will connect sequentially later in the forward pass, where each is repeated N (6) times. In fact, this is essentially why I name them self.encoders and self.decoders (with an s). The last thing we need to do in the __init__() method is initialize the self.linear layer (#(6)), which is responsible for mapping the 512-dimensional token embeddings to all possible tokens in the destination language. We can think of this as a classification task, where the model chooses one token at a time as the prediction result based on the probability scores.
# Codeblock 24
class Transformer(nn.Module):
    def __init__(self):
        super().__init__()

        self.input_embedding = InputEmbedding()  #(1)
        self.output_embedding = OutputEmbedding()  #(2)
        self.positional_encoding = PositionalEncoding()  #(3)

        self.encoders = nn.ModuleList([Encoder() for _ in range(N)])  #(4)
        self.decoders = nn.ModuleList([Decoder() for _ in range(N)])  #(5)

        self.linear = nn.Linear(D_MODEL, VOCAB_SIZE_DST)  #(6)
The way our forward() method works is a little bit unusual. Remember that the entire Transformer accepts two inputs: a sequence from the original language, and the shifted-right sequence from the translated language. Hence, in Codeblock 25 below, you will see that this method accepts two sequences: x_enc_raw and x_dec_raw (#(1)). The _raw suffix indicates that it is a raw token sequence, i.e., a sequence of integers, not tokens that have already been converted into 512-dimensional vectors. This conversion is done at lines #(2) and #(5). Afterwards, we inject positional encoding into the resulting tensors by element-wise addition, which is done at line #(3) for the sequence to be fed into the Encoder, and at #(6) for the one to be passed through the Decoder. Next, we use a loop to feed the output of each Encoder block into the subsequent one sequentially (#(4)). We do a similar thing for the Decoder blocks, except that each of them accepts both x_enc and x_dec (#(7)). What you need to notice at this point is that the x_enc fed into every Decoder block is always the one coming out of the last Encoder block, whereas the x_dec tensor fed into the next Decoder is always the one produced by the previous Decoder block. You can verify this by taking a closer look at line #(7), where x_dec is updated at each iteration while x_enc is not. Lastly, once the Decoder loop is completed, we pass the resulting tensor to the linear layer (#(8)). If you take a look at Figure 18, you will notice that there is a softmax layer placed after this linear layer. However, we won't implement it here because in PyTorch it is already included in the loss function.
# Codeblock 25
    def forward(self, x_enc_raw, x_dec_raw):  #(1)
        print(f"x_enc_raw\t\t: {x_enc_raw.shape}")
        print(f"x_dec_raw\t\t: {x_dec_raw.shape}")

        # Encoder
        x_enc = self.input_embedding(x_enc_raw)  #(2)
        print(f"\nafter input embedding\t: {x_enc.shape}")

        x_enc = x_enc + self.positional_encoding()  #(3)
        print(f"after pos encoding\t: {x_enc.shape}")

        for i, encoder in enumerate(self.encoders):
            x_enc = encoder(x_enc)  #(4)
            print(f"after encoder #{i}\t: {x_enc.shape}")

        # Decoder
        x_dec = self.output_embedding(x_dec_raw)  #(5)
        print(f"\nafter output embedding\t: {x_dec.shape}")

        x_dec = x_dec + self.positional_encoding()  #(6)
        print(f"after pos encoding\t: {x_dec.shape}")

        for i, decoder in enumerate(self.decoders):
            x_dec = decoder(x_enc, x_dec)  #(7)
            print(f"after decoder #{i}\t: {x_dec.shape}")

        x = self.linear(x_dec)  #(8)
        print(f"\nafter linear\t\t: {x.shape}")

        return x
Now that the Transformer() class is completed, we can test it with the following codeblock. You can see in the resulting output that our x_enc_raw and x_dec_raw successfully passed through the entire Transformer architecture, which essentially means that our network is finally ready to be trained for seq2seq tasks.
# Codeblock 26
transformer = Transformer()

x_enc_raw = torch.randint(0, VOCAB_SIZE_SRC, (BATCH_SIZE, SEQ_LENGTH))  # random source token ids
x_dec_raw = torch.randint(0, VOCAB_SIZE_DST, (BATCH_SIZE, SEQ_LENGTH))  # random shifted-right target token ids

y = transformer(x_enc_raw, x_dec_raw).shape
# Codeblock 26 output
x_enc_raw : torch.Size([1, 200])
x_dec_raw : torch.Size([1, 200])
after input embedding : torch.Size([1, 200, 512])
after pos encoding : torch.Size([1, 200, 512])
after encoder #0 : torch.Size([1, 200, 512])
after encoder #1 : torch.Size([1, 200, 512])
after encoder #2 : torch.Size([1, 200, 512])
after encoder #3 : torch.Size([1, 200, 512])
after encoder #4 : torch.Size([1, 200, 512])
after encoder #5 : torch.Size([1, 200, 512])
after output embedding : torch.Size([1, 200, 512])
after pos encoding : torch.Size([1, 200, 512])
after decoder #0 : torch.Size([1, 200, 512])
after decoder #1 : torch.Size([1, 200, 512])
after decoder #2 : torch.Size([1, 200, 512])
after decoder #3 : torch.Size([1, 200, 512])
after decoder #4 : torch.Size([1, 200, 512])
after decoder #5 : torch.Size([1, 200, 512])
after linear : torch.Size([1, 200, 120])
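To make the earlier remark about the softmax concrete: the [1, 200, 120] logits produced above can be fed directly to nn.CrossEntropyLoss, which applies the log-softmax internally. Below is a minimal, hypothetical training-loss sketch; the random target tensor and the flattening of the batch and sequence dimensions are assumptions for illustration only, not part of the original code.
# Minimal sketch (illustration only): computing a training loss from the raw
# logits. nn.CrossEntropyLoss applies log-softmax internally, which is why no
# explicit softmax layer is needed in the model.
criterion = nn.CrossEntropyLoss()

logits  = transformer(x_enc_raw, x_dec_raw)                            # (1, 200, 120)
targets = torch.randint(0, VOCAB_SIZE_DST, (BATCH_SIZE, SEQ_LENGTH))   # hypothetical ground-truth token ids

loss = criterion(logits.reshape(-1, VOCAB_SIZE_DST),   # (200, 120)
                 targets.reshape(-1))                  # (200,)
print(loss.item())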
Looking more closely at the Codeblock 26 output, you can see that the tensor dimensions are consistent throughout the entire stack of Encoder and Decoder blocks. This property allows us to scale the model easily: if we want to increase the model capacity so that it can learn from a larger dataset, we can simply stack more Encoders and Decoders, and if we want the model to be more efficient, we can reduce the number of these blocks. In fact, it is not only the number of Encoders and Decoders; you can basically change the value of every parameter defined in Codeblock 2 according to your needs.
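As a quick illustration of why this works, the sketch below stacks a different number of Encoder blocks and passes a dummy tensor through them; because every block maps [BATCH_SIZE, SEQ_LENGTH, D_MODEL] to a tensor of the same shape, any depth is valid. The depth of 12 here is an arbitrary choice for demonstration, not something from the original configuration.
# Minimal sketch (illustration only): stacking an arbitrary number of Encoder
# blocks still produces the same output shape.
deeper_encoders = nn.ModuleList([Encoder() for _ in range(12)])  # hypothetical depth

x = torch.randn(BATCH_SIZE, SEQ_LENGTH, D_MODEL)
for encoder in deeper_encoders:
    x = encoder(x)

print(x.shape)  # torch.Size([1, 200, 512]), unchanged regardless of depth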
The following code is optional, but if you're curious about the overall structure of the Transformer architecture, you can run it to display a layer-by-layer summary with torchinfo.
# Codeblock 27
transformer = Transformer()
summary(transformer, input_data=(x_enc_raw, x_dec_raw))
==========================================================================================
Layer (type:depth-idx) Output Shape Param #
==========================================================================================
Transformer [1, 200, 120] --
├─InputEmbedding: 1-1 [1, 200, 512] --
│ └─Embedding: 2-1 [1, 200, 512] 51,200
├─PositionalEncoding: 1-2 [200, 512] --
├─ModuleList: 1-3 -- --
│ └─Encoder: 2-2 [1, 200, 512] --
│ │ └─SelfAttention: 3-1 [1, 200, 512] 1,050,624
│ │ └─Dropout: 3-2 [1, 200, 512] --
│ │ └─LayerNorm: 3-3 [1, 200, 512] 1,024
│ │ └─FeedForward: 3-4 [1, 200, 512] 2,099,712
│ │ └─Dropout: 3-5 [1, 200, 512] --
│ │ └─LayerNorm: 3-6 [1, 200, 512] 1,024
│ └─Encoder: 2-3 [1, 200, 512] --
│ │ └─SelfAttention: 3-7 [1, 200, 512] 1,050,624
│ │ └─Dropout: 3-8 [1, 200, 512] --
│ │ └─LayerNorm: 3-9 [1, 200, 512] 1,024
│ │ └─FeedForward: 3-10 [1, 200, 512] 2,099,712
│ │ └─Dropout: 3-11 [1, 200, 512] --
│ │ └─LayerNorm: 3-12 [1, 200, 512] 1,024
│ └─Encoder: 2-4 [1, 200, 512] --
│ │ └─SelfAttention: 3-13 [1, 200, 512] 1,050,624
│ │ └─Dropout: 3-14 [1, 200, 512] --
│ │ └─LayerNorm: 3-15 [1, 200, 512] 1,024
│ │ └─FeedForward: 3-16 [1, 200, 512] 2,099,712
│ │ └─Dropout: 3-17 [1, 200, 512] --
│ │ └─LayerNorm: 3-18 [1, 200, 512] 1,024
│ └─Encoder: 2-5 [1, 200, 512] --
│ │ └─SelfAttention: 3-19 [1, 200, 512] 1,050,624
│ │ └─Dropout: 3-20 [1, 200, 512] --
│ │ └─LayerNorm: 3-21 [1, 200, 512] 1,024
│ │ └─FeedForward: 3-22 [1, 200, 512] 2,099,712
│ │ └─Dropout: 3-23 [1, 200, 512] --
│ │ └─LayerNorm: 3-24 [1, 200, 512] 1,024
│ └─Encoder: 2-6 [1, 200, 512] --
│ │ └─SelfAttention: 3-25 [1, 200, 512] 1,050,624
│ │ └─Dropout: 3-26 [1, 200, 512] --
│ │ └─LayerNorm: 3-27 [1, 200, 512] 1,024
│ │ └─FeedForward: 3-28 [1, 200, 512] 2,099,712
│ │ └─Dropout: 3-29 [1, 200, 512] --
│ │ └─LayerNorm: 3-30 [1, 200, 512] 1,024
│ └─Encoder: 2-7 [1, 200, 512] --
│ │ └─SelfAttention: 3-31 [1, 200, 512] 1,050,624
│ │ └─Dropout: 3-32 [1, 200, 512] --
│ │ └─LayerNorm: 3-33 [1, 200, 512] 1,024
│ │ └─FeedForward: 3-34 [1, 200, 512] 2,099,712
│ │ └─Dropout: 3-35 [1, 200, 512] --
│ │ └─LayerNorm: 3-36 [1, 200, 512] 1,024
├─OutputEmbedding: 1-4 [1, 200, 512] --
│ └─Embedding: 2-8 [1, 200, 512] 61,440
├─PositionalEncoding: 1-5 [200, 512] --
├─ModuleList: 1-6 -- --
│ └─Decoder: 2-9 [1, 200, 512] --
│ │ └─SelfAttention: 3-37 [1, 200, 512] 1,050,624
│ │ └─Dropout: 3-38 [1, 200, 512] --
│ │ └─LayerNorm: 3-39 [1, 200, 512] 1,024
│ │ └─CrossAttention: 3-40 [1, 200, 512] 1,050,624
│ │ └─Dropout: 3-41 [1, 200, 512] --
│ │ └─LayerNorm: 3-42 [1, 200, 512] 1,024
│ │ └─FeedForward: 3-43 [1, 200, 512] 2,099,712
│ │ └─Dropout: 3-44 [1, 200, 512] --
│ │ └─LayerNorm: 3-45 [1, 200, 512] 1,024
│ └─Decoder: 2-10 [1, 200, 512] --
│ │ └─SelfAttention: 3-46 [1, 200, 512] 1,050,624
│ │ └─Dropout: 3-47 [1, 200, 512] --
│ │ └─LayerNorm: 3-48 [1, 200, 512] 1,024
│ │ └─CrossAttention: 3-49 [1, 200, 512] 1,050,624
│ │ └─Dropout: 3-50 [1, 200, 512] --
│ │ └─LayerNorm: 3-51 [1, 200, 512] 1,024
│ │ └─FeedForward: 3-52 [1, 200, 512] 2,099,712
│ │ └─Dropout: 3-53 [1, 200, 512] --
│ │ └─LayerNorm: 3-54 [1, 200, 512] 1,024
│ └─Decoder: 2-11 [1, 200, 512] --
│ │ └─SelfAttention: 3-55 [1, 200, 512] 1,050,624
│ │ └─Dropout: 3-56 [1, 200, 512] --
│ │ └─LayerNorm: 3-57 [1, 200, 512] 1,024
│ │ └─CrossAttention: 3-58 [1, 200, 512] 1,050,624
│ │ └─Dropout: 3-59 [1, 200, 512] --
│ │ └─LayerNorm: 3-60 [1, 200, 512] 1,024
│ │ └─FeedForward: 3-61 [1, 200, 512] 2,099,712
│ │ └─Dropout: 3-62 [1, 200, 512] --
│ │ └─LayerNorm: 3-63 [1, 200, 512] 1,024
│ └─Decoder: 2-12 [1, 200, 512] --
│ │ └─SelfAttention: 3-64 [1, 200, 512] 1,050,624
│ │ └─Dropout: 3-65 [1, 200, 512] --
│ │ └─LayerNorm: 3-66 [1, 200, 512] 1,024
│ │ └─CrossAttention: 3-67 [1, 200, 512] 1,050,624
│ │ └─Dropout: 3-68 [1, 200, 512] --
│ │ └─LayerNorm: 3-69 [1, 200, 512] 1,024
│ │ └─FeedForward: 3-70 [1, 200, 512] 2,099,712
│ │ └─Dropout: 3-71 [1, 200, 512] --
│ │ └─LayerNorm: 3-72 [1, 200, 512] 1,024
│ └─Decoder: 2-13 [1, 200, 512] --
│ │ └─SelfAttention: 3-73 [1, 200, 512] 1,050,624
│ │ └─Dropout: 3-74 [1, 200, 512] --
│ │ └─LayerNorm: 3-75 [1, 200, 512] 1,024
│ │ └─CrossAttention: 3-76 [1, 200, 512] 1,050,624
│ │ └─Dropout: 3-77 [1, 200, 512] --
│ │ └─LayerNorm: 3-78 [1, 200, 512] 1,024
│ │ └─FeedForward: 3-79 [1, 200, 512] 2,099,712
│ │ └─Dropout: 3-80 [1, 200, 512] --
│ │ └─LayerNorm: 3-81 [1, 200, 512] 1,024
│ └─Decoder: 2-14 [1, 200, 512] --
│ │ └─SelfAttention: 3-82 [1, 200, 512] 1,050,624
│ │ └─Dropout: 3-83 [1, 200, 512] --
│ │ └─LayerNorm: 3-84 [1, 200, 512] 1,024
│ │ └─CrossAttention: 3-85 [1, 200, 512] 1,050,624
│ │ └─Dropout: 3-86 [1, 200, 512] --
│ │ └─LayerNorm: 3-87 [1, 200, 512] 1,024
│ │ └─FeedForward: 3-88 [1, 200, 512] 2,099,712
│ │ └─Dropout: 3-89 [1, 200, 512] --
│ │ └─LayerNorm: 3-90 [1, 200, 512] 1,024
├─Linear: 1-7 [1, 200, 120] 61,560
==========================================================================================
Total params: 44,312,696
Trainable params: 44,312,696
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 44.28
==========================================================================================
Input size (MB): 0.00
Forward/backward pass size (MB): 134.54
Params size (MB): 177.25
Estimated Total Size (MB): 311.79
==========================================================================================
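If you prefer not to rely on torchinfo, the total parameter count can also be verified with a one-liner directly in PyTorch, which should report the same 44,312,696 parameters as the summary above.
# Quick sanity check: count all registered parameters without torchinfo.
total_params = sum(p.numel() for p in transformer.parameters())
print(total_params)  # expected to match the "Total params" row above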
Ending
And that's all for today's tutorial about the Transformer and its PyTorch implementation. I would like to congratulate everyone who followed along with all the discussions above, as you've spent more than 40 minutes reading this article! By the way, feel free to comment if you spot any mistakes in my explanation or the code.
I hope you find this article useful. Thanks for reading, and see ya in the next one!
_P.S. Here's the link to the GitHub repository._
References
[1] Ashish Vaswani et al. Attention Is All You Need. arXiv. https://arxiv.org/pdf/1706.03762 [Accessed September 29, 2024].
[2] Image created originally by author.
[3] Sheng Shen et al. PowerNorm: Rethinking Batch Normalization in Transformers. arXiv. https://arxiv.org/abs/2003.07845 [Accessed October 3, 2024].