The Ultimate Guide to Vision Transformers


Hi everyone! For those who do not know me yet, my name is Francois, I am a Research Scientist at Meta. I have a passion for explaining advanced AI concepts and making them more accessible.

Today, let's dive into one of the most significant contributions in the field of Computer Vision: the Vision Transformer (ViT).

Converting an image into patches, image by author

A bit of history first…

The Vision Transformer was introduced by Alexey Dosovitskiy et al. (Google Brain) in 2021 in the paper An Image is Worth 16×16 Words. At the time, Transformers, introduced in the must-read 2017 paper Attention Is All You Need, had proven to be the key to unlocking great performance on NLP tasks.

Between 2017 and 2021, there were several attempts to integrate the attention mechanism into Convolutional Neural Networks (CNNs). However, these were mostly hybrid models (combining CNN layers with attention layers) and lacked scalability. Google addressed this by completely eliminating convolutions and leveraging their computational power to scale the model.

The million-dollar question the ViT paper answered is…

The Google vision team followed the blueprint laid out by the other Google team that had designed the Transformer for text. The key challenge they addressed was:

"How can the attention mechanism be adapted for images?"

In NLP, tokens (words or subwords) serve as the basis for computing attention. However, images do not naturally lend themselves to such tokenization. Should a single pixel be considered a token? Or should the entire image be treated as one?

Since self-attention compares every token with every other token, treating each pixel as a unit would mean computing attention across all pixels. For a low-resolution image like 224×224 (which contains 50,176 pixels), this amounts to roughly 50,176² ≈ 2.5 billion pairwise interactions per attention layer – an impractical amount of computation.

Conversely, treating the entire image as a single token is too simplistic. The solution lies in between: converting the image into a sequence of patches. In their paper, the authors used patches with a resolution of 16×16 pixels.
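To make this concrete, here is a minimal sketch (assuming PyTorch; the tensor names are purely illustrative) of cutting a 224×224 RGB image into 16×16 patches and flattening each one:

```python
import torch

P = 16                                    # patch size
img = torch.randn(1, 3, 224, 224)         # (batch, C, H, W) dummy image
B, C, H, W = img.shape

# Cut the image into non-overlapping P x P patches with unfold,
# then flatten each patch into a vector of size P*P*C.
patches = img.unfold(2, P, P).unfold(3, P, P)   # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5)     # (B, H/P, W/P, C, P, P)
patches = patches.flatten(1, 2).flatten(2)      # (B, N, P*P*C)

print(patches.shape)  # torch.Size([1, 196, 768]) -> N = 196 patches of 768 values each
```

A learnable linear layer (e.g., nn.Linear(P * P * C, D)) then maps each 768-dimensional patch vector to the model dimension D, giving the sequence of patch tokens.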

Vision Transformer Architecture:

ViT architecture, image taken from "An Image is Worth 16×16 Words"

Key notations:

  • P = 16: Patch size
  • H, W: Height and Width of the image, which must be divisible by P
  • C = 3: Number of channels (RGB)
  • D: Latent vector size, the dimension each flattened patch is projected to and the token dimension used throughout the Transformer.

Mathematically, the image is cut into N = (H × W) / P² patches. Each patch is flattened into a vector of size P² · C (16 × 16 × 3 = 768 values) and mapped to a D-dimensional token with a learnable linear projection.

Image by author

This is the most important part to understand. Once we have a sequence of tokens, we apply a Transformer encoder. We just have to understand how to add positional encoding to these tokens, and how to get a single vector representation from all tokens.
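As a rough sketch of this step (using PyTorch's generic nn.TransformerEncoder as a stand-in for the encoder described in the paper, configured with pre-norm blocks and GELU MLPs to stay close in spirit, with ViT-Base-like dimensions):

```python
import torch
import torch.nn as nn

D, N, L = 768, 196, 12                      # token dim, number of patches, depth

# A stand-in encoder: pre-norm Transformer blocks with GELU MLPs.
# This is a sketch, not the paper's exact code.
layer = nn.TransformerEncoderLayer(
    d_model=D, nhead=12, dim_feedforward=4 * D,
    activation="gelu", batch_first=True, norm_first=True,
)
encoder = nn.TransformerEncoder(layer, num_layers=L)

tokens = torch.randn(1, N, D)               # (batch, N, D) patch tokens
out = encoder(tokens)                       # (batch, N, D): one output token per patch
```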

The CLS (Class) Token: A representation of the whole image

If you're familiar with the attention mechanism, you'll recognize that starting with N tokens and applying L layers of attention results in N tokens – one for each patch. This forms the "feature map," where each patch is encoded into a vector (token) of dimension D.

However, to classify an image, we need a single vector to represent it. While it's possible to average or "pool" all N tokens into a single vector, the authors adopted a method similar to BERT by introducing a token dedicated to this purpose: the CLS token.

This token is prepended to the N patch tokens, so the input sequence is made of N+1 tokens.
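A minimal sketch of this step, assuming PyTorch (variable names are illustrative):

```python
import torch
import torch.nn as nn

D = 768
cls_token = nn.Parameter(torch.zeros(1, 1, D))         # one learnable token of dimension D

patch_tokens = torch.randn(1, 196, D)                  # (batch, N, D)
cls = cls_token.expand(patch_tokens.size(0), -1, -1)   # repeat across the batch
tokens = torch.cat([cls, patch_tokens], dim=1)         # (batch, N+1, D)

# After the encoder, the output at position 0 (the CLS token)
# is the single vector used to classify the image.
```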

Positional Encoding

Positional Encoding, image from original ViT paper

Feeding tokens directly into the attention mechanism would result in a lack of spatial awareness, as the mechanism wouldn't know the position of each patch. To address this, positional encoding is added to each token.

Positional encoding can be hard-coded (e.g., using sin/cos functions, as in the Attention is All You Need paper) or learned during training.

I am personally quite a fan of Rich Sutton's "bitter lesson": whenever we try to build an inductive bias into a model, we tend to find that, with enough data and scaling, it would have been better to let the model learn it by itself.

In the ViT paper, the positional encoding is learnable. From the previous part, we saw that we end up with a matrix of dimensions (N+1, D). So the positional encoding is also a matrix of dimensions (N+1, D), which is simply added to the token matrix.
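A minimal sketch, again assuming PyTorch (the truncated-normal initialization is a common choice in ViT implementations, not something spelled out above):

```python
import torch
import torch.nn as nn

N, D = 196, 768
pos_embed = nn.Parameter(torch.zeros(1, N + 1, D))   # learnable, one row per token (CLS included)
nn.init.trunc_normal_(pos_embed, std=0.02)           # common initialization choice

tokens = torch.randn(1, N + 1, D)                    # CLS + patch tokens
tokens = tokens + pos_embed                          # simple element-wise addition
```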

Good! Now we have a solid overview of the ViT architecture.

Fine-tuning at higher resolutions

In modern Deep Learning (i.e., after the Transformer arrived in 2017), the standard approach to solving a problem became:

Step 1: Pre-train a very large neural network on a very large dataset

Step 2: Fine-tune it on the task we want to solve.

In computer vision, there's a cool trick that can boost performance: fine-tuning a Vision Transformer (ViT) at a higher resolution than the one used during pre-training (which is often low). But what does "fine-tuning at a higher resolution" really mean? Does it mean that we use smaller patches so that we get more tokens, or do we simply take a higher-resolution image? And how does it work?

Let's dive in!

This is a question I personally stumbled over, so I am going to explain it in depth.

When we refer to fine-tuning at a higher resolution, we mean increasing the image resolution, while keeping the patch resolution fixed.

For example, increasing the image resolution from 224×224 to 640×640 results in an increase from 196 to 1600 patches. This poses a challenge because the positional embedding matrix, originally sized for 196 tokens, no longer matches the new number of tokens.

So, what's the fix?

Interpolation. We extend the original positional embeddings by filling in the gaps using bicubic interpolation, effectively resizing the embeddings to match the new number of patches.

Interpolation, image by author
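Here is a minimal sketch of this trick, assuming PyTorch (the same idea appears in common ViT implementations): the CLS embedding is kept as-is, the 196 patch embeddings are reshaped into their 14×14 grid, resized to 40×40 with bicubic interpolation, and flattened back.

```python
import torch
import torch.nn.functional as F

D = 768
pos_embed = torch.randn(1, 197, D)        # pre-trained: CLS + 14*14 patch positions

cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]

# Reshape the 196 patch embeddings back into their 14x14 spatial grid.
patch_pos = patch_pos.reshape(1, 14, 14, D).permute(0, 3, 1, 2)   # (1, D, 14, 14)

# Bicubic interpolation to the new 40x40 grid (640 / 16 = 40).
patch_pos = F.interpolate(patch_pos, size=(40, 40), mode="bicubic", align_corners=False)

# Flatten back and re-attach the CLS embedding: (1, 1 + 1600, D).
patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, 40 * 40, D)
new_pos_embed = torch.cat([cls_pos, patch_pos], dim=1)
```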

Scaling laws of ViT

Scaling laws of ViT, image taken from original paper

Unlike Convolutional Neural Networks (CNNs), Vision Transformers (ViTs) lack built-in inductive biases such as spatial locality and translation invariance. This absence means that ViTs must learn these patterns purely from the data, making them highly data-hungry models.

Therefore, we might wonder how the performance of ViT evolves with more data and more parameters.

The nice thing about ViTs is that their performance scales well with more data and more parameters.

However, there's a catch. In scenarios with limited data, traditional CNNs often have the upper hand. CNNs are designed with inductive biases that make them more efficient at learning from smaller datasets. They leverage patterns like spatial hierarchies and local features, which allows them to perform better when data is sparse.

Therefore, if you're tackling a problem with limited data, CNNs might be a better choice. But if your dataset is large, ViTs may offer superior performance. The breakeven point depends on the specifics of your data.

What did the model learn exactly?

Let's break down some key insights:

  1. What Do the Embedding Filters Look Like?
  2. How Does the Model Learn Positional Embeddings?
  3. Does the Attention Mechanism Focus on Nearby Tokens or Distant Ones?

1. Embedding Filters:

It is quite interesting to note that the RGB embedding filters learned by the ViT resemble those found in CNNs, capturing fundamental visual textures such as vertical and horizontal lines. Essentially, even though ViTs don't use convolutions, the embeddings they learn serve a similar purpose in identifying and representing basic image features.

2. Positional Embeddings:

When it comes to positional embeddings, ViTs develop grid-like structures. The learned embeddings often exhibit a pattern where the values are similar within the same row or column.

It's interesting that the model learns this kind of positional encoding by itself: it manages to recover the 2D structure of the image, even though it only ever sees a flat list of tokens.

3. Attention Mechanism:

The attention mechanism in ViTs evolves throughout the network layers. In the early stages, it tends to focus on nearby tokens, which is akin to how local features are captured. As you move deeper into the network, the attention mechanism shifts to a more global perspective, allowing the model to integrate information from distant tokens and understand high-level relationships across the entire image.

This progression from local to global attention highlights how ViTs build increasingly complex representations as they process the image.

In summary, while ViTs start by learning basic visual patterns and positional information, they gradually develop the capability to reason about larger and more abstract features of the image.


Congratulations, you've made it!

Thanks for reading! Before you go:

For more awesome tutorials, check my compilation of AI tutorials on Github


You should get my articles in your inbox. Subscribe here.

If you want to have access to premium articles on Medium, you only need a membership for $5 a month. If you sign up with my link, part of your fee supports me at no additional cost to you.


If you found this article insightful and beneficial, please consider following me and leaving a clap for more in-depth content! Your support helps me continue producing content that aids our collective understanding.

References

"An Image is Worth 16×16 Words" by Alexey Dosovitskiy et al. (2021). You can read the full paper on arXiv.

Tags: Artificial Intelligence Computer Vision Data Science Deep Learning Transformers
