Transformers: From NLP to Computer Vision

Introduction
In 2017, the paper "Attention Is All You Need" [1] took the NLP research community by storm. Cited more than 100,000 times so far, its Transformer architecture has become the cornerstone of most major NLP models today. To learn about notable Transformer-based works in NLP, you can take a look at my previous post here.
Meanwhile, Computer Vision (CV) has long been dominated by CNNs, and Transformer applications in the field remained limited until recently. In this article, we will discuss the challenges of applying Transformers to computer vision and how CV researchers have adapted them.
Challenges
Tokenization
Tokenizing a text sequence has long been researched, with various optimizations to generalize and adapt to unseen text. All of these approaches, however, rely on treating characters and words (or subwords) as the basic units.
For images, things are less straightforward. A naive approach is to use each pixel as a token. With self-attention, however, each pixel needs to attend to all other pixels, and the cost grows quadratically with the number of tokens. This is far too computation-intensive for high-resolution images. If not pixels, then what would be a good "token" unit?
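To see why pixel-level attention blows up, here is a quick back-of-the-envelope sketch (my own illustration, not taken from any of the cited papers): the attention matrix has one entry per token pair, so going from pixels to 16×16 patches shrinks it by several orders of magnitude.

```python
# Rough comparison: self-attention builds an N x N attention matrix,
# so its cost grows with N**2.

def num_tokens_pixels(h, w):
    """Every pixel is a token."""
    return h * w

def num_tokens_patches(h, w, p):
    """Every non-overlapping p x p patch is a token (ViT-style)."""
    return (h // p) * (w // p)

h = w = 224                               # a typical ImageNet resolution
n_pix = num_tokens_pixels(h, w)           # 50,176 tokens
n_patch = num_tokens_patches(h, w, 16)    # 196 tokens for 16x16 patches

print(f"pixel tokens: {n_pix:>6}, attention matrix ~ {n_pix**2:.2e} entries")
print(f"patch tokens: {n_patch:>6}, attention matrix ~ {n_patch**2:.2e} entries")
```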
Inductive bias
Compared to CNNs, the Transformer architecture lacks some important inductive biases, such as locality and translation equivariance [2].
- Locality: CNNs assume that pixels close to each other are related, and conversely, that two pixels far apart share very little information. Self-attention, however, lets each token attend to all others globally, so no locality prior is built in.
- Translation equivariance: Unlike text, when we apply certain transformations to an image, such as shifting or flipping, the output should often stay the same or remain closely related to the original output. For example, an image of a banana and its flipped version should both be classified as "fruit". In other words, if our learned model is a function f and we apply a transformation g to the input image x, we would like f to satisfy f(g(x)) = f(x); this is known as "translation invariance". CNNs bring us one step closer to this ideal by ensuring that f(g(x)) = g(f(x)), which is known as "translation equivariance" (see the short sketch after this list).
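As a quick illustration of this property, here is a minimal PyTorch sketch (my own, not from the papers): with circular padding, a convolution commutes with a circular shift of the input, which is exactly f(g(x)) = g(f(x)).

```python
# Translation equivariance of a convolution: shifting the input and then
# convolving gives the same result as convolving first and then shifting,
# i.e. conv(shift(x)) == shift(conv(x)).
import torch
import torch.nn as nn

torch.manual_seed(0)
# Circular padding makes the equivariance exact for circular shifts.
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, padding_mode="circular", bias=False)
x = torch.randn(1, 3, 32, 32)

shift = lambda t: torch.roll(t, shifts=(5, 5), dims=(-2, -1))  # circular shift

out1 = conv(shift(x))   # shift, then convolve
out2 = shift(conv(x))   # convolve, then shift

print(torch.allclose(out1, out2, atol=1e-5))  # True
```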
Vision Transformer
First, we will go through the Vision Transformer (ViT) [3], one of the most significant attempts to apply the Transformer to CV while following the original Transformer architecture [1] as closely as possible.
Instead of using pixels as units, each image is split into smaller patches, which are flattened into a 1D sequence. If the original image has size (H, W, C), where H is the height, W the width, and C the number of channels, and each patch has size (P, P, C), then the image can be represented as a sequence of N = HW/P² patch tokens.
Each token position is also represented by a learned position embedding. The authors compared 1D and 2D-aware position embeddings but found that the 2D version showed no significant improvement, so they stuck with 1D position embeddings. Similar to NLP, ViT also prepends a classification token [CLS] to each sequence, whose output representation is later used to represent the whole image.
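The steps so far can be condensed into a small sketch (my own simplified illustration, assuming PyTorch; the shapes follow the description above, not the official implementation):

```python
# ViT-style patch embedding: split into P x P patches, project each patch,
# prepend a [CLS] token, and add a 1D position embedding.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2        # N = HW / P^2
        # A strided convolution is equivalent to "split into P x P patches,
        # flatten each patch, and apply a shared linear projection".
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))   # prepended [CLS]
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))  # 1D positions

    def forward(self, x):                     # x: (B, C, H, W)
        x = self.proj(x)                      # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)      # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)        # (B, N + 1, dim)
        return x + self.pos_embed             # add 1D position embedding

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                           # torch.Size([2, 197, 768])
```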

For fine-tuning, it is known to be beneficial to use higher-resolution images than during pre-training. ViT keeps the same patch size P, which means we now have longer sequences, and the pre-trained 1D position embeddings no longer match. Therefore, the authors apply 2D interpolation [4] to the pre-trained position embeddings, based on each patch's location in the original image, to obtain embeddings for the new sequence length.
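A rough sketch of this interpolation step (my own, assuming PyTorch; the grid sizes below are illustrative, e.g. pre-training at 224×224 and fine-tuning at 384×384 with P=16):

```python
# 2D-interpolate pre-trained position embeddings for a higher resolution.
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """pos_embed: (1, 1 + old_grid**2, dim) with the [CLS] embedding first."""
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pe.shape[-1]
    # Reshape the 1D sequence of patch embeddings back onto its 2D grid.
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    # Bicubic interpolation onto the new, larger grid.
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pe, patch_pe], dim=1)

pe_224 = torch.randn(1, 1 + 14 * 14, 768)                    # pre-trained at 224x224, P=16
pe_384 = resize_pos_embed(pe_224, old_grid=14, new_grid=24)  # fine-tune at 384x384
print(pe_384.shape)                                          # torch.Size([1, 577, 768])
```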
Due to the lack of inductive bias, ViT initially performed worse than the CNN-based SOTA. However, as the pre-training scale increased, ViT caught up and achieved promising results on multiple image recognition benchmarks.

However, ViT's fixed patch size P has a disadvantage: P needs to be large enough to keep global attention efficient but also small enough to capture small visual entities. The Swin Transformer [5] copes with these challenges using a hierarchical Transformer and shifted windows.
Swin Transformer
While ViT showed promising results when applying the Transformer to CV, the Swin Transformer makes several optimizations to better adapt the technique.
Hierarchical Transformer
On top of ViT-style patches, the Swin Transformer partitions the image into "windows", each consisting of M×M patches, where M is the window size. The model starts out with a small patch size P in the first layers and performs self-attention only inside each window. Since M is constant, the attention's time complexity remains linear in the image size, even with a small P. This is called "window multi-head self-attention" (W-MSA).
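The window partition can be sketched as follows (my own illustration, assuming PyTorch; the shapes are typical Swin-T-style values, e.g. a 56×56 token map with C=96 and M=7, used here only as an example):

```python
# Partition a feature map into non-overlapping M x M windows so that
# self-attention can run independently inside each window.
import torch

def window_partition(x, M):
    """x: (B, H, W, C) -> (num_windows * B, M * M, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)
    return x

x = torch.randn(1, 56, 56, 96)        # stage-1 feature map: H/4 x W/4 tokens, C = 96
windows = window_partition(x, M=7)    # 64 windows, 49 tokens each
print(windows.shape)                  # torch.Size([64, 49, 96])

# Attention costs O(M^4) per window and there are HW / M^2 windows,
# so the total cost grows linearly with H * W instead of quadratically.
```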
Then, neighbouring patches are gradually merged in deeper Transformer layers. For example, starting with patch size P=4 and embedding dimension C, we have H/4 × W/4 tokens, each mapped to a C-dimensional embedding. In the Patch Merging step of the next stage, every 2×2 group of neighbouring patches is merged, leaving H/8 × W/8 patches. The 2×2 patch embeddings are concatenated into a 4C-dimensional vector and fed to a linear layer with output dimension 2C. That is how we obtain the new representation of size H/8 × W/8 × 2C.
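A simplified Patch Merging sketch (my own, assuming PyTorch; the shapes are illustrative):

```python
# Patch Merging: concatenate each 2 x 2 group of neighbouring patch
# embeddings (4C) and project the result down to 2C.
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                        # x: (B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]                 # top-left patch of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]                 # bottom-left
        x2 = x[:, 0::2, 1::2, :]                 # top-right
        x3 = x[:, 1::2, 1::2, :]                 # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(x)                 # (B, H/2, W/2, 2C)

x = torch.randn(1, 56, 56, 96)                   # H/4 x W/4 tokens, C = 96
print(PatchMerging(96)(x).shape)                 # torch.Size([1, 28, 28, 192])
```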

Shifted windows
In the Swin Transformer architecture figure, we can see that each stage has two Swin Transformer blocks: one with the regular W-MSA and one with the shifted SW-MSA. So what is SW-MSA, and why do we need it?
W-MSA lets us attend to small patches efficiently, but it lacks interaction between patches that are neighbours yet fall into different windows. Therefore, the authors introduced "shifted-window multi-head self-attention" (SW-MSA). This bridges patches across windows, enhancing the modelling power of the Swin Transformer.
Given windows of M×M patches (M=4 in the figure), we shift the window partition by (M/2, M/2) to form a new set of windows and perform self-attention inside them. As we can see in the picture above, such shifting results in more windows: the regular W-MSA partition has H/M × W/M windows, while SW-MSA has (H/M + 1) × (W/M + 1) windows.

This may introduce extra time complexity, especially when H/M and W/M are small. Besides, some of the newly formed windows, for example those at the corners, are smaller than full size. Therefore, the authors took one step further and cyclically shift the feature map so that the "incomplete" windows are merged into "complete" ones. Inside these merged windows, attention between patches that were not adjacent in the original image is masked out to avoid irrelevant attention.
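The cyclic-shift trick can be sketched in a few lines (my own illustration, assuming PyTorch; the toy 8×8 map and M=4 are arbitrary):

```python
# Cyclic shift: roll the feature map by (-M/2, -M/2) so that the partial
# windows on the borders get packed together into full M x M windows; an
# attention mask then blocks pairs of patches that were not neighbours
# in the original image.
import torch

M = 4
x = torch.arange(8 * 8).float().view(1, 8, 8, 1)      # toy 8 x 8 feature map

shifted = torch.roll(x, shifts=(-M // 2, -M // 2), dims=(1, 2))

# After the roll, a regular window partition (like the one sketched earlier)
# yields the same number of windows as W-MSA; reversing the roll afterwards
# restores the original layout.
restored = torch.roll(shifted, shifts=(M // 2, M // 2), dims=(1, 2))
print(torch.equal(x, restored))                        # True
```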

Relative position bias
Another interesting optimization in this work is relative position bias [6]. As mentioned above, ViT uses 1D position embeddings, which represent the absolute position of each image patch in the sequence. The Swin Transformer, on the other hand, uses the relative positions between patches within the same window to improve the self-attention mechanism.
The image below intuitively explains relative positions in a 1D sequence. For example, the edge value from P_2 to P_1 is -1, while from P_2 to P_n it is n-2. Therefore, all relative position biases can be represented through a table B, with keys ranging from -n+1 (from P_n to P_1) to n-1 (from P_1 to P_n).
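In code, the 1D case boils down to a lookup table indexed by pairwise distances (a small sketch of my own, assuming PyTorch; in the real model the table entries are learned):

```python
# 1D relative position indices: entry (i, j) is the relative distance j - i,
# used to look up a bias from a table of size 2n - 1.
import torch

n = 5
pos = torch.arange(n)
rel = pos[None, :] - pos[:, None]       # (n, n), values in [-(n-1), n-1]
print(rel)
# tensor([[ 0,  1,  2,  3,  4],
#         [-1,  0,  1,  2,  3],
#         [-2, -1,  0,  1,  2],
#         [-3, -2, -1,  0,  1],
#         [-4, -3, -2, -1,  0]])

bias_table = torch.zeros(2 * n - 1)     # one bias per possible relative distance
bias = bias_table[rel + (n - 1)]        # (n, n) bias added to the attention scores
print(bias.shape)                       # torch.Size([5, 5])
```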

For images, the relative positions are computed along two axes, height and width, but follow a similar pattern. The authors of the original work [6] also pointed out that the maximum distance can be clipped to help the model generalize to new sequence lengths. In our case, M is much smaller than a typical text sequence length, so we will not explore this factor in depth.
In the Swin Transformer, during the self-attention step, the relative position bias is added to the attention scores between patches to provide more locality information. The authors also believe this introduces a form of translation invariance, which benefits general vision modelling tasks.
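Putting it together for a single window and a single attention head, a condensed sketch looks like this (my own illustration, assuming PyTorch; the bias table is zero-initialized here but learnable in the real model):

```python
# Window attention with a relative position bias B:
#   softmax(Q K^T / sqrt(d) + B) V,
# where B is looked up from a (2M-1) x (2M-1) table using the 2D relative
# positions of the patches inside one M x M window.
import torch

M, d = 4, 32                                   # window size, head dimension
q = k = v = torch.randn(M * M, d)              # tokens of one window, one head

coords = torch.stack(torch.meshgrid(
    torch.arange(M), torch.arange(M), indexing="ij"), dim=-1).reshape(-1, 2)
rel = coords[None, :, :] - coords[:, None, :]  # (M*M, M*M, 2), each in [-(M-1), M-1]
idx = (rel[..., 0] + M - 1) * (2 * M - 1) + (rel[..., 1] + M - 1)  # flatten to one index

bias_table = torch.zeros((2 * M - 1) ** 2)     # learnable in the real model
B = bias_table[idx]                            # (M*M, M*M) bias matrix

attn = (q @ k.t()) / d ** 0.5 + B              # add bias to the attention scores
out = attn.softmax(dim=-1) @ v                 # (M*M, d)
print(out.shape)                               # torch.Size([16, 32])
```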

References
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
[2] https://en.wikipedia.org/wiki/Convolutional_neural_network
[3] Dosovitskiy, Alexey, et al. "An image is worth 16×16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
[4] https://en.wikipedia.org/wiki/Interpolation
[5] Liu, Ze, et al. "Swin transformer: Hierarchical vision transformer using shifted windows." Proceedings of the IEEE/CVF international conference on computer vision. 2021.
[6] Shaw, Peter, Jakob Uszkoreit, and Ashish Vaswani. "Self-attention with relative position representations." arXiv preprint arXiv:1803.02155 (2018).