Self-attentive sentence embedding for recommendation systems

Introduction
The transformer layer and its attention mechanism are some of the most impactful ideas in the NLP community. They play a crucial role in many large language models, such as ChatGPT and LLaMA, which have recently taken the world by storm.
However, there is another interesting idea that originated in the NLP community but whose impact is mainly realized in the recommendation field: self-attentive sentence embedding. In this article, I will walk you through self-attentive sentence embedding [1] and how to apply it to recommendation systems.
How it works
Overall idea
The paper's main idea is to encode a sentence into multiple embeddings that capture its various aspects. Specifically, instead of encoding a sentence into a single embedding, the authors encode it into a 2D matrix, where each row embedding captures a different aspect of the sentence:

Once we have the sentence embeddings, we can use them for various downstream tasks, such as sentiment analysis, author profiling, and textual entailment.
Model architecture
The model input is a batch of sentences, each with n tokens. We can represent a sentence S like this:
S = (w_1, w_2, ..., w_n)
Let d denote the hidden dimension of the representation; we can encode the sentence S into an n by d matrix H as:
H = F(w_1, w_2, ..., w_n)
where F denotes the model function that encodes the tokens in the sentence into embeddings. In the paper, the authors encode the tokens with word embeddings (initialized using word2vec) and feed them through a bidirectional LSTM. Since there are many ways to encode tokens into embeddings, I use F here for generality.
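To make this concrete, here is a minimal PyTorch sketch of one possible encoder F, assuming the input is a batch of token IDs; vocab_size, embed_dim, and hidden_size are illustrative names and values, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Encodes a batch of token IDs into an n-by-d matrix H per sentence."""

    def __init__(self, vocab_size: int, embed_dim: int = 100, hidden_size: int = 150):
        super().__init__()
        # Word embeddings; the paper initializes these with word2vec vectors.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Bidirectional LSTM, so the hidden dimension d = 2 * hidden_size.
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True, bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, n) -> H: (batch, n, d)
        embedded = self.embedding(token_ids)
        H, _ = self.lstm(embedded)
        return H
```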
Next, they use the embeddings H as input to learn the attention weight matrix A:
A = softmax(W_{s2} tanh(W_{s1} H^T)), where W_{s1} is a d_a by d weight matrix and W_{s2} is an r by d_a weight matrix
Here, the softmax() is applied to the second dimension of its input. We can view the formula as a 2-layer MLP without bias.
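A hedged PyTorch sketch of that two-layer MLP might look like this; d_a (the hidden size of the MLP) and r are hyperparameters, and the variable names are mine.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Computes the r-by-n attention weight matrix A from H."""

    def __init__(self, d: int, d_a: int = 350, r: int = 30):
        super().__init__()
        # Two linear layers without bias, matching the formula above.
        self.w_s1 = nn.Linear(d, d_a, bias=False)
        self.w_s2 = nn.Linear(d_a, r, bias=False)

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (batch, n, d) -> scores: (batch, n, r) -> A: (batch, r, n)
        scores = self.w_s2(torch.tanh(self.w_s1(H)))
        return torch.softmax(scores.transpose(1, 2), dim=-1)  # softmax over the n tokens
```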
As we can see from the above formula, the attention weight matrix A has a shape of r by n, where r is the number of aspects a sentence can have and n is the sentence length. The authors argue that many aspects make up the semantics of a sentence, so they need r attention vectors to focus on different parts of it. In other words, each row of A is a set of attention weights over the sentence:
A = (a_1, a_2, ..., a_r)^T, where each a_i is an attention weight vector of length n
Just like with the Transformer, we can visualize the matrix A to better understand how much attention each aspect pays to each token in the sentence.
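For example, a quick heatmap with matplotlib could look like the sketch below; here A is a dummy attention matrix standing in for the output of the module above.

```python
import torch
import matplotlib.pyplot as plt

# Dummy attention matrix for one sentence: 5 aspects over 20 tokens.
# In practice, this would be A[0] from the SelfAttention module above.
A = torch.softmax(torch.randn(1, 5, 20), dim=-1)

plt.imshow(A[0].numpy(), cmap="viridis", aspect="auto")
plt.xlabel("token position")
plt.ylabel("aspect")
plt.colorbar()
plt.show()
```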
Finally, we generate the sentence embedding by multiplying A with H to get the r by d matrix M:
M = A H
Each row in M is a weighted sum of the token embeddings, where the weights reflect how much attention an aspect pays to each token. Visually, it looks something like this:

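Putting the pieces together, a usage sketch building on the encoder and attention modules above might look like this (all sizes are illustrative):

```python
import torch

encoder = SentenceEncoder(vocab_size=20_000)     # d = 2 * 150 = 300
attention = SelfAttention(d=300, r=30)

token_ids = torch.randint(0, 20_000, (32, 50))   # 32 sentences, 50 tokens each
H = encoder(token_ids)                           # (32, 50, 300)
A = attention(H)                                 # (32, 30, 50)
M = torch.bmm(A, H)                              # (32, 30, 300): 30 aspect embeddings per sentence
```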
Regularization
In the paper, they also introduce a new regularization term:
P = ||A A^T - I||_F^2
where ||.||_F stands for the Frobenius norm of a matrix.
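A hedged PyTorch sketch of this penalty, assuming A has shape (batch, r, n) as in the sketches above, could be:

```python
import torch

def attention_penalty(A: torch.Tensor) -> torch.Tensor:
    """P = ||A A^T - I||_F^2, averaged over the batch."""
    batch_size, r, _ = A.shape
    identity = torch.eye(r, device=A.device).expand(batch_size, r, r)
    AAT = torch.bmm(A, A.transpose(1, 2))
    return ((AAT - identity) ** 2).sum(dim=(1, 2)).mean()

# The penalty is added to the downstream task loss with a coefficient,
# e.g. loss = task_loss + penalty_coef * attention_penalty(A).
```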
The regularization term serves two purposes:
- To increase diversity: without it, the attention vectors of different aspects can overlap, meaning they can end up similar to each other.
- To make each aspect focus on as few tokens as possible.
Since regularization is not the focus of this article, you can read more about how it works in the paper.
Multi-interest in the recommendation system
Once we understand how self-attentive sentence embedding works, we can focus on how to use it in recommendation systems.
In a large-scale recommendation system, we usually use a two-tower model architecture, where one tower encodes user information and the other encodes candidate information. We use the user's past behaviour, such as sequences of clicks, likes, and shares, along with the user profile, as the input to the user tower. As for the candidate tower, we use candidate features such as the item ID and item category.
We take the dot product of the user embedding and the candidate embedding to measure how relevant the candidate item is to the user. The label is the next item the user interacted with in the sequence; thus, the model's objective is to predict the next items the user might interact with:

As we can see from the architecture above, the user tower's output is a single embedding that captures all the user information. However, a single user embedding is not good at capturing a user's diverse interests. Thus, a better solution is to encode user interests into multiple embeddings.
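Before we add multiple interests, here is a rough sketch of the single-embedding scoring described above; the tensors stand in for the tower outputs, and all shapes are illustrative.

```python
import torch

user_emb = torch.randn(32, 64)      # (batch, d): one embedding per user, from the user tower
item_emb = torch.randn(1000, 64)    # (num_candidates, d): candidate embeddings, from the item tower

# Relevance is the dot product between user and candidate embeddings; the next
# interacted item is the positive label, typically trained with a (sampled) softmax.
logits = user_emb @ item_emb.T      # (32, 1000)
```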
Much research has been done on how to capture users' diverse interests. Two methods that stand out the most are self-attentive embedding (SA) [2] and dynamic routing (DR) [3]. Although both methods have comparable performance, the self-attentive method is more stable and faster to train.
Once we understand how the self-attentive method works, applying it to the recommendation field is straightforward. Instead of using sentence tokens as input, we use user behaviour, such as the list of video IDs a user has watched on YouTube or the item IDs a user has clicked or ordered on an e-commerce platform. As for the output, each embedding encodes a user interest instead of an aspect of a sentence!
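To make the mapping concrete, here is a hedged PyTorch sketch of a multi-interest user tower. It applies the same self-attentive pooling directly over item embeddings (no LSTM, and details such as positional embeddings are omitted), and the usage part shows one common way to pick an interest for training, as in ComiRec: the interest with the highest dot product to the target item.

```python
import torch
import torch.nn as nn

class MultiInterestUserTower(nn.Module):
    """Encodes a user's behaviour sequence into K interest embeddings."""

    def __init__(self, num_items: int, d: int = 64, d_a: int = 128, num_interests: int = 4):
        super().__init__()
        self.item_embedding = nn.Embedding(num_items, d)
        # Same self-attentive pooling as before, but over item embeddings.
        self.w1 = nn.Linear(d, d_a, bias=False)
        self.w2 = nn.Linear(d_a, num_interests, bias=False)

    def forward(self, item_seq: torch.Tensor) -> torch.Tensor:
        # item_seq: (batch, n) item IDs the user has interacted with.
        H = self.item_embedding(item_seq)                                           # (batch, n, d)
        A = torch.softmax(self.w2(torch.tanh(self.w1(H))).transpose(1, 2), dim=-1)  # (batch, K, n)
        return torch.bmm(A, H)                                                      # (batch, K, d)

# Usage sketch: score the target (next) item against every interest and keep the best match.
tower = MultiInterestUserTower(num_items=100_000)
interests = tower(torch.randint(0, 100_000, (32, 20)))               # (32, 4, 64)
target_emb = torch.randn(32, 64)                                     # embedding of the next item
scores = torch.bmm(interests, target_emb.unsqueeze(-1)).squeeze(-1)  # (32, 4)
best_score = scores.max(dim=-1).values                               # (32,): used in the training loss
```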

In the ComiRec paper [2], the authors compare the self-attentive method and the dynamic routing method with other popular models that produce a single user-interest embedding:

As the table in the paper shows, the self-attentive method produces results comparable to those of the dynamic routing method, and both multi-interest embedding solutions are significantly better than their single-interest counterparts.
Wrap up
There are many nuances to training and serving models with multi-interest embeddings. In this article, I walked you through how the self-attentive method works and how to use it in recommendation systems. For more detail on training and serving those models, there is no better resource than the papers themselves. I hope this article serves as a gentle reference on your journey to understanding the multi-interest framework for recommendation systems.
References
[1] Lin, Zhouhan, et al. "A structured self-attentive sentence embedding." arXiv preprint arXiv:1703.03130 (2017).
[2] Cen, Yukuo, et al. "Controllable multi-interest framework for recommendation." Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020.
[3] Li, Chao, et al. "Multi-interest network with dynamic routing for recommendation at Tmall." Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2019.