Introduction to Embedding-Based Recommender Systems

They are everywhere: these sometimes fantastic, sometimes poor, and sometimes even funny recommendations on major websites like Amazon, Netflix, or Spotify, telling you what to buy, watch or listen to next. While recommender systems are convenient for us users – we get inspired to try new things – the companies especially benefit from them.
To understand to what extent, let us take a look at some numbers from the paper Measuring the Business Value of Recommender Systems by Dietmar Jannach and Michael Jugovac [1]. From their paper:
- Netflix: "75 % of what people watch is from some sort of recommendation" (this one is even from Medium!)
- Youtube: "60 % of the clicks on the home screen are on the recommendations"
- Amazon: "about 35 % of their sales originate from cross-sales (i.e., recommendation)", where "their" refers to Amazon
In this paper [1] you can find more interesting statements about increased CTRs, engagement, and sales that you can get from employing recommender systems.
So, it seems like recommenders are the greatest thing since sliced bread, and I also agree that recommenders are one of the best and most interesting things that emerged from the field of Machine Learning. That's why in this article, I want to show you
- how to design an easy collaborative recommender (matrix factorization)
- how to implement it in TensorFlow
- what the advantages and disadvantages are.
You can find the code on my Github.
Before we start, let us grab some data we can play with.
Getting the Data
If you don't have it yet, get tensorflow_datasets via pip install tensorflow-datasets. You can download any dataset they offer, but we will stick to a true classic: movielens! We take the 1m version of the movielens data, consisting of about 1,000,000 rows, so training is faster later.
import tensorflow_datasets as tfds
data = tfds.load("movielens/1m-ratings")
data is a dictionary containing TensorFlow datasets, which are great. But to keep things simple, let's cast it into a pandas dataframe, so everyone is on the same page.
Note: Usually, you would keep it as a TensorFlow dataset, especially if the data gets even larger, since pandas is extremely hungry for your RAM. Do not try to convert the 25,000,000-row version of the movielens dataset to a pandas dataframe!
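If you wanted to stay in the tf.data world instead, a minimal sketch could look like the following; the feature names are the ones TFDS uses for movielens, and we go the pandas route in this article anyway.
# sketch: staying in tf.data instead of pandas
ratings = data["train"].map(
    lambda x: ((x["user_id"], x["movie_id"]), x["user_rating"])
)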
df = tfds.as_dataframe(data["train"])
print(df.head(5))

⚠️ Warning: Don't print the entire dataframe since this is a styled dataframe that's configured to display all 1,000,000 rows by default!
We can see an abundance of data. Each row consists of
- a user (user_id),
- a movie (movie_id),
- the rating that the user gave to the movie (user_rating), expressed as an integer between 1 and 5 (stars), and
- a lot more features about the user and movie.
_In this tutorial, let us only use the bare minimum: user_id, movie_id, and user_rating, since very often this is the only data we have. Having more features about users and movies is usually a luxury, so let us directly deal with the harder, but broadly applicable case. Recommenders trained on this kind of interaction data are called collaborative – a model is trained on the interactions of many users to make recommendations for a single user._ One for all, all for one!
We will also keep the timestamp to conduct a temporal train-test split since this resembles how we train in real life: we train now, but we want the model to work well tomorrow. So we should evaluate the model quality like this as well.
filtered_data = (
    df
    .filter(["timestamp", "user_id", "movie_id", "user_rating"])
    .sort_values("timestamp")
    .astype({"user_id": int, "movie_id": int, "user_rating": int})  # nicer types
    .drop(columns=["timestamp"])  # don't need the timestamp anymore
)
train = filtered_data.iloc[:900000] # chronologically first 90% of the dataset
test = filtered_data.iloc[900000:] # chronologically last 10% of the dataset
filtered_data now contains only the three columns user_id, movie_id, and user_rating, sorted chronologically.

Cold Start Problem
If we split the data in any way, we may run into something called the cold start problem, meaning that some users or movies are only present in the test set, but not in the training set. In our case, funnily enough, user 1 is such an example.
print(train.query("user_id == 1").shape[0])
print(test.query("user_id == 1").shape[0])
# Output:
# 0
# 53
It is a bit like a category of a categorical feature that only appears in the test set. It makes learning harder, but still, the model has to deal with it somehow. The recommender that we will build soon is quite prone to the cold start problem, but there are other types of recommenders that can deal with new users or movies in a better way. This is something for another article, though.
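If you are curious how widespread the problem is, a quick sketch (not part of the original code) counts the users and movies that appear only in the test set:
unseen_users = set(test["user_id"]) - set(train["user_id"])
unseen_movies = set(test["movie_id"]) - set(train["movie_id"])
print(len(unseen_users), len(unseen_movies))  # users/movies the model never sees during training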
Let's build train and test dataframes and move on.
X_train = train.drop(columns=["user_rating"])
y_train = train["user_rating"]
X_test = test.drop(columns=["user_rating"])
y_test = test["user_rating"]
Embeddings Crash Course
Now that we know what the data looks like, let us define the model signature, meaning what goes in and what comes out. In our case, it is quite simple: The input should be a user_id and a movie_id, and the output should be the user_rating, i.e. how the user rates the movie.

But what could such a model look like? This is a tough one, especially for data science beginners. The users and movies are categories, even if we encoded them as integers. So, treating them like actual numbers and naively training a model on them does not make sense.
Something Horrible!
For the curious readers, I will do it anyway. The following is an example of how not to do it:
# BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
hgb = HistGradientBoostingRegressor(random_state=0)
hgb.fit(X_train, y_train)
print(hgb.score(X_test, y_test), mean_absolute_error(y_test, hgb.predict(X_test)))
# Output:
# 0.07018701410615702 0.8508620798953698
# BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD
The r² is about 0.07, which is about as good as a regressor that only outputs the mean of the ratings, independently of the user and movie inputs. The mean absolute error is about 0.85, meaning that we miss the true rating by about 0.85 stars on average.
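If you want to verify the mean-baseline claim, here is a small sketch using scikit-learn's DummyRegressor, whose default strategy is to always predict the mean of y_train:
from sklearn.dummy import DummyRegressor

dummy = DummyRegressor()  # always predicts the mean of y_train
dummy.fit(X_train, y_train)
print(dummy.score(X_test, y_test), mean_absolute_error(y_test, dummy.predict(X_test)))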
Instead of doing it like this, I will show you how to use embeddings to build a more meaningful and better model.
One-Hot Encoding as a Special Case of Embeddings
One way to encode categorical variables such as our users or movies is with vectors, i.e. a tuple of numbers – called embeddings in this context. This is a useful technique to keep in mind, not only for Recommender Systems but whenever you deal with categorical data.

A very simple example of turning categories into numbers is one-hot/dummy encoding. However, **the resulting embeddings are high-dimensional** for high-cardinality categorical features, leading us right into the curse of dimensionality trap when trying to work with them.
Another drawback is that each pair of vectors has the same distance from each other. As an example, if you take a feature with three categories that are encoded as [1, 0, 0], [0, 1, 0], and [0, 0, 1], each category has the same distance to each other category in common metrics such as Euclidean and other Minkowski distances, or cosine similarity. This might be fine for nominal features, but for ordinal features such as hot, mild, or cold weather, it would be nicer if hot is closer to mild than to cold.
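You can convince yourself of this equidistance with a few lines of numpy (a quick illustration, not from the original code):
import numpy as np

hot, mild, cold = np.eye(3)  # the three one-hot vectors
print(np.linalg.norm(hot - mild))  # sqrt(2) ≈ 1.41
print(np.linalg.norm(hot - cold))  # sqrt(2) again – all pairs are equally far apart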
Clearly, this is bad, so we have to think of something different.
The Real Deal
Embeddings allow us to create shorter vectors with more meaning than one-hot encoded vectors.
They are readily available within deep learning frameworks such as TensorFlow and PyTorch. On a very high level, they work like this:
- You specify an embedding dimension, i.e. how long the vector should be. This is a hyperparameter that you could tune, among others.
- The embeddings for each category get initialized randomly, just as any other weight in your neural network.
- Training pushes the embeddings to be more useful to the model.
This is actually not a conceptually new operation, since you can simulate it by first one-hot encoding the category and then applying a linear (dense) layer without activation function or bias. The embedding layer is just more performant, since it does a simple lookup instead of computing a matrix product as the linear layer does.
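To make this equivalence concrete, here is a minimal sketch (with made-up sizes) showing that an embedding lookup gives the same result as one-hot encoding followed by a matrix product:
import tensorflow as tf

emb = tf.keras.layers.Embedding(input_dim=4, output_dim=2)  # 4 categories, vectors of length 2
ids = tf.constant([2, 0, 3])

via_lookup = emb(ids)  # the embedding layer just looks up rows of its weight matrix
via_matmul = tf.one_hot(ids, depth=4) @ emb.embeddings  # one-hot vectors times the weight matrix

print(tf.reduce_all(via_lookup == via_matmul))  # True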

Building the Model
So, now that we have all of the ingredients, let's build a model! First, we will define the high-level architecture of the model, and then we will build it in TensorFlow, although it is similarly easy in PyTorch if you prefer this.
Architecture
Alright, so two categorical variables (user_id and movie_id) enter the model, and we embed them. We end up with two vectors, preferably of the same length. Finally, we want to condense them into a single number, the user_rating.
Note: We will model it as a regression problem, but you can also see it as a classification task.
So, how can we make a single number out of two vectors of the same length? There are many ways, but one of the easiest and most efficient ones is by just taking the dot product.
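As a tiny numerical example with made-up embeddings of length three:
import numpy as np

user_embedding = np.array([0.2, -1.0, 0.5])  # made-up user embedding
movie_embedding = np.array([0.4, 0.1, 2.0])  # made-up movie embedding
print(user_embedding @ movie_embedding)  # 0.08 - 0.1 + 1.0 = 0.98, a single predicted rating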

Note: The approach that we will be taking in the following is also called matrix factorization since we compute dot products all over the place, just as if you multiply two matrices.
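To see where the name comes from: stack all user embeddings into a matrix U and all movie embeddings into a matrix M, and the table of all predicted ratings is just the matrix product of U and the transpose of M. A small sketch with random numbers:
import numpy as np

U = np.random.rand(4, 2)  # 4 users, embedding dimension 2
M = np.random.rand(5, 2)  # 5 movies, embedding dimension 2
all_ratings = U @ M.T     # 4 x 5 matrix; entry (u, m) is the dot product of user u and movie m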
Nothing too crazy, I would argue. Now we are able to look at how the model should work:

As a formula, we created this:

rating(u, m) = user_embedding(u) · movie_embedding(m)

which reads as "the rating of movie m from user u equals the embedding of user u dot the embedding of movie m".
Implementation in TensorFlow, Version One
The implementation is actually a piece of cake if you know basic TensorFlow. The only thing to pay attention to is that the embedding layers want the categories to be represented as integers from 1 to number_of_categories. Very often you find people populating some dictionary like {"user_8323": 1, "user_1122": 2, …} and an inverse dictionary like {1: "user_8323", 2: "user_1122", …} to achieve this, but TensorFlow has some nice layers to take care of this as well. We will use the [IntegerLookup](https://www.tensorflow.org/api_docs/python/tf/keras/layers/IntegerLookup) layer here. A nice feature of this layer: unknown categories get mapped to 0 by default.
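Here is what the layer does on a toy vocabulary (just for illustration):
import tensorflow as tf

lookup = tf.keras.layers.IntegerLookup(vocabulary=[11, 22, 33])
print(lookup([22, 11, 99]))  # [2 1 0] – known ids get indices 1..n, the unknown 99 maps to 0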
Before we start, we have to grab all the unique users and movies from the training set first.
all_users = train["user_id"].unique()
all_movies = train["movie_id"].unique()
Using the functional API of Keras, you can implement the above ideas like this:
import tensorflow as tf
# user pipeline
user_input = tf.keras.layers.Input(shape=(1,), name="user")
user_as_integer = tf.keras.layers.IntegerLookup(vocabulary=all_users)(user_input)
user_embedding = tf.keras.layers.Embedding(input_dim=len(all_users)+1, output_dim=32)(user_as_integer)
# movie pipeline
movie_input = tf.keras.layers.Input(shape=(1,), name="movie")
movie_as_integer = tf.keras.layers.IntegerLookup(vocabulary=all_movies)(movie_input)
movie_embedding = tf.keras.layers.Embedding(input_dim=len(all_movies)+1, output_dim=32)(movie_as_integer)
# dot product
dot = tf.keras.layers.Dot(axes=2)([user_embedding, movie_embedding])
flatten = tf.keras.layers.Flatten()(dot)
# model input/output definition
model = tf.keras.Model(inputs=[user_input, movie_input], outputs=flatten)
model.compile(loss="mse", metrics=[tf.keras.metrics.MeanAbsoluteError()])
Since we gave the user and movie input layers nice names, we can train the model like this:
model.fit(
    x={
        "user": X_train["user_id"],
        "movie": X_train["movie_id"],
    },
    y=y_train.values,
    batch_size=256,
    epochs=100,
    validation_split=0.1,  # for early stopping
    callbacks=[
        tf.keras.callbacks.EarlyStopping(patience=1, restore_best_weights=True)
    ],
)
# Output (for me):
# ...
# Epoch 18/100
# 3165/3165 [==============================] - 8s 3ms/step - loss: 0.7357 - mean_absolute_error: 0.6595 - val_loss: 11.4699 - val_mean_absolute_error: 2.9923
We could evaluate this model on the test set now, but we can already see here that it's probably quite bad, because val_mean_absolute_error is about 3. That means that we are on average 3 stars off, which is horrible in a 5-star system. This is even worse than our bad model from before, which is quite an achievement.
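If you want to confirm this on the test set anyway, a short sketch (reusing mean_absolute_error from before):
predictions = model.predict({
    "user": X_test["user_id"],
    "movie": X_test["movie_id"],
})
print(mean_absolute_error(y_test, predictions))  # roughly 3 stars off, as the validation error suggests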