Recommender Systems From Implicit Feedback Using TensorFlow Recommenders

Author:Murphy | View: 29751 | Time: 2025-03-23 12:10:33


Making recommendations is actually not that hard. You just have to check how your customers rated your products, for example using 1 to 5 stars, and then train a regression model on top of it. Right?

A typical dataset that you would like to have. Image by the author.

Okay, we might have to deal with embeddings if we don't have any numerical user or movie features, but we have seen how to do that in my earlier article:

Introduction to Embedding-Based Recommender Systems

We will also need embeddings in this article, so I suggest reading the article above before you continue.

Implicit Feedback

However, sometimes we are not in the lucky position of having explicit user feedback, such as stars, thumbs up or down, or similar. This happens quite a lot in retail, where we know which customer bought which item, but not if they actually liked it. The only things we get from the customers are implicit signals about their interest in this product.

If they bought (watched, consumed, …) the product, they have shown interest in it. If not, maybe they were not interested, or maybe they just didn't know about it yet. We cannot tell.

This sounds like something we can treat as a classification problem. Interested = 1, not interested = 0. However, there is the small problem that we cannot be sure whether a 0 (not interested) is really a zero. It may also be that the customer just never had the chance to buy the product, but would actually like to.

Let us go back to the movies and assume that we don't have any ratings. We only know which user watched which movie.

Alice watched Gaußzilla and The Markov Chainsaw Massacre, for example. Image by the author.

There are at least two ways we can proceed from here.

Just treat all missing values as zero and then train a binary classifier.
Use a pairwise loss function to ensure that the similarity between a user and a movie they watched is higher than the similarity between the same user and a movie they did not watch.

Treat all missing values as zero

This is the simplest solution. From the incomplete table above, you would create the following dataset:

Note: A = Alice, B = Bob, C = Charlie, G = Gauß, E = Euler, M = Markov

You can interpret the Watched column as a label for whether the user is interested in the movie or not. From this table, you would deduce that, for example, user A does like movie G, but not movie E, which is a bold statement given the data. Maybe A does not know about E yet. Or even worse, it's actually on A's watchlist, but A did not have time to watch it yet.
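As a minimal sketch of this construction, the snippet below expands a toy watch log into one row per (user, movie) pair. Only Alice's two movies come from the example above; the remaining entries and the helper name `to_dense_rows` are assumptions made for illustration.

```python
from itertools import product

# Toy watch log; only Alice's entries come from the example above,
# the rest is assumed for illustration.
watched = {("A", "G"), ("A", "M"), ("B", "E"), ("C", "M")}
users = ["A", "B", "C"]
movies = ["G", "E", "M"]

def to_dense_rows(watched, users, movies):
    """Return one (user, movie, label) row per pair; missing pairs get label 0."""
    return [(u, m, int((u, m) in watched)) for u, m in product(users, movies)]

rows = to_dense_rows(watched, users, movies)
for row in rows:
    print(row)
```

Every pair that is absent from the watch log silently becomes a 0, which is exactly the bold assumption discussed above.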

A problem on the technical side with this approach is that the model learns to say 0 to almost any (user, movie) input because most Watched values are typically zero. Imagine that you have a dataset of 1,000,000 users and 100,000 movies. How many different movies does the average user watch? Maybe 1000? Then only 1% of all Watched labels are 1. So you have a heavily imbalanced dataset, which is nothing bad per se. However, since we artificially create the zeros, it can lead to bad performance.

A computational problem is that this dataset gets huge. 1,000,000 users times 100,000 movies means you have a dataset with 100,000,000,000 rows. And often, you have even more users and items in your database. In this case, you don't put all the zero target rows into your dataset, but you subsample them, which is also known as negative sampling. For example, if you have 1,000,000,000 rows with target 1 in your dataset (= transactions that happened), you could subsample 1,000,000,000 negative samples (= transactions that never happened) as well. Then you have a nice dataset you can train on.
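Negative sampling could look roughly like the sketch below. It draws unobserved pairs uniformly at random, which is only one simple choice (popularity-weighted sampling is another). The function name `sample_negatives` and the toy data are hypothetical.

```python
import random

def sample_negatives(positives, users, movies, n_negatives, seed=0):
    """Draw (user, movie) pairs uniformly at random that are not in `positives`."""
    rng = random.Random(seed)
    positives = set(positives)
    max_negatives = len(users) * len(movies) - len(positives)
    if n_negatives > max_negatives:
        raise ValueError("not enough unobserved pairs to sample from")
    negatives = set()
    while len(negatives) < n_negatives:
        pair = (rng.choice(users), rng.choice(movies))
        if pair not in positives:
            negatives.add(pair)
    return sorted(negatives)

# Toy positives, assumed for illustration.
positives = [("A", "G"), ("A", "M"), ("B", "E"), ("C", "M")]
negatives = sample_negatives(positives, ["A", "B", "C"], ["G", "E", "M"], n_negatives=4)

# Balanced training set: label 1 for observed pairs, 0 for sampled ones.
dataset = [(u, m, 1) for u, m in positives] + [(u, m, 0) for u, m in negatives]
print(dataset)
```

Rejection sampling like this is cheap when the positives are sparse, as they are in the 1% example above, because almost every random pair is unobserved.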

This works, but often not optimally since you make the problem harder than it has to be. You don't have to predict a Watched label perfectly. You only want to rank movies for each user, i.e., you want to be able to say "User A likes movie G more than movie E". The second approach gives us just that.

Use a pairwise loss function

In this approach, we do not tell the model that a user does or does not like a specific movie. We phrase it more carefully:

If a user A watched a movie G, but did not watch another movie E, we only say that A is more interested in G than in E.

This allows us to tackle an easier target. Now, let us start with some formulas, so we can better understand how this intuition translates into an algorithm.

We will train a model that works with embeddings again. Let us assume that we have embeddings for user A, for movie G, and for movie E. If A watched G, but not E, we simply want that

$$e_A \cdot e_G > e_A \cdot e_E,$$

where the e's are the embeddings and · is the dot product. This implies that for user A, movie G is somehow better than movie E. But it is less drastic than saying "A likes G but does not like E" as in the binary classification case.
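To make the condition concrete (the dot product of A's embedding with G's embedding should exceed the dot product with E's embedding), here is a small NumPy sketch. The embedding values are made up, and the BPR-style loss is only one common way to penalize a violated ordering; it is an assumption for illustration and not the loss that TFRS uses, which is discussed in the Loose ends section.

```python
import numpy as np

# Made-up 2-dimensional embeddings, purely for illustration.
e_A = np.array([1.0, 0.5])   # user A
e_G = np.array([0.8, 0.4])   # movie G (watched)
e_E = np.array([-0.2, 0.9])  # movie E (not watched)

score_G = e_A @ e_G  # 0.8 + 0.2 = 1.0
score_E = e_A @ e_E  # -0.2 + 0.45 = 0.25

def pairwise_loss(pos_score, neg_score):
    """BPR-style loss: -log(sigmoid(pos - neg)); small when pos is well above neg."""
    return np.log1p(np.exp(-(pos_score - neg_score)))

print(score_G > score_E)
print(pairwise_loss(score_G, score_E), pairwise_loss(score_E, score_G))
```

The loss only depends on the difference of the two scores, so the model is never asked to push the score of the unwatched movie toward a fixed value such as 0.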

Training something like this sounds way more complicated than training a binary classifier, but several libraries have us covered. I will show you how to do it with TensorFlow Recommenders since this is the most flexible library that I know. Another library worth mentioning, which is easy to use but less flexible, is implicit.

Training With TensorFlow Recommenders

We will now see how easy it is to put the logic described before into code. To put my cards on the table: I'm following the guide from the official TFRS website. I just tried to make it more concise.

Preparations and data generation

First, let us do a

pip install tensorflow tensorflow-recommenders tensorflow-datasets

and then we can load some data via

import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs
import tensorflow as tf

ratings = (
    tfds.load("movielens/100k-ratings", split="train")
    .map(lambda x: {
        "movie_title": x["movie_title"],
        "user_id": x["user_id"],
    })
    .shuffle(10000)
)

ratings_df = tfds.as_dataframe(ratings)

ratings is a TensorFlow dataset, and these are always a bit tedious to handle. For memory-efficient training on huge datasets, you have to use them, though. But for our small example, I try to stay in the friendly dataframe world as much as possible, hence I convert the dataset to the dataframe ratings_df. The data looks like this:

Model definition

We will build a model that has two parts:

a user model
a movie model

These models should take a user or movie respectively, and turn it into an embedding, i.e., a bunch of floats.

embedding_dimension = 32

user_model = tf.keras.Sequential([
  tf.keras.layers.StringLookup(vocabulary=ratings_df["user_id"].unique()),
  tf.keras.layers.Embedding(ratings_df["user_id"].nunique() + 1, embedding_dimension)
])

movie_model = tf.keras.Sequential([
  tf.keras.layers.StringLookup(vocabulary=ratings_df["movie_title"].unique()),
  tf.keras.layers.Embedding(ratings_df["movie_title"].nunique() + 1, embedding_dimension)
])
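Conceptually, each of these two-layer models is a lookup table. The NumPy sketch below mimics the idea and is not the Keras implementation. It also shows why the embedding tables get one extra row: by default, StringLookup reserves index 0 for IDs that are not in the vocabulary. The names `lookup`, `embedding_table`, and `embed` are hypothetical.

```python
import numpy as np

vocabulary = ["A", "B", "C"]  # hypothetical user IDs
embedding_dimension = 4

# Index 0 is reserved for unknown IDs, known IDs start at 1.
lookup = {user_id: i + 1 for i, user_id in enumerate(vocabulary)}

rng = np.random.default_rng(0)
# One row per vocabulary entry plus one row for the unknown-ID index.
embedding_table = rng.normal(size=(len(vocabulary) + 1, embedding_dimension))

def embed(user_ids):
    """Map IDs to row indices (0 if unknown) and return the matching rows."""
    indices = [lookup.get(user_id, 0) for user_id in user_ids]
    return embedding_table[indices]

print(embed(["A", "C", "Z"]).shape)  # (3, 4)
```

During training, only the rows of the table are adjusted; the mapping from IDs to row indices stays fixed.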

Using these two components, we can define the complete model like this:

class MovielensModel(tfrs.Model):
    def __init__(self, user_model, movie_model, task):
        super().__init__()
        self.movie_model = movie_model
        self.user_model = user_model
        self.task = task

    def compute_loss(self, features, training=False):
        user_embeddings = self.user_model(features["user_id"])
        positive_movie_embeddings = self.movie_model(features["movie_title"])

        return self.task(user_embeddings, positive_movie_embeddings)

We can see how the model consists of user_model and movie_model. We will set the task attribute to a retrieval task, which implements exactly what we want. There is also another type of task, called a ranking task, that you can use when you have explicit feedback such as ratings. We will not look into this further in this article.

You can also see that some loss is computed. The input is a dictionary named features that is supposed to look like this:

features = {
    "user_id": ["A", "B", "C"],
    "movie_title": ["G", "E", "M"],
}

It contains a bunch of user IDs as well as a bunch of movie titles. In this example, user A watched G, user B watched E, and user C watched M. We only have positive examples here, i.e., movie sessions that happened in the past.

The users and movies are turned into embeddings and then some loss is computed. I will go into detail later, but be assured that it is doing what we want it to do.

The architecture of the model is like in my other article:

Fitting the model

We can use a nice TFRS prediction class in the end, but for it to work, we need a unique movie list as a TensorFlow dataset.

# a TensorFlow dataset
unique_movies = tf.data.Dataset.from_tensor_slices(ratings_df["movie_title"].unique())

We will need this dataset for the predictions later. First, we can define the task that I talked about before:

task = tfrs.tasks.Retrieval()

We can now fit the model!

model = MovielensModel(user_model, movie_model, task)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

model.fit(ratings.batch(10000).cache(), epochs=5)

You will see something like this in the end:

It is hard to give meaning to the loss, but the smaller the better. I will go into a bit more detail soon, but let us use our model first to predict some movies!

Prediction time

First, you have to define something called an index. You can then use this index to get predictions.

index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)

index.index_from_dataset(
  tf.data.Dataset.zip((unique_movies.batch(100), unique_movies.batch(100).map(model.movie_model)))
)

Note: You don't use model anymore. You only used it to adjust the parameters for user_model and movie_model, and now you use these two submodels directly. You can basically throw the TFRS model away at this point.

Now, you can pass a user to index. The index will

turn this user ID into an embedding – this is possible because you passed it user_model – and then
compute the embedding for each movie – this is possible because of executing the index_from_dataset function – and then
output the movie titles whose embeddings are closest to the user embedding.

Note: The index is doing an exact nearest neighbor search under the hood, which can be slow. TFRS also supports an approximate nearest neighbor search using ScaNN. You can use it by typing ScaNN instead of BruteForce.

It works like this:

_, titles = index(tf.constant(["99"]))
print(f"Recommendations for user 99: {titles[0, :3]}")

# Output:
# Recommendations for user 99: [b'Sunset Park (1996)' b'Happy Gilmore (1996)' b'High School High (1996)']

Nice! You are now ready to use the model.
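If you are curious what the brute-force index does conceptually, the following NumPy sketch mimics it: score every movie by its dot product with the user embedding and return the best ones. The random embeddings stand in for the outputs of user_model and movie_model, and the catalogue as well as the helper `top_k` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
movie_titles = np.array(["G", "E", "M", "X", "Y"])  # hypothetical catalogue
movie_embeddings = rng.normal(size=(5, 8))          # stand-in for movie_model output
user_embedding = rng.normal(size=8)                 # stand-in for user_model output

def top_k(user_embedding, movie_embeddings, movie_titles, k=3):
    """Score every movie by dot product and return the k best, highest first."""
    scores = movie_embeddings @ user_embedding
    best = np.argsort(-scores)[:k]
    return scores[best], movie_titles[best]

scores, titles = top_k(user_embedding, movie_embeddings, movie_titles)
print(titles)
```

The cost of this exact search grows linearly with the number of movies, which is why approximate methods become attractive for large catalogues.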

Loose ends

There is still something I promised I would get into: this task and the loss it outputs. It is not very well documented (yet) what is happening inside, but I looked at the source code to see what is going on. You can find the source code I'm referring to here.

I will explain it to you using a small example batch.

The batch going into the model. Image by the author.

It goes into the model, and then embeddings for each user and movie are created using user_model and movie_model. These embeddings go into the retrieval task object.

In the task, all user embeddings are multiplied (dot product) with all movie embeddings. This can be done by a simple matrix multiplication, where the movie matrix is transposed first. Let us assume that we use two-dimensional embeddings to save some space.

The matrix product is

Now, the reasoning is as follows: From the data, we know that

Alice watched Gauß,
Bob watched Euler, and
Charlie watched Markov.

That's why we would like the corresponding numbers in the cells (A, G), (B, E), (C, M) – this is the main diagonal – to have the highest numbers. In this small example, we are way off.
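The matrix multiplication and the check on the main diagonal can be sketched in NumPy as follows. The embedding values are made up for illustration and are not the ones from the figures.

```python
import numpy as np

# Made-up 2-dimensional embeddings.
# User rows: Alice, Bob, Charlie. Movie rows: Gauß, Euler, Markov.
user_embeddings = np.array([[1.0, 0.0],
                            [0.0, 1.0],
                            [1.0, 1.0]])
movie_embeddings = np.array([[0.5, 0.2],
                             [0.9, 0.1],
                             [0.3, 0.3]])

# Entry (i, j) is the dot product of user i with movie j.
scores = user_embeddings @ movie_embeddings.T

# The watched pairs (A, G), (B, E), (C, M) sit on the main diagonal.
diagonal_is_row_max = scores.argmax(axis=1) == np.arange(3)
print(scores)
print(diagonal_is_row_max)
```

With these untrained toy values, none of the diagonal entries is the largest in its row, so training still has work to do.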

To quantify this, the authors of TFRS add another step: a row-wise softmax.

After a row-wise softmax. Note that the sum of each row is 1. Image by the author.

Now, the observation is: if the elements on the main diagonal in the previous matrix are way higher than the other numbers, then the "softmaxed" matrix is close to the identity matrix.

The optimal identity matrix. Image by the author.

This is because if you take the softmax of an array in which one number is way higher than the others, this number will end up close to 1. Since the outputs sum to 1, the other numbers must be close to zero. Just try it out:

import numpy as np

x = np.array([1, 2, 10])
np.exp(x) / np.exp(x).sum() # softmax

# Output:
# array([1.23353201e-04, 3.35308764e-04, 9.99541338e-01])

So, the loss then comes from comparing the matrix from above with the identity matrix. To be precise, the categorical cross-entropy loss is used. However, it is not averaged across the rows, but summed, see here. That's why the loss numbers are always so high. The larger the batches are, the larger the losses will be. So don't be confused if the losses are suddenly super low, just because you changed the batch size from 10,000 to 1,000 or something similar.
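The following NumPy sketch puts the steps together: score matrix, row-wise softmax, and summed cross-entropy against the identity matrix. It is a simplified view under the assumptions described in this section and ignores any optional features of the TFRS task. The function name `retrieval_loss` is hypothetical.

```python
import numpy as np

def retrieval_loss(user_embeddings, movie_embeddings):
    """Sum over rows of the cross-entropy between softmax(scores) and the identity."""
    scores = user_embeddings @ movie_embeddings.T
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.trace(log_softmax)  # only diagonal entries have target 1

rng = np.random.default_rng(0)
for batch_size in (10, 100, 1000):
    users = rng.normal(size=(batch_size, 2))
    movies = rng.normal(size=(batch_size, 2))
    print(batch_size, retrieval_loss(users, movies))
```

Because the loss is a sum over the rows of the batch, it grows with the batch size even if the model quality stays the same. For example, with all-zero embeddings the softmax is uniform and the loss is exactly n · log(n) for a batch of size n.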

Conclusion

In this article, we have learned how to utilize implicit feedback data to build a recommender system. To do this, we used TensorFlow Recommenders since it scales well and is very expressive: you can take any submodels (as long as they output embeddings) and stick them together to train them jointly using the tfrs.Model class.

After training, you can use a convenient class to make actual predictions. If you use ScaNN, this should be quite fast, but if you need a search on steroids, you can use dedicated vector databases like Qdrant. You give it the user and movie embeddings from the trained models, and it does the search for you.

We have also taken a glimpse into the internals of the library to understand where the to-be-minimized loss is coming from, so this library is not pure magic anymore.

If you want to learn how to assess the quality of an implicit feedback recommender, please refer to my other article: