Recommender Systems From Implicit Feedback Using TensorFlow Recommenders

Making recommendations is actually not that hard. You just have to check how your customers rated your products, for example using 1 to 5 stars, and then train a regression model on top of it. Right?

Okay, we might have to deal with embeddings if we don't have any numerical user or movie features, but we have seen how to do that in my earlier article:
We will also need embeddings in this article, so I suggest reading the article above before you continue.
Implicit Feedback
However, sometimes we are not in the lucky position of having explicit user feedback, such as stars, thumbs up or down, or similar. This happens quite a lot in retail, where we know which customer bought which item, but not if they actually liked it. The only things we get from the customers are implicit signals about their interest in this product.
If they bought (watched, consumed, …) the product, they have shown interest in it. If not, maybe they were not interested, or maybe they just didn't know about it yet. We cannot tell.
This sounds like something we can treat as a classification problem. Interested = 1, not interested = 0. However, there is the small problem that we cannot be sure whether a 0 (not interested) is really a zero. It could also be that the customer just never had the chance to buy the product, but would actually like to.
Let us go back to the movies and assume that we don't have any ratings. We only know which user watched which movie.

There are at least two ways we can proceed from here.
- Just treat all missing values as zero and then train a binary classifier.
- Use a pairwise loss function to ensure that the similarity between a user and a movie they watched is higher than the similarity between the same user and a movie they did not watch.
Treat all missing values as zero
This is the simplest solution. From the incomplete table above, you would create the following dataset:
Note: A = Alice, B = Bob, C = Charlie, G = Gauß, E = Euler, M = Markov

You can interpret the Watched column as a label for whether the user is interested in the movie or not. From this table, you would deduce that, for example, user A does like movie G, but not movie E, which is a bold statement given the data. Maybe A does not know about E yet. Or even worse, it's actually on A's watchlist, but A did not have time to watch it yet.
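As a rough sketch of how such a fully labeled dataset could be built, the snippet below fills in a 0 for every (user, movie) pair that was never observed. The watch history in `watched` is made up for illustration, since the original table is not reproduced here.

```python
from itertools import product

users = ["A", "B", "C"]   # Alice, Bob, Charlie
movies = ["G", "E", "M"]  # Gauß, Euler, Markov

# Hypothetical watch history, for illustration only.
watched = {("A", "G"), ("B", "E"), ("C", "M")}

# Every (user, movie) pair becomes a row; pairs missing from `watched` get label 0.
rows = [(u, m, int((u, m) in watched)) for u, m in product(users, movies)]

for user, movie, label in rows:
    print(user, movie, label)
```

Note how the row ("A", "E", 0) claims that A is not interested in E, although all we really know is that A has not watched it.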
A problem on the technical side with this approach is that the model learns to say 0 to almost any (user, movie) input because most Watched values are typically zero. Imagine that you have a dataset of 1,000,000 users and 100,000 movies. How many different movies does the average user watch? Maybe 1000? Then only 1% of all Watched labels are 1. So you have a heavily imbalanced dataset, which is nothing bad per se. However, since we artificially create the zeros, it can lead to bad performance.
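The numbers from this example are easy to double-check. The variable names below are my own, and the 1000 movies per user are just the guess from the paragraph above.

```python
n_users = 1_000_000
n_movies = 100_000
watched_per_user = 1_000  # the guess from the text

total_pairs = n_users * n_movies              # every possible (user, movie) row
positive_pairs = n_users * watched_per_user   # rows with Watched = 1
positive_rate = positive_pairs / total_pairs

print(f"{total_pairs:,} pairs, {positive_pairs:,} positives, rate {positive_rate:.1%}")
# 100,000,000,000 pairs, 1,000,000,000 positives, rate 1.0%
```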
A computational problem is that this dataset gets huge. 1,000,000 users times 100,000 movies means you have a dataset with 100,000,000,000 rows. And often, you have more movies and items in your database. In this case, you don't put all the zero target rows into your dataset, but you subsample, also known as negative sampling. For example, if you have 1,000,000,000 rows with target 1 in your dataset (= transactions that happened), you could subsample 1,000,000,000 negative samples (= transactions that never happened) as well. Then you have a nice dataset you can train on.
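Negative sampling could look roughly like the following sketch. The helper `sample_negatives` is a hypothetical function of mine, not part of any library. It draws unobserved pairs uniformly at random by rejection, which is only efficient when positives are rare, as they are here.

```python
import random

def sample_negatives(positives, users, movies, n_negatives, seed=0):
    """Draw distinct (user, movie) pairs uniformly at random that are not in `positives`."""
    rng = random.Random(seed)
    positives = set(positives)
    max_negatives = len(users) * len(movies) - len(positives)
    if n_negatives > max_negatives:
        raise ValueError("not enough unobserved pairs to sample from")
    negatives = set()
    while len(negatives) < n_negatives:
        pair = (rng.choice(users), rng.choice(movies))
        if pair not in positives:  # reject pairs that actually happened
            negatives.add(pair)
    return sorted(negatives)

users = ["A", "B", "C"]
movies = ["G", "E", "M"]
positives = [("A", "G"), ("B", "E"), ("C", "M")]  # hypothetical watch history

# As many negatives as positives, giving a balanced training set.
negatives = sample_negatives(positives, users, movies, n_negatives=len(positives))
dataset = [(u, m, 1) for u, m in positives] + [(u, m, 0) for u, m in negatives]
print(dataset)
```

Uniform sampling is only one option. Sampling popular items more often is a common variation, but which one works best depends on the data.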
This works, but often not optimally since you make the problem harder than it has to be. You don't have to predict a Watched label perfectly. You only want to rank movies for each user, i.e., you want to be able to say "User A likes movie G more than movie E". The second approach gives us just that.
Use a pairwise loss function
In this approach, we do not tell the model that a user does or does not like a specific movie. We phrase it more carefully:
If a user A watched a movie G, but did not watch another movie E, we only say that A is more interested in G than in E.
This allows us to tackle an easier target. Now, let us start with some formulas, so we can better understand how this intuition translates into an algorithm.
We will train a model that works with embeddings again. Let us assume that we have embeddings for user A, for movie G, and for movie E. If A watched G, but not E, we simply want that

e_A · e_G > e_A · e_E

where the e's are the embeddings and · is the dot product. This implies that for user A, movie G is somehow better than movie E. But it is less drastic than saying "A likes G but does not like E" as in the binary classification case.
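To make this more tangible, here is a minimal sketch of one well-known way to turn such an inequality into a loss, the BPR-style (Bayesian Personalized Ranking) loss. This is an illustration of the general idea with made-up embeddings. It is not the loss that TensorFlow Recommenders uses, which I describe later in this article.

```python
import numpy as np

def pairwise_loss(e_user, e_pos, e_neg):
    """BPR-style loss: small when the watched movie scores higher than the unwatched one."""
    margin = e_user @ e_pos - e_user @ e_neg
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)) for numerical stability
    return np.logaddexp(0.0, -margin)

e_A = np.array([1.0, 0.5])   # made-up embedding for user A
e_G = np.array([0.9, 0.4])   # made-up embedding for movie G (watched)
e_E = np.array([-0.3, 0.2])  # made-up embedding for movie E (not watched)

good = pairwise_loss(e_A, e_G, e_E)  # desired ordering holds
bad = pairwise_loss(e_A, e_E, e_G)   # roles swapped, ordering violated
print(good, bad)
```

The loss only depends on the difference between the two scores, so the model is never forced to push the score of the unwatched movie towards some fixed "not interested" value.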
Training something like this sounds way more complicated than training a binary classifier, but several libraries have us covered. I will show you how to do it with TensorFlow Recommenders since this is the most flexible library that I know. Another library worth mentioning, which is easy to use but inflexible, is implicit.
Training With TensorFlow Recommenders
We will now see how easy it is to put the logic described before into code. To lay my cards on the table, I'm following the guide from the official TFRS website. I just tried to make it more concise.
Preparations and data generation
First, let us do a
pip install tensorflow tensorflow-recommenders tensorflow-datasets
and then we can load some data via
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs
import tensorflow as tf
ratings = (
    tfds.load("movielens/100k-ratings", split="train")
    .map(lambda x: {
        "movie_title": x["movie_title"],
        "user_id": x["user_id"],
    })
    .shuffle(10000)
)
ratings_df = tfds.as_dataframe(ratings)
ratings is a TensorFlow dataset, which is always a bit tedious to handle. For memory-efficient training of huge datasets, you have to use it, though. But for our small example, I try to stay in the friendly dataframe world as much as possible, hence I convert the dataset to the dataframe ratings_df. The data looks like this:

Model definition
We will build a model that has two parts:
- a user model
- a movie model
These models should take a user or movie respectively, and turn it into an embedding, i.e., a bunch of floats.
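Conceptually, such a model is just a lookup: an ID is mapped to an integer index, and the index selects a row of a trainable matrix. The NumPy sketch below mimics this with made-up user IDs. The names `id_to_index`, `embedding_table`, and `embed` are my own, and the numbers are random instead of learned.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dimension = 4  # small for readability

vocabulary = ["A", "B", "C"]  # hypothetical user IDs
# Index 0 is reserved for unknown IDs, so known IDs start at 1.
id_to_index = {user_id: i + 1 for i, user_id in enumerate(vocabulary)}

# One row of floats per index; training would adjust these numbers.
embedding_table = rng.normal(size=(len(vocabulary) + 1, embedding_dimension))

def embed(user_ids):
    """Map a list of string IDs to their embedding rows."""
    indices = [id_to_index.get(user_id, 0) for user_id in user_ids]
    return embedding_table[indices]

print(embed(["A", "C", "unknown"]).shape)  # (3, 4)
```

The extra row for unknown IDs is the reason why the table needs one more row than there are distinct IDs.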
embedding_dimension = 32
user_model = tf.keras.Sequential([
    tf.keras.layers.StringLookup(vocabulary=ratings_df["user_id"].unique()),
    tf.keras.layers.Embedding(ratings_df["user_id"].nunique() + 1, embedding_dimension)
])
movie_model = tf.keras.Sequential([
    tf.keras.layers.StringLookup(vocabulary=ratings_df["movie_title"].unique()),
    tf.keras.layers.Embedding(ratings_df["movie_title"].nunique() + 1, embedding_dimension)
])
Using these two components, we can define the complete model like this:
class MovielensModel(tfrs.Model):
    def __init__(self, user_model, movie_model, task):
        super().__init__()
        self.movie_model = movie_model
        self.user_model = user_model
        self.task = task

    def compute_loss(self, features, training=False):
        # Embed the users and the movies they watched (positives only).
        user_embeddings = self.user_model(features["user_id"])
        positive_movie_embeddings = self.movie_model(features["movie_title"])
        # The task turns the two batches of embeddings into a loss.
        return self.task(user_embeddings, positive_movie_embeddings)
We can see how the model consists of user_model and movie_model. We will set the task attribute to a retrieval task, which implements exactly what we want. There is also another type of task, called a ranking task, that you can use when you have explicit feedback such as ratings. We will not look into this further in this article.
You can also see that some loss is computed. The input is a dictionary named features that is supposed to look like this:
features = {
    "user_id": ["A", "B", "C"],
    "movie_title": ["G", "E", "M"],
}
It contains a bunch of user IDs as well as a bunch of movie titles. In this example, user A watched G, user B watched E, and user C watched M. We only have positive examples here, i.e., movie sessions that happened in the past.
The users and movies are turned into embeddings and then some loss is computed. I will go into detail later, but be assured that it is doing what we want it to do.
The architecture of the model is like in my other article:

Fitting the model
We can use a nice TFRS prediction class in the end, but for it to work, we need a unique movie list as a TensorFlow dataset.
# a TensorFlow dataset
unique_movies = tf.data.Dataset.from_tensor_slices(ratings_df["movie_title"].unique())
Next, we define the task that I talked about before:
task = tfrs.tasks.Retrieval()
We can now fit the model!
model = MovielensModel(user_model, movie_model, task)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))
model.fit(ratings.batch(10000).cache(), epochs=5)
You will see something like this in the end:

It is hard to give meaning to the loss, but the smaller the better. I will go into a bit more detail soon, but let us use our model first to predict some movies!
Prediction time
First, you have to define something called an index. You can then use this index to get predictions.
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
index.index_from_dataset(
tf.data.Dataset.zip((unique_movies.batch(100), unique_movies.batch(100).map(model.movie_model)))
)
Note: You don't use model anymore. You only used it to adjust the parameters of user_model and movie_model, and now you use these two submodels directly. You can basically throw the TFRS model model away at this point.
Now, you can pass a user to index. The index will
- turn this user ID into an embedding, which is possible because you passed it user_model, and then
- use the embedding of each movie, which is available because you executed the index_from_dataset function, and then
- output the movie titles whose embeddings are closest to the user embedding.
Note: It is doing an exact nearest neighbor search under the hood, which can be slow. It also supports an approximate nearest neighbor search using ScaNN. You can use it by typing ScaNN instead of BruteForce.
It works like this:
_, titles = index(tf.constant(["99"]))
print(f"Recommendations for user 99: {titles[0, :3]}")
# Output:
# Recommendations for user 99: [b'Sunset Park (1996)' b'Happy Gilmore (1996)' b'High School High (1996)']
Nice! You are now ready to use the model.
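To make the brute-force search less of a black box, here is a small NumPy sketch of the idea: score every movie against the user embedding with a dot product and keep the best ones. The function `brute_force_top_k` and all embeddings are made up for illustration. In practice, the embeddings would come from the trained submodels, and the library's own implementation may differ in its details.

```python
import numpy as np

def brute_force_top_k(user_embedding, movie_embeddings, movie_titles, k=3):
    """Score every movie by dot product and return the k best, highest first."""
    scores = movie_embeddings @ user_embedding
    top = np.argsort(-scores)[:k]  # indices of the k largest scores
    return [movie_titles[i] for i in top], scores[top]

# Made-up two-dimensional embeddings.
movie_titles = ["G", "E", "M"]
movie_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
user_embedding = np.array([0.9, 0.3])

titles, scores = brute_force_top_k(user_embedding, movie_embeddings, movie_titles, k=2)
print(titles)  # ['G', 'M']
```

Since every movie is scored for every query, the cost grows linearly with the number of movies, which is why approximate methods become attractive for large catalogs.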
Loose ends
There is still something I promised I would get into: this task and the loss it outputs. It is not very well documented (yet) what is happening inside, but I looked at the source code to see what is going on. You can find the source code I'm referring to here.
I will explain it to you using a small example batch.

It goes into the model, and then embeddings for each user and movie are created using user_model and movie_model. These embeddings go into the retrieval task object.
In the task, all user embeddings are multiplied (dot product) with all movie embeddings. This can be done by a simple matrix multiplication, where the movie matrix is transposed first. Let us assume that we use two-dimensional embeddings to save some space.
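In NumPy, this step could be sketched as follows. The embedding values are made up by me and are not the numbers from the figures in this article.

```python
import numpy as np

# Made-up two-dimensional embeddings, one row per batch element.
user_embeddings = np.array([
    [1.0, 0.0],   # A
    [0.0, 1.0],   # B
    [1.0, 1.0],   # C
])
movie_embeddings = np.array([
    [0.5, 2.0],   # G
    [1.0, 0.5],   # E
    [-1.0, 0.0],  # M
])

# scores[i, j] is the dot product of user i with movie j.
scores = user_embeddings @ movie_embeddings.T
print(scores)
```

The result is a square matrix with one row per user and one column per movie in the batch, so the entry for a user and the movie they actually watched sits on the main diagonal.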

The matrix product is

Now, the reasoning is as follows: From the data, we know that
- Alice watched Gauß,
- Bob watched Euler, and
- Charlie watched Markov.
That's why we would like the corresponding numbers in the cells (A, G), (B, E), (C, M) – this is the main diagonal – to have the highest numbers. In this small example, we are way off.
To quantify this, the authors of TFRS add another step: a row-wise softmax.

Now, the observation is: if the elements on the main diagonal in the previous matrix are way higher than the other numbers, then the "softmaxed" matrix is close to the identity matrix.

This is because if you take the softmax of an array in which one number is way higher than the others, this number will be close to 1. Since the outputs sum to 1, the other numbers must be close to zero. Just try it out:
import numpy as np

x = np.array([1, 2, 10])
np.exp(x) / np.exp(x).sum() # softmax
# Output:
# array([1.23353201e-04, 3.35308764e-04, 9.99541338e-01])
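The same effect can be seen for a whole matrix when the softmax is applied to each row separately. The score matrix below is made up so that its diagonal clearly dominates, and `row_softmax` is a small helper of mine.

```python
import numpy as np

def row_softmax(scores):
    """Softmax over each row; subtracting the row maximum avoids overflow."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Made-up score matrix whose diagonal clearly dominates.
scores = np.array([
    [10.0, 1.0, 2.0],
    [0.0, 9.0, 1.0],
    [2.0, 1.0, 12.0],
])

probabilities = row_softmax(scores)
print(probabilities.round(3))
```

Each row sums to 1, and almost all of the mass ends up on the diagonal, so the result is very close to the identity matrix.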
So, the loss comes from comparing the matrix from above with the identity matrix. To be precise, the categorical cross-entropy loss is used. However, it is not the mean across the rows that is taken, but the sum, see here. That's why the loss numbers are always so high. The larger the batches are, the larger the losses will be. So don't be confused if the losses are suddenly much lower just because you changed the batch size from 10,000 to 1,000 or something similar.
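Putting the steps together, a simplified NumPy version of this loss could look like the sketch below. It follows my reading of the source code as described above and is not the library's actual implementation. The function name `in_batch_softmax_loss` is my own. To show the effect of the batch size, it uses all-zero embeddings, for which every row of the softmax is uniform and the loss is exactly n · log(n) for a batch of size n.

```python
import numpy as np

def in_batch_softmax_loss(user_embeddings, movie_embeddings):
    """Cross-entropy between the row-wise softmax of the scores and the identity, summed over rows."""
    scores = user_embeddings @ movie_embeddings.T
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # With the identity matrix as labels, only the diagonal log-probabilities count.
    return -np.trace(log_probs)

# Uninformative (all-zero) embeddings for different batch sizes.
losses = {}
for batch_size in (10, 100, 1000):
    zeros = np.zeros((batch_size, 2))
    losses[batch_size] = in_batch_softmax_loss(zeros, zeros)
    print(batch_size, losses[batch_size])
```

Even though the model is equally uninformed in all three cases, the loss grows quickly with the batch size, which is why loss values are only comparable between runs that use the same batch size.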
Conclusion
In this article, we have learned how to utilize implicit feedback data to build a recommender system. To do this, we used TensorFlow Recommenders since it scales well and is very expressive: you can take any submodels, as long as they output embeddings, and stick them together to jointly train them using the tfrs.Model class.
After training, you can use a convenient class to make actual predictions. If you use ScaNN, this should be quite fast, but if you need a search on steroids, you can use dedicated vector databases like Qdrant. You give it the user and movie embeddings from the trained models, and it does the search for you.
We have also taken a glimpse into the internals of the library to understand where the to-be-minimized loss is coming from, so this library is no longer pure magic.
If you want to learn how to assess the quality of an implicit feedback recommender, please refer to my other article:
I hope that you learned something new, interesting, and valuable today. Thanks for reading!