Building a Recommender System using Machine Learning


The Kaggle Blueprints

"An excellent choice, madam! Our burger pairs perfectly with a side and a drink. May I suggest some options?" (Image by the author)

Welcome to the first edition of a new article series called "The [Kaggle](https://www.kaggle.com/) Blueprints", where we will analyze Kaggle competitions' top solutions for lessons we can apply to our own Data Science projects.

This first edition will review the techniques and approaches from the "OTTO – Multi-Objective Recommender System" competition, which concluded at the end of January 2023.

Problem Statement: Multi-Objective Recommender System

The goal of the "OTTO – Multi-Objective Recommender System" competition was to build a multi-objective recommender system (RecSys) based on a large dataset of implicit user data.

Specifically, in the e-commerce use case, competitors were dealing with the following details:

  • multi-objective: clicks, cart additions, and orders
  • large dataset: over 200 million events for about 1.8 million items
  • implicit user data: previous events in a user session


How to Approach a RecSys for a Large Database of Items

One of the main challenges of this competition was the large number of items to choose from. Feeding all of the available information into a complex model would require extensive computational resources.

Thus, the general baseline most competitors of this challenge followed is the two-stage candidate generation/rerank technique [3]:

  1. Stage 1: candidate generation – This step reduces the number of potential recommendations (candidates) for each user from millions to about 50 to 200 [2]. To handle the amount of data, a simple model is usually used for this step.
  2. Stage 2: reranking – You can use a more complex model for this step, such as a Machine Learning (ML) model. Once you have ranked the reduced set of candidates, you can select the highest-ranked items as recommendations.

Two-stage recommender candidate generation/rerank technique (Image by author, inspired by [3])

Stage 1: Candidate Generation with Co-Visitation Matrix

The first step of the two-stage approach is to reduce the number of potential recommendations (candidates) from millions to about 50 to 200 [2]. To deal with the large number of items, the first model should be simple [5].

You can choose and combine different strategies to reduce the number of items [3]:

  • by user history
  • by popularity – this strategy can also serve as a strong baseline [5]
  • by co-occurrence based on a co-visitation matrix

The most straightforward approach to generate candidates is to use the user history: If a user has viewed an item, they are likely to purchase it as well.

However, if the user has viewed fewer items (e.g., five items) than the number of candidates we want to generate per user (e.g., 50 to 200), we can pad the list of candidates by item popularity or co-occurrence [7] (see the sketch below). Since selection by popularity is straightforward, we will focus on candidate generation by co-occurrence for the rest of this section.
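
To make the padding by popularity concrete, here is a minimal sketch; the session and aid column names and the toy events dataframe are assumptions, loosely modeled on the competition's data format:

import pandas as pd

# Toy event log: one row per user action
events = pd.DataFrame({
    "session": [1, 1, 1, 2, 2],
    "aid":     [10, 11, 10, 10, 12],
})

N_CANDIDATES = 4

# Globally most popular items serve as the fallback candidate source
top_items = events["aid"].value_counts().index.tolist()

def generate_candidates(session_id):
    """Take the user's own history first, then pad with popular items."""
    history = (
        events.loc[events["session"] == session_id, "aid"]
              .drop_duplicates()
              .tolist()
    )
    padding = [aid for aid in top_items if aid not in history]
    return (history + padding)[:N_CANDIDATES]

print(generate_candidates(2))  # e.g., [10, 12, 11]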

The candidate generation by the co-occurrence of two items can be approached with a co-visitation matrix: if user_1 bought item_a and shortly after item_b, we store this information [6, 7].

Minimal example of users' buying behavior for recommender system (Image by the author)
  1. For each item, count the occurrences of every other item within a specified time frame.

Minimal example of co-visitation matrix (Image by the author)

  2. For each item, find the 50 to 200 most frequent items visited after this item.

As you can see from the image above, a co-visitation matrix is not necessarily symmetrical. For example, someone who bought a burger is also likely to buy a drink – but the opposite may not be true.

You can also assign weights to the co-visitation matrix based on proximity. For example, items bought together in the same session could have a higher weight than items a user bought across different shopping sessions.
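
Putting the two steps together, below is a minimal pandas sketch of building a co-visitation matrix. The event log format (session, aid, and ts columns) as well as the window and cutoff values are assumptions for illustration:

import pandas as pd

# Assumed event log: one row per event with a session id ("session"),
# an item id ("aid"), and a timestamp in seconds ("ts")
events = pd.DataFrame({
    "session": [1, 1, 1, 2, 2],
    "aid":     [10, 11, 12, 10, 12],
    "ts":      [0, 60, 7200, 0, 30],
})

TIME_WINDOW = 60 * 60  # only count pairs that occur within one hour
TOP_K = 50             # keep the 50 most co-visited items per item

# Pair up the events of each session ...
pairs = events.merge(events, on="session", suffixes=("", "_next"))

# ... keeping only pairs where a *different* item was visited later,
# within the time window
pairs = pairs[
    (pairs["aid"] != pairs["aid_next"])
    & (pairs["ts_next"] > pairs["ts"])
    & (pairs["ts_next"] - pairs["ts"] <= TIME_WINDOW)
]

# Count co-occurrences: this is the (sparse) co-visitation matrix
covisit = pairs.groupby(["aid", "aid_next"]).size().rename("count").reset_index()

# For each item, keep only the TOP_K most frequently co-visited items
covisit = (
    covisit.sort_values(["aid", "count"], ascending=[True, False])
           .groupby("aid")
           .head(TOP_K)
)
print(covisit)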


The co-visitation matrix resembles matrix factorization done by counting [6]. Matrix factorization is a popular technique for Recommender Systems. Specifically, it is a collaborative filtering method that learns the relationships between items and users.

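For intuition, here is a toy NumPy sketch of matrix factorization on a small implicit-feedback count matrix; the data and hyperparameters are made up for illustration, and a real system would use a dedicated library:

import numpy as np

# Toy implicit-feedback matrix: rows = users, columns = items,
# entries = interaction counts (0 means no observed interaction)
R = np.array([
    [3, 1, 0],
    [2, 0, 1],
    [0, 1, 2],
], dtype=float)

n_users, n_items = R.shape
k = 2  # number of latent factors

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(n_users, k))  # user factors
V = rng.normal(scale=0.1, size=(n_items, k))  # item factors

lr, reg = 0.02, 0.01
for _ in range(1000):
    err = R - U @ V.T                  # reconstruction error
    U += lr * (err @ V - reg * U)      # gradient step on user factors
    V += lr * (err.T @ U - reg * V)    # gradient step on item factors

# Predicted affinity of user 0 for every item
print((U @ V.T)[0].round(2))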

Stage 2: Reranking with GBDT Model

The second step is reranking. While you can achieve good performance with handcrafted rules [1], in theory, an ML model should work better [5].

You can use different Gradient Boosted Decision Tree (GBDT) rankers like XGBRanker or LGBMRanker [2, 3, 4].

Preparation of training data and feature engineering

The training data for the GBDT ranker model should contain the following column categories [2]:

  • User and item pairs from candidate generation – The base of the dataframe is the list of candidates generated in the first stage. For each user, you should end up with N_CANDIDATES candidates; thus, the starting point is a dataframe of shape (N_USERS * N_CANDIDATES, 2)
  • User features – counts, aggregation features, ratio features, etc.
  • Item features – counts, aggregation features, ratio features, etc.
  • User-item features (optional) – You can create user-item interaction features, such as "item clicked"
  • Labels – For each user-item pair, merge the labels (e.g., "bought" or "not bought")

The resulting training dataframe should look something like this.

Training data structure for training a GBDT ranker model for a recommender system (Image by the author)
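
As a sketch of how such a dataframe could be assembled (all column names and the toy feature tables below are assumptions):

import pandas as pd

# Candidates from stage 1: one row per (user, item) pair
candidates = pd.DataFrame({
    "user": [1, 1, 2, 2],
    "item": [10, 11, 10, 12],
})

# Hypothetical per-user and per-item features
user_feats = pd.DataFrame({"user": [1, 2], "user_click_count": [14, 3]})
item_feats = pd.DataFrame({"item": [10, 11, 12], "item_order_ratio": [0.3, 0.1, 0.2]})

# Ground-truth items each user actually bought (used to build the labels)
bought = {(1, 11), (2, 12)}

# Merge features onto the candidate pairs and attach the labels
train_df = (
    candidates
    .merge(user_feats, on="user", how="left")
    .merge(item_feats, on="item", how="left")
)
train_df["label"] = [
    int((u, i) in bought) for u, i in zip(train_df["user"], train_df["item"])
]
print(train_df)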

GBDT ranker models

This step aims to train a GBDT ranker model to select the top_N recommendations.

The GBDT ranker will take three inputs:

  • X_train, X_val: training and validation data frames containing FEATURES
  • y_train, y_val: training and validation data frames containing LABELS
  • group: Note that the FEATURES don't contain the user and item columns [2]. Thus, the model needs to know within which group to rank the items: group = [N_CANDIDATES] * (len(train_df) // N_CANDIDATES)

Below you can find the sample code using XGBoost's native API with a pairwise ranking objective [2].

import xgboost as xgb

# Wrap the training data in a DMatrix; "group" tells XGBoost how many
# consecutive rows belong to each user
dtrain = xgb.DMatrix(X_train,
                     label=y_train,
                     group=group)

# Define model with a pairwise ranking objective
xgb_params = {'objective': 'rank:pairwise'}

# Train
model = xgb.train(xgb_params,
                  dtrain=dtrain,
                  num_boost_round=1000)

Below you can find the sample code with LGBMRanker [4]:

from lightgbm import LGBMRanker

# Define model with the LambdaRank objective, optimizing the NDCG metric
ranker = LGBMRanker(
    objective="lambdarank",
    metric="ndcg",
    n_estimators=1000)

# Train; "group" again defines how many consecutive rows belong to each user
model = ranker.fit(X_train,
                   y_train,
                   group=group)

The GBDT ranking model will rank the items within the specified groups. To retrieve the top_N recommendations, you only need to group the model's output by user and sort by the predicted score.
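
A minimal sketch of this retrieval step, assuming the fitted LGBMRanker from above and a candidate dataframe test_df that holds user and item columns next to the FEATURES (names assumed):

TOP_N = 20

# Score every candidate (user, item) pair with the trained ranker
test_df["score"] = model.predict(test_df[FEATURES])

# Sort candidates by score within each user and keep the top N
recommendations = (
    test_df.sort_values(["user", "score"], ascending=[True, False])
           .groupby("user")
           .head(TOP_N)
           .groupby("user")["item"]
           .apply(list)  # one ranked list of item ids per user
)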

Summary

There are many more lessons to be learned from reviewing the learning resources Kagglers have created during the course of the "OTTO – Multi-Objective Recommender System" competition. There are also many different solutions for this type of problem statement.

In this article, we focused on the general approach that was popular among many competitors: Candidate generation with a co-visitation matrix to reduce the number of potential items to recommend, followed by a GBDT reranker.

Enjoyed This Story?

Subscribe for free to get notified when I publish a new story.


Find me on LinkedIn, Twitter, and Kaggle!

References

[1] Chris Deotte (2022). "Candidate ReRank Model – [LB 0.575]" in Kaggle Notebooks. (accessed February 26, 2023)

[2] Chris Deotte (2022). "How To Build a GBT Ranker Model" in Kaggle Discussions. (accessed February 21, 2023)

[3] Ravi Shah (2022). "Recommendation Systems for Large Datasets" in Kaggle Discussions. (accessed February 21, 2023)

[4] Radek Osmulski (2022). "[polars] Proof of concept: LGBM Ranker" in Kaggle Notebooks. (accessed February 26, 2023)

[5] Radek Osmulski (2022). "Introduction to the OTTO competition on Kaggle (RecSys)" on YouTube. (accessed February 21, 2023)

[6] Radek Osmulski (2022). "What is the co-visitation matrix, really?" in Kaggle Discussions. (accessed February 21, 2023)

[7] Vladimir Slaykovskiy (2022). "Co-visitation Matrix" in Kaggle Notebooks. (accessed February 21, 2023)

Tags: Artificial Intelligence Data Science Editors Pick Recommender Systems The Kaggle Blueprints
