Memory-Efficient Embeddings

Photo by Kostiantyn Vierkieiev on Unsplash

Whenever dealing with categorical data, beginners resort to one-hot encoding. This is often fine, but if you are dealing with thousands or even millions of categories, this approach becomes infeasible, for the following reasons:

  1. Increased dimensionality: For each category, you get an additional feature. This can lead to the curse of dimensionality. The data becomes more sparse, and the model may suffer from increased computational complexity and decreased generalization performance.
  2. Loss of semantics: One-hot encoding treats each category as an independent feature, ignoring any potential semantic relationships between categories. We lose meaningful relationships present in the original categorical variable.

These problems occur in the area of natural language processing (we have a bunch of words) or recommendation systems (we have a bunch of customers and/or articles) and can be overcome with the help of embeddings. However, if you have many of these embeddings, the memory requirements for your model can skyrocket to several gigabytes.

In this article, I want to show you several ways to decrease this memory footprint. One of these ways comes from an interesting paper Compositional Embeddings Using Complementary Partitions for Memory-Efficient Recommendation Systems by Shi et al. We will also do some experiments to see how these methods fare in a rating prediction task.

Embeddings

In short, instead of long, sparse vectors, we want short, dense vectors of some length d – our embeddings. The embedding dimension d is a hyperparameter we can freely choose ourselves.

Left: one-hot encodings, right: embeddings. Image by the author.

Embeddings are real-valued vectors of the categories that are short and should capture some meaning of the categories. The real numbers are learned during training. For a concrete implementation in TensorFlow, check out my other article:

Introduction to Embedding-Based Recommender Systems

Model size

Embeddings are great, and once you have used them you never want to go back again. However, if you have a lot of categories, you have to store a bunch of real numbers for each of them. You can imagine a small embedding table for 6 categories of dimension 4 like this:

Image by the author.

In general, for N categories and dimension d, the table has a size of Nd, and this is where the memory requirement comes from.
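As a quick back-of-the-envelope check (the numbers below are illustrative, not the exact figures from my project), you can estimate the table size in Python like this:

import numpy as np

# Rough size of a float32 embedding table: N rows, d columns, 4 bytes per cell.
N = 1_000_000  # e.g., one million customers
d = 64

size_in_mb = N * d * 4 / 1024**2
print(f"{size_in_mb:.0f} MB")  # roughly 244 MB for this single table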

The size of this table can get out of hand very easily: in my job, I have built an embedding-based recommender system that serves more than a million customers. The recommender is about 700 MB in size, which made it difficult to deploy the model in some places, e.g., BigQuery, where models are only allowed a maximum size of 450 MB. So, let us check how we can approximately halve the size of this model!

Simple Tricks to Reduce the Embedding Table Size

There are several ways to make this table smaller but usually, there is a trade-off:

Reducing the table size reduces your model's performance.

Still, we can try to find the sweet spot between memory and task performance – maybe a 50 MB model with a 91% accuracy is better than a 10 GB model with 93% accuracy. Let us see how we can achieve this.

Reducing the dimension

Perhaps the simplest way to decrease the memory is decreasing the dimension d.

Image by the author.

This can work well, but once your dimension gets too low, your model might not be expressive enough anymore. Imagine decreasing the dimension to d = 1, for example. You might lose a lot of information in this case if you deal with millions of categories. Still, some people like to do a form of this extreme compression under the name target encoding.

Hashing trick

Another way is to decrease the number of categories N by binning several categories together. You can do it systematically, e.g., group several categories together that you think are quite similar. This can be a lot of work, so people often use hash functions to find a random binning.

Image by the author.

In both cases, different categories can get the same embedding. This usually makes it harder to learn good embeddings because you might mix up categories that behave fundamentally differently.

Fewer rows mean some categories have to share the same embedding. Image by the author.
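As a minimal sketch of the hashing trick in TensorFlow (the bin count and the plain modulo hash below are my own illustrative choices; recent TensorFlow versions also ship a tf.keras.layers.Hashing layer you can use instead):

import tensorflow as tf

N = 100        # original number of categories
num_bins = 10  # much smaller table: several categories share each row
d = 64

hashed_embedding = tf.keras.layers.Embedding(num_bins, d)

X = tf.constant([3, 13, 23])                    # all three collide in bin 3
print(hashed_embedding(X % num_bins).shape)     # (3, 64)
print(hashed_embedding.get_weights()[0].shape)  # (10, 64) instead of (100, 64)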

Lowering the float precision

One thing that I have omitted so far is the size of the floats themselves. Instead of float32, we can store our embeddings in float16, or even an 8-bit format. The table keeps the same number of cells, but each cell needs fewer bits. This gives you reduction rates of 50% with float16, or 75% with 8-bit floats.

Note that it is not recommended to just cast all float32s to float16s as this might cause training instabilities, i.e., you get NaNs while training.
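One conservative variant, sketched below under the assumption that you only need the lower precision at serving time, is to train in float32 and cast the finished embedding table down afterwards:

import numpy as np
import tensorflow as tf

embedding_layer = tf.keras.layers.Embedding(100, 64)
_ = embedding_layer(tf.constant([0]))                  # build the layer so the table exists

weights_fp32 = embedding_layer.get_weights()[0]        # (100, 64), float32
weights_fp16 = weights_fp32.astype(np.float16)         # same shape, half the bytes

print(weights_fp32.nbytes, "->", weights_fp16.nbytes)  # 25600 -> 12800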

Compositional Embeddings

Let us now turn to the paper Compositional Embeddings Using Complementary Partitions for Memory-Efficient Recommendation Systems by Shi et al. In their paper, they describe a method to save memory that is better than the hashing trick.

Before we dive into how this works, let us briefly recap how we use our normal embeddings. Let us assume that the N categories are numbered from 0 to N – 1. If we want to retrieve the embedding of category i, we would just look up the i-th row of the matrix and output this vector as the embedding. Done.

The idea

They propose many different versions of their method, but let us start with their so-called quotient-remainder trick. Here, they create two smaller tables with embeddings of dimension d. Let us assume that one table is of size M and the other of size N / M. For a numerical example, let us assume that we have N = 100 categories and set M = 20. Then we have two embedding tables, one of size 20, and one of size 5.

Image by the author.

Let us again assume that our N categories are numbered from 0 to N – 1, and that we want to get the embedding for category i. The idea is to look up two compositional embeddings for this category – one from each table – and assemble a final embedding from these two, for example by adding or multiplying them component-wise. The authors suggest multiplying them.

Thus, we created an embedding using two much smaller tables. In total, we only have to store M ⋅ d + (N / M) ⋅ d = (M + N / M) ⋅ d numbers, which is much smaller than N ⋅ d for any reasonable choice of M.

Their algorithm

And how do we know which embeddings from the smaller tables we should look up? Well, do a division with remainder! Assume that we need the embedding of some category with number i. If you remember your school days, you know that we can find two numbers q and r with 0 ≤ r < M such that i = q ⋅ M + r. q is called the quotient, and r is the remainder.

As an example, take i = 77. We can write 77 as 3 ⋅ 20 + 17, and the quotient 3 and remainder 17 are the indices where we look up the compositional embeddings in the smaller tables.

In general, we get the following algorithm:

From their paper. m = M, D = d, |S| = N. The circle with the dot in the middle denotes the element-wise multiplication of vectors, also known as the Hadamard product.

What about j = 57? No problem, 2 ⋅ 20 + 17. Here, we can see how both categories i and j share the same remainder embedding 17, which relates the categories i = 77 and j = 57. However, their quotient embeddings differ, and hence so does their final embedding.
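In plain Python, this is just divmod; a quick check of the two examples above:

M = 20
for i in (77, 57):
    q, r = divmod(i, M)  # i == q * M + r with 0 <= r < M
    print(i, "->", (q, r))

# 77 -> (3, 17)
# 57 -> (2, 17)  <- same remainder index, different quotient index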

Reduction gain

You can get the quotient and remainder easily via i // M and i % M. I like the idea because it is simple to understand and implement. If you want to minimize the memory using this method, you have to minimize the function f(M) = M + N/M over M. If we ignore that M has to be an integer, we can take the derivative f′(M) = 1 − N/M², and solving f′(M) = 0 gives M = √N. In this case, both tables have about the same length.

This means that we only have to store about 2√N ⋅ d values instead of the old N ⋅ d, which is a huge gain. To use numbers again: we can lower 100 ⋅ d values to 2 ⋅ √100 ⋅ d = 20 ⋅ d values, i.e., a reduction of 80%, independent of d. And if you think this is a lot, plug in larger values for N, which is usually the case when we use embeddings. For N = 10,000, we already get a whopping 98% reduction!
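A tiny script to reproduce these reduction rates (assuming the memory-optimal even split M = √N):

import math

def quotient_remainder_params(N, d, M):
    # One table with ceil(N / M) rows, one with M rows, both of dimension d.
    return (math.ceil(N / M) + M) * d

d = 64
for N in (100, 10_000, 1_000_000):
    M = round(math.sqrt(N))
    reduction = 1 - quotient_remainder_params(N, d, M) / (N * d)
    print(f"N = {N:>9,}: {reduction:.1%} fewer values to store")

# N =       100: 80.0% fewer values to store
# N =    10,000: 98.0% fewer values to store
# N = 1,000,000: 99.8% fewer values to store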

But keep in mind that this level of compression might result in notably worse performance. It is not always the best decision to use the memory-optimal even split. From the model performance perspective, an uneven split M ≠ √N might be better.

Generalization

You can easily generalize this idea. For a given index i, just split it somehow into two smaller numbers. And why only two? You can also split it into more – let's say n – numbers and apply the same logic. One way is to utilize the Chinese remainder theorem, which sounds more complicated than it is. Basically, you pick a bunch of numbers M₁, M₂, …, Mₙ (called moduli) such that

  1. no two of them share a common divisor greater than 1 (i.e., they are pairwise coprime), and
  2. their product is larger than N.

These are technical requirements to ensure that two different categories i and j get different compositional embeddings, just as in the quotient-remainder case.

Then, you just compute i % M₁, i % M₂, …, i % Mₙ, and these are the n indices you look up in the n smaller tables of sizes M₁, M₂, …, Mₙ. The memory requirement is then simply (M₁ + M₂ + … + Mₙ) ⋅ d, which can be as low as n ⋅ ⁿ√N ⋅ d if the moduli are all about the size of ⁿ√N (the n-th root of N). As a numerical example, take N = 100 and n = 5. The reduction is about 87% in this case. For N = 10,000 and n = 5, we get about a 99.7% reduction in memory requirements.
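As a quick sanity check, here is a small script (using the illustrative moduli 7, 11, 13 that also appear in the code later) verifying that these two conditions indeed give every category its own combination of indices:

from math import gcd

N = 100
moduli = (7, 11, 13)  # pairwise coprime, product 1001 > N

assert all(gcd(a, b) == 1 for idx, a in enumerate(moduli) for b in moduli[idx + 1:])

index_tuples = {tuple(i % m for m in moduli) for i in range(N)}
print(len(index_tuples))  # 100 -> no two categories share the same combination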

Again, be careful that this is not too much compression! You will probably have to choose larger moduli, which decreases the reduction, but improves the model's performance.

Implementation in TensorFlow

Before we implement this logic in TensorFlow, let us all get on the same page. Let us recap how you can use embedding layers in TensorFlow:

import tensorflow as tf

N = 100
d = 64

embedding_layer = tf.keras.layers.Embedding(N, d)
X = tf.constant([1, 2, 3])

print("Output shape:", embedding_layer(X).shape)
print("Embedding table shape:", embedding_layer.get_weights()[0].shape)

# Output:
# Output shape: (3, 64)
# Embedding table shape: (100, 64)

Quotient-remainder layer

So far, so easy. Let us now implement the quotient-remainder trick. The following code should make sense to you. X is a tensor (an array) of integer category indices. In the call method, we compute the quotient and remainder and look up the corresponding rows from the compositional tables quotient_embedding and remainder_embedding .

class DivisionCompositionEmbedding(tf.keras.layers.Layer):
    def __init__(self, input_dim, output_dim, divisor, **embedding_layer_kwargs):
        super().__init__()
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.divisor = divisor

        # The quotient table needs ceil(input_dim / divisor) rows so that the
        # largest quotient (input_dim - 1) // divisor is still a valid index.
        quotient_rows = -(-input_dim // divisor)
        self.quotient_embedding = tf.keras.layers.Embedding(quotient_rows, output_dim, **embedding_layer_kwargs)
        self.remainder_embedding = tf.keras.layers.Embedding(divisor, output_dim, **embedding_layer_kwargs)

    def call(self, X):
        # Look up one compositional embedding per table and combine them element-wise.
        quotient_embedding = self.quotient_embedding(X // self.divisor)
        remainder_embedding = self.remainder_embedding(X % self.divisor)

        return quotient_embedding * remainder_embedding

You can use this layer like any other layer now.

N = 100
d = 64
M = 20

embedding_layer = DivisionCompositionEmbedding(N, d, M)
X = tf.constant([1, 2, 3])
print("Output shape:", embedding_layer(X).shape)
print("First embedding table shape:", embedding_layer.get_weights()[0].shape)
print("Second embedding table shape:", embedding_layer.get_weights()[1].shape)

# Output:
# Output shape: (3, 64)
# First embedding table shape: (5, 64)
# Second embedding table shape: (20, 64)

Chinese remainder layer

This one is easy as well. Just look:

class ChineseRemainderCompositionEmbedding(tf.keras.layers.Layer):
    def __init__(self, input_dim, output_dim, moduli, **embedding_layer_kwargs):
        super().__init__()
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.moduli = moduli

        # One small embedding table per modulus; the moduli should be pairwise
        # coprime with a product larger than input_dim.
        self.embeddings = [tf.keras.layers.Embedding(modulus, output_dim, **embedding_layer_kwargs) for modulus in moduli]
        self.multiply = tf.keras.layers.Multiply()

    def call(self, X):
        # Look up one compositional embedding per modulus and multiply them element-wise.
        embeddings = [embedding(X % modulus) for embedding, modulus in zip(self.embeddings, self.moduli)]

        return self.multiply(embeddings)

Let us check the usage again:

N = 100
d = 64
Ms = (7, 11, 13)

embedding_layer = ChineseRemainderCompositionEmbedding(N, d, Ms)
X = tf.constant([1, 2, 3])
print("Output shape:", embedding_layer(X).shape)
print("First embedding table shape:", embedding_layer.get_weights()[0].shape)
print("Second embedding table shape:", embedding_layer.get_weights()[1].shape)
print("Third embedding table shape:", embedding_layer.get_weights()[2].shape)

# Output:
# Output shape: (3, 64)
# First embedding table shape: (7, 64)
# Second embedding table shape: (11, 64)
# Third embedding table shape: (13, 64)

Performance Results

The authors used the Criteo Ad Kaggle Competition dataset for click-through rate prediction to test their method. It has 13 dense features and 26 categorical features. They used two different neural network architectures: DCN and DLRM. Here is one of their results:

Figure 4 (from their paper): Validation loss against the number of iterations when training the DCN (left) and Facebook DLRM (right) networks over 5 trials. Both the mean and standard deviation are plotted. "Full Table" corresponds to the baseline using full embedding tables (without hashing), "Hash Trick" refers to the hashing trick, and "Q-R Trick" refers to the quotient-remainder trick (with element-wise multiplication). Note that the hashing trick and quotient-remainder trick result in an approximate 4× reduction in model size.

You can see that the hashing trick and the quotient-remainder trick produce much smaller models, but that the quotient-remainder trick compresses with less damage: its embeddings deteriorate the model's performance less than the hashing-trick embeddings, at least for this specific dataset and these models. There are more graphs in the paper, so make sure to check it out!

Unfortunately, the authors did not compare their new embedding layers with simpler methods such as dimensionality or precision reduction. It would have been interesting to see what happens if you just decrease d by 80%, for example.

My experiments

I also played around with it by building a simple recommender for the MovieLens 20M dataset. This dataset consists of movie ratings by users, in the form of star ratings from 0.5 to 5. I have built a regression model that

  1. embeds a user,
  2. embeds a movie,
  3. concatenates these embeddings, and
  4. uses a linear layer to output a single number.

Here is the code:

import tensorflow as tf

class Model(tf.keras.Model):
    def __init__(self, user_model, movie_model):
        super().__init__()
        self.user_model = user_model
        self.movie_model = movie_model
        self.concatenate = tf.keras.layers.Concatenate()
        self.linear = tf.keras.layers.Dense(1, activation="sigmoid")

    def call(self, X):
        # Column 0 holds the user id, column 1 the movie id.
        concat = self.concatenate([self.user_model(X[:, 0]), self.movie_model(X[:, 1])])
        return 4.5 * self.linear(concat) + 0.5  # scale the sigmoid output to the rating range [0.5, 5]

Note that the embedding layers are passed into the model, so I can plug in a full (normal) embedding layer, but also the hashing, quotient-remainder, and Chinese remainder variants (see the sketch after the training function below).

Then, you can load the data using polars like this:

# !pip install polars

import polars as pl

TRAIN_SET_SIZE = 19_000_000
EMBEDDING_DIM = 8
BATCH_SIZE = 1024

ratings = (
    pl.scan_csv(
        "ml-20m/ratings.csv",
    )
    .with_columns(pl.col("userId").rank(method="dense").alias("userId_encoded"), pl.col("movieId").rank(method="dense").alias("movieId_encoded"))
    .collect()
    .sample(fraction=1.0, shuffle=True, seed=0)
)

train = ratings.head(TRAIN_SET_SIZE)
evaluation = ratings.tail(-TRAIN_SET_SIZE)

X_train = train.select(["userId_encoded", "movieId_encoded"]).cast(pl.Int32).to_numpy()
y_train = train.select(["rating"]).to_numpy()
X_evaluation = evaluation.select(["userId_encoded", "movieId_encoded"]).cast(pl.Int32).to_numpy()
y_evaluation = evaluation.select(["rating"]).to_numpy()

train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train)).batch(BATCH_SIZE).cache().prefetch(tf.data.AUTOTUNE)
evaluation_ds = tf.data.Dataset.from_tensor_slices((X_evaluation, y_evaluation)).batch(BATCH_SIZE).cache().prefetch(tf.data.AUTOTUNE)

Note that I used a quite small embedding dimension of 8, so there is not much room to decrease it further. We can still decrease it to 1 dimension, which yields a maximum memory reduction rate of 87.5%.

I created a small training function:

def train(user_model, movie_model):
    model = Model(user_model, movie_model)
    model.compile(loss="mse", optimizer="adam")
    model.fit(train_ds, epochs=3, validation_data=evaluation_ds, callbacks=[tf.keras.callbacks.EarlyStopping()])
    model.summary()  # summary() prints the parameter counts itself

    return model
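To give you an idea of how the variants plug into this function, here is a hypothetical wiring of the experiments (the category counts, the even split, and the moduli below are placeholders in the spirit of the settings described above, not my exact values):

# Placeholder category counts; in practice, derive them from the encoded columns,
# e.g. N_USERS = ratings["userId_encoded"].max() + 1 (the dense ranks start at 1).
N_USERS, N_MOVIES, d = 140_000, 27_000, EMBEDDING_DIM

# Baseline: one full embedding table per feature.
full_model = train(
    tf.keras.layers.Embedding(N_USERS, d),
    tf.keras.layers.Embedding(N_MOVIES, d),
)

# Quotient-remainder trick with the memory-optimal even split M ≈ √N.
qr_even_model = train(
    DivisionCompositionEmbedding(N_USERS, d, round(N_USERS ** 0.5)),
    DivisionCompositionEmbedding(N_MOVIES, d, round(N_MOVIES ** 0.5)),
)

# Chinese remainder trick with three pairwise coprime moduli around 200.
crt_model = train(
    ChineseRemainderCompositionEmbedding(N_USERS, d, (199, 200, 201)),
    ChineseRemainderCompositionEmbedding(N_MOVIES, d, (199, 200, 201)),
)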

My results:

In the even case, I have chosen M = √N, i.e., maximum compression. In the uneven case, I used M = N^0.8. For the Chinese remainder trick, I used 3 moduli around 200. Image by the author.

I could replicate that the authors' proposed methods are definitely better than the simple hashing trick. But the biggest surprise was that even reducing the dimension to 1 yields a model that is on par with the original model. We got an 87% reduction for free, in terms of model performance and code changes. Wow.

I'm not saying that simple dimensionality reduction would have worked for the authors as well, but it would have been interesting to see what would have happened compared to their methods. I will drop them a message and update the article whenever I receive an answer.

Note: I also tried using float16 instead of float32, but I ran into the aforementioned numerical issues, and the training could not finish.

Conclusion & Discussion

In this article, we have seen that embeddings are a neat tool to encode categorical data. However, these embeddings can result in huge models that can be hard to deploy, so it is good to know measures to reduce the embedding table size.

First, we have discussed some easy methods, such as

  • decreasing the width of the table (decreasing d)
  • decreasing the height of the table (decreasing N, hashing trick)
  • decreasing the cells' memory (less float precision)

Then, I walked you through another idea by Shi et al.: instead of having a single large table, we keep several smaller ones and piece together the desired embedding from them. This is fundamentally different from

  • having an embedding per category, and
  • randomly throwing together categories to give them the same embedding (hashing)

In their approach, a category gets several compositional embeddings that are then combined, for example by multiplication, into a final embedding. Each of these compositional embeddings is shared by several categories, which is similar to the hashing trick. Still, the authors' constructions ensure that no two different categories end up with the same combination of compositional embeddings, similar to how every category gets its own row in a full embedding table.

It is a great method to have at your disposal, but my experiments show that you should never forget the easy solutions as well.


I hope that you learned something new, interesting, and valuable today. Thanks for reading!

If you have any questions, write me on LinkedIn!

And if you want to dive deeper into the world of algorithms, give my new publication All About Algorithms a try! I'm still searching for writers!

All About Algorithms
