How to Find the Best Multilingual Embedding Model for Your RAG
Embeddings are vector representations that capture the semantic meaning of words or sentences. After data quality, choosing a good embedding model is the most important and most underrated step in optimizing your RAG application. Multilingual use cases are especially challenging, as most models are pre-trained predominantly on English data. The right embeddings make a huge difference – don't just grab the first model you see!
The semantic space determines the relationships between words and concepts. An accurate semantic space improves retrieval performance. Inaccurate embeddings lead to irrelevant chunks or missing information. A better model directly improves your RAG system's capabilities.
In this article, we will create a question-answer dataset from PDF documents in order to find the best model for our task and language. During RAG, if the expected answer is retrieved, it means the embedding model positioned the question and answer close enough in the semantic space.
While we focus on French and Italian, the process can be adapted to any language, since the best-performing model may differ from one language to another.
Embedding Models
There are two main types of embedding models: static and dynamic. Static embeddings like word2vec generate one vector per word; the word vectors are then combined, often by averaging, into a final sentence embedding. These embeddings are rarely used in production anymore because they don't capture how a word's meaning can change depending on the surrounding words.
Dynamic embeddings are based on Transformers like BERT, which incorporate context awareness through self-attention layers, allowing them to represent words based on the surrounding context.
Most current fine-tuned models use contrastive learning. The model learns semantic similarity by seeing both positive and negative text pairs during training.

In an accurate semantic space, words and phrases with similar meanings sit close together, while opposing ones are far apart. Below, you can see a two-dimensional PCA of sentences embedded using bge-base-en-v1.5-angle. The original embeddings have 768 dimensions; PCA reduces them to two.

The graph demonstrates how the semantic space organizes sentences with related meanings close together.
Note how the sentences "That new sports car is sick!" and "I feel sick", which use the word sick with positive and negative connotations respectively, are positioned far apart. The embedding model infers the word's meaning from the surrounding context.
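For reference, a projection like the one above can be reproduced with sentence-transformers and scikit-learn. Below is a minimal sketch, using BAAI/bge-base-en-v1.5 as a stand-in for the exact AnglE-tuned checkpoint:

from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

sentences = [
    "That new sports car is sick!",
    "I feel sick",
    "The vehicle looks amazing",
    "I think I have the flu",
]

# Stand-in checkpoint for the AnglE-tuned bge-base-en-v1.5 used in the figure
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
embeddings = model.encode(sentences)  # shape: (4, 768)
coords = PCA(n_components=2).fit_transform(embeddings)  # reduce 768 dims to 2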
During RAG, the retrieval mechanism uses semantic similarity to identify documents close to the user query. So, an inaccurate semantic space that causes irrelevant documents to be in the vicinity of the query would lead to poor answers.
The wide range of pre-trained Transformers has produced many text embedding models to explore, and it's not easy to determine which one is optimal for a specific use case.
Massive Text Embedding Benchmark

Massive Text Embedding Benchmark (MTEB) evaluates embedding models across 8 tasks and 58 datasets, 10 of which are multilingual, covering 112 languages [3]. The tasks are bitext mining, classification, clustering, pair classification, reranking, retrieval, semantic textual similarity (STS), and summarization.
We focus only on the retrieval task because it is the most relevant for RAG. To evaluate it, each dataset provides a corpus, queries, and query-to-document relevance mappings. The queries and the corpus are embedded, and relevant documents are ranked by cosine similarity.
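Conceptually, the retrieval step reduces to ranking corpus embeddings by their cosine similarity to the query embedding. A minimal NumPy sketch:

import numpy as np

def top_k_by_cosine(query_emb, corpus_embs, k=10):
    # Normalize so that the dot product equals cosine similarity
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(scores)[::-1][:k]  # indices of the k most similar documents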
Model performance is measured by Normalized Discounted Cumulative Gain at 10 (nDCG@10). This metric accounts for the position of relevant documents in the ranking. We won't go into too much detail about nDCG, but we recommend this in-depth article if you want to read more.
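The gist, for binary relevance: each relevant document's gain is discounted by the log of its rank, and the result is normalized by the score of an ideal ordering. A minimal sketch:

import math

def ndcg_at_k(ranked_relevance, k=10):
    # ranked_relevance: 0/1 relevance flags in retrieved order
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_relevance[:k]))
    ideal = sorted(ranked_relevance, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0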
Unfortunately, the initial MTEB retrieval datasets were only in English. Recently, some were added in Polish and Swedish, but multilingual support is still limited [2].

We will evaluate the current top 5 multilingual embedding models from the MTEB Leaderboard [1] against a baseline Sentence Transformers model. This analysis will reveal which models perform best for retrieval in French and Italian, but you can easily customize the evaluation for any language.
First, let's look at some key model properties: sequence length and embedding dimensions. The sequence length is the maximum number of tokens the model can process at once; anything longer gets truncated. The embedding dimension is the size of the output vector. Larger vectors can capture more meaning but are less storage-efficient.

Cohere-embed-multilingual-v3.0 & light-v3.0
Cohere offers a proprietary embedding model accessible through an API, priced at $0.10 per 1M tokens, the same as ada-002. Besides embedding, the model can rank the most relevant documents at the top during retrieval by assessing how well a query matches a document's topic. Cohere also implemented compression-aware training for more efficient storage [4].
The light version of the multilingual Cohere model has an embedding dimension of only 384 [7].
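For reference, a minimal sketch of calling the model through Cohere's Python SDK; v3 models take an input_type flag that distinguishes corpus text from queries (the client interface has changed across SDK versions, so check the current docs):

import cohere

co = cohere.Client("YOUR_API_KEY")  # hypothetical placeholder key

# "search_document" for corpus text, "search_query" for user queries
doc_embs = co.embed(
    texts=["Le rapport couvre la France.", "Il rapporto copre l'Italia."],
    model="embed-multilingual-v3.0",
    input_type="search_document",
).embeddings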
intfloat/multilingual-e5-large
The E5 family of models was trained with weak supervision using contrastive learning on the CCPairs dataset [5]. The dataset was created by extracting text pairs from various sources, including Reddit, StackExchange, Wikipedia, scientific papers, Common Crawl, and news articles. The authors compiled text pairs consisting of a query q and a corresponding passage p, as shown below.
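One practical detail from the model card: E5 models expect every input to carry a "query: " or "passage: " prefix, mirroring the (q, p) pairs they were trained on. A minimal sketch:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

# Prefixes must match the roles the model saw during training
query_emb = model.encode("query: Quali sono le priorità di riforma?")
passage_emb = model.encode("passage: Il rapporto individua le priorità di riforma del paese.")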

text-embedding-3-large
OpenAI claims this is their most performant model, with higher multilingual performance than ada-002. An interesting addition is the option to shorten the embeddings from the default 3072 dimensions to whatever size you need. While lower dimensionality comes with an accuracy trade-off, the flexibility to use smaller embeddings can yield meaningful savings in memory and storage [6].
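A minimal sketch with the openai Python SDK (v1-style client); the dimensions parameter shortens the returned vector server-side:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.embeddings.create(
    model="text-embedding-3-large",
    input="Quel est le sujet principal du rapport ?",
    dimensions=256,  # shorten from the default 3072
)
embedding = resp.data[0].embedding  # list of 256 floats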
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
We use this model as a baseline because it's one of the most downloaded Sentence Transformers models on Hugging Face [8].

It is based on the SBERT architecture, which introduced the idea of a triplet network structure. The network takes three inputs – an anchor, a positive example, and a negative example – and is trained so that the anchor ends up closer to the positive than to the negative example in the embedding space.
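The corresponding objective pushes the anchor-positive distance below the anchor-negative distance by at least a margin. A minimal PyTorch sketch (the same objective ships as torch.nn.TripletMarginLoss):

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Euclidean distances in the embedding space
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    # Penalized whenever the positive is not closer than the negative by the margin
    return F.relu(d_pos - d_neg + margin).mean()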
Generating QA pairs
In order to have as little bias between languages as possible, our chosen documents are the 2023 European Semester Country Reports for France and Italy [9]. They have the same number of pages and a very similar structure, being official reports from the European Commission.
We provide each text chunk from the corpus as context to GPT-3.5-turbo and prompt it to generate a question based on that context. LLMs typically respond more accurately when both the prompt and the desired response are in the same language, so we translated the prompts into Italian and French.
prompt_it = """
Le informazioni sul contesto sono riportate di seguito.
---------------------
{context_str}
---------------------
Date le informazioni sul contesto e non le conoscenze pregresse,
generare solo domande basate sulla richiesta seguente.
Siete un insegnante/professore. Il vostro compito è quello di preparare
{num_questions_per_chunk} domande per un prossimo quiz.
Le domande devono essere di natura diversa nell'insieme del documento.
Limitare le domande alle informazioni di contesto fornite.
Le domande devono essere in italiano.
"""
prompt_fr = """
Les informations contextuelles se trouvent ci-dessous.
---------------------
{context_str}
---------------------
Compte tenu des informations contextuelles et sans connaissances préalables,
générer uniquement des questions basées sur la requête ci-dessous.
Vous êtes enseignant/professeur. Votre tâche consiste à préparer
{num_questions_per_chunk} questions pour un quiz à venir.
Les questions doivent être de nature variée sur l'ensemble du document.
Limitez les questions aux informations contextuelles fournies.
Les questions doivent être en français.
"""
We already split the text into chunks of 1000 characters and ingested them into nodes, roughly as in the sketch below, so we can go straight to creating the dataset afterwards.
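A minimal sketch of that preprocessing step, with hypothetical file names (note that llama-index import paths vary across versions, and SentenceSplitter measures chunk_size in tokens, so a character-based splitter may be substituted):

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

fr_docs = SimpleDirectoryReader(input_files=["france_2023_report.pdf"]).load_data()
it_docs = SimpleDirectoryReader(input_files=["italy_2023_report.pdf"]).load_data()

parser = SentenceSplitter(chunk_size=1000)
fr_nodes = parser.get_nodes_from_documents(fr_docs)
it_nodes = parser.get_nodes_from_documents(it_docs)

With the nodes in place, we generate the QA pairs: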
# Import paths may vary by llama-index version
from llama_index.finetuning import generate_qa_embedding_pairs
from llama_index.llms.openai import OpenAI

fr_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-3.5-turbo"),
    nodes=fr_nodes,
    qa_generate_prompt_tmpl=prompt_fr,
)
fr_dataset.save_json("fr_dataset.json")

it_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-3.5-turbo"),
    nodes=it_nodes,
    qa_generate_prompt_tmpl=prompt_it,
)
it_dataset.save_json("it_dataset.json")
The resulting dataset has four keys: queries, corpus, relevant_docs, and mode (set to text). Each query ID maps to the node ID of its relevant document. Below is an example of a query from the Italian dataset and its corresponding text chunk.

Evaluation metrics
We'll retrieve the top 5 documents and evaluate their relevance using Hit Rate and MRR. Hit Rate checks whether the expected relevant document appears among the top 5; the overall value is the proportion of queries with a hit.
MRR is the reciprocal of the rank at which the relevant document is found (1/rank), or 0 for that query if the document is not retrieved. The final MRR is the average of these per-query values.
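For a single query, both metrics follow directly from the ranked list of retrieved node IDs. A minimal sketch:

def hit_and_reciprocal_rank(retrieved_ids, expected_id, k=5):
    top_k = list(retrieved_ids)[:k]
    if expected_id not in top_k:
        return 0.0, 0.0  # a miss contributes 0 to both metrics
    rank = top_k.index(expected_id) + 1  # 1-based rank
    return 1.0, 1.0 / rank

# Averaging the first values gives the Hit Rate; averaging the second gives the MRR.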
Models' performance

The newest model from OpenAI achieved the top score, which is expected since it was also the highest-performing multilingual model on the MTEB benchmark. The second place was taken by another proprietary system from Cohere. Their normal and light versions had similar results on the French dataset, but the normal version greatly outperformed the light one on the Italian dataset.
We also tried reducing the embeddings of text-embedding-3-large to 256 dimensions, and the performance held up impressively well considering the small size.
The other embedding model from OpenAI, ada-002, scored considerably lower than their newest one, showing that the update brought impressive improvements.
Intfloat's multilingual-e5-large is the best open-source model, scoring significantly higher on the Italian dataset than on the French one.
The results from paraphrase-multilingual-MiniLM-L12-v2 are disappointing. While it was very fast, we had hoped for better performance.
Conclusion
The MTEB leaderboard offers a good initial benchmark for evaluating multilingual models. However, in order to improve retrieval, it is better to customize the evaluation to your needs.
As we've seen, model performance varies significantly across languages. That's why it is important to have a system to quickly evaluate embedding models using one's own documents.
. . .
If you enjoyed this article, join Text Generation – our newsletter has two weekly posts with the latest insights on Generative AI and Large Language Models.
You can find the full code for this project on GitHub.
You can also find me on LinkedIn.
. . .
References
[1] MTEB Leaderboard: https://huggingface.co/spaces/mteb/leaderboard
[2] MTEB GitHub repository: https://github.com/embeddings-benchmark/mteb
[3] MTEB: Massive Text Embedding Benchmark
[4] Introducing Embed v3: https://txt.cohere.com/introducing-embed-v3/
[5] https://huggingface.co/intfloat/multilingual-e5-large
[6] New and improved embedding model: https://openai.com/blog/new-and-improved-embedding-model
[7] https://huggingface.co/Cohere/Cohere-embed-multilingual-light-v3.0
[8] https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
[9] 2023 European Semester Country Reports: https://economy-finance.ec.europa.eu/publications/2023-european-semester-country-reports_en