How and Why to Use LLMs for Chunk-Based Information Retrieval

In this article, I explain how and why it is beneficial to use a Large Language Model (LLM) for chunk-based information retrieval.
I use OpenAI's GPT-4 model as an example, but the approach can be applied with any other LLM, such as Anthropic's Claude or the open models available on Hugging Face.
Considerations on standard information retrieval
The primary concept involves having a list of documents (chunks of text) stored in a database, which can be retrieved based on certain filters and conditions.
Typically, a tool that enables hybrid search is used (such as Azure AI Search, LlamaIndex, etc.), which allows:
- performing a text-based search using term-frequency ranking algorithms such as TF-IDF or BM25;
- conducting a vector-based search, which identifies similar concepts even when different terms are used, by calculating vector distances (typically cosine similarity);
- combining the results of the two searches above, weighting them to highlight the most relevant results (a minimal sketch of such a weighted fusion follows this list).
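To make the combination step concrete, here is a minimal sketch of one possible weighted fusion of the two scores. The min-max normalization and the alpha weight are illustrative assumptions of mine, not a fixed standard; production engines use their own fusion formulas.

import numpy as np

def normalize(scores: np.ndarray) -> np.ndarray:
    # Min-max normalization so keyword and vector scores become comparable.
    spread = scores.max() - scores.min()
    return (scores - scores.min()) / spread if spread > 0 else np.zeros_like(scores)

def hybrid_scores(bm25_scores, cosine_scores, alpha=0.5):
    # alpha weights the keyword signal, (1 - alpha) the semantic signal.
    bm25 = normalize(np.asarray(bm25_scores, dtype=float))
    cos = normalize(np.asarray(cosine_scores, dtype=float))
    return alpha * bm25 + (1 - alpha) * cos

# Made-up scores for three chunks, just to show the shape of the computation.
print(hybrid_scores([2.1, 0.0, 1.3], [0.83, 0.79, 0.88]))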

Figure 1 shows the classic retrieval pipeline:
- the user asks the system a question: "I would like to talk about Paris";
- the system receives the question, converts it into an embedding vector (using the same model applied in the ingestion phase), and finds the chunks with the smallest distances;
- the system also performs a text-based search based on frequency;
- the chunks returned from both processes undergo further evaluation and are reordered based on a ranking formula (see the sketch after this list).
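One common choice for that ranking formula is Reciprocal Rank Fusion (RRF), which merges the two result lists using only the rank positions; Azure AI Search, for example, uses an RRF-style fusion. The sketch below uses the customary constant k = 60 and generic document ids, purely for illustration.

def reciprocal_rank_fusion(result_lists, k=60):
    # Each result list is an ordering of document ids, best first.
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Return the ids ordered by their fused score, best first.
    return sorted(scores, key=scores.get, reverse=True)

# Example: one ranking from the text search, one from the vector search.
text_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_a", "doc_d"]
print(reciprocal_rank_fusion([text_ranking, vector_ranking]))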
This solution achieves good results but has some limitations:
- not all relevant chunks are always retrieved;
- sometimes certain chunks contain anomalies that affect the final response.
An example of a typical retrieval issue
Let's consider the "documents" array, which represents an example of a knowledge base that could lead to incorrect chunk selection.
documents = [
"Chunk 1: This document contains information about topic A.",
"Chunk 2: Insights related to topic B can be found here.",
"Chunk 3: This chunk discusses topic C in detail.",
"Chunk 4: Further insights on topic D are covered here.",
"Chunk 5: Another chunk with more data on topic E.",
"Chunk 6: Extensive research on topic F is presented.",
"Chunk 7: Information on topic G is explained here.",
"Chunk 8: This document expands on topic H. It also talk about topic B",
"Chunk 9: Nothing about topic B are given.",
"Chunk 10: Finally, a discussion of topic J. This document doesn't contain information about topic B"
]
Let's assume we have a RAG system, consisting of a vector database with hybrid search capabilities and an LLM-based prompt, to which the user poses the following question: "I need to know something about topic B."
As shown in Figure 2, the search also returns an incorrect chunk that, while semantically relevant, is not suitable for answering the question and, in some cases, could even confuse the LLM tasked with providing a response.
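To see why such a chunk makes it into the results, it helps to look at the keyword half of the search in isolation. Below is a minimal sketch using the rank_bm25 package (an arbitrary choice of BM25 implementation on my part) over the documents list defined above; because Chunk 9 literally contains the term "topic B", BM25 scores it much like the chunks that genuinely discuss the topic.

from rank_bm25 import BM25Okapi

# Tokenize the chunks and the question in the simplest possible way.
tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

question = "I need to know something about topic B"
scores = bm25.get_scores(question.lower().split())

# Print the chunks from highest to lowest BM25 score.
for doc, score in sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:6.3f}  {doc}")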

In this example, the user requests information about "topic B," and the search returns chunks that include "This document expands on topic H. It also talks about topic B" and "Insights related to topic B can be found here," as well as the chunk stating "Nothing about topic B is given."
While this is the expected behavior of hybrid search (since all of these chunks reference "topic B"), it is not the desired outcome, as the third chunk is returned without recognizing that it isn't helpful for answering the question.
The retrieval didn't produce the intended result, not only because the BM25 search found the term "topic B" in the third chunk, but also because the vector search yielded a high cosine similarity for it.
To understand this, refer to Figure 3, which shows the cosine similarity values of the chunks relative to the question, using OpenAI's text-embedding-ada-002 model for embeddings.

It is evident that the cosine similarity value for "Chunk 9" is among the highest, and that in the similarity ranking, chunk 1, which does not mention "topic B" at all, even sits between this chunk and chunk 10, which does reference "topic B".
This situation remains unchanged even when measuring distance using a different method, as seen in the case of Minkowski distance.
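Similarity values like the ones shown in Figure 3 can be computed with a few lines of code. The sketch below assumes an OpenAI API key is available in the environment and uses the v1 openai Python client together with the documents list defined earlier; the cosine function and the printing format are my own additions.

import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts):
    # Embed a batch of texts with the model mentioned above.
    response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return [np.array(item.embedding) for item in response.data]

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

question = "I need to know something about topic B"
vectors = embed([question] + documents)
question_vector, chunk_vectors = vectors[0], vectors[1:]

# Print the similarity of each chunk to the question, highest first.
pairs = sorted(zip(documents, chunk_vectors),
               key=lambda pair: cosine_similarity(question_vector, pair[1]),
               reverse=True)
for doc, vector in pairs:
    print(f"{cosine_similarity(question_vector, vector):.4f}  {doc}")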
Utilizing LLMs for Information Retrieval: An Example
The solution I will describe is inspired by the code I have published in my GitHub repository https://github.com/peronc/LLMRetriever/.
The idea is to have the LLM analyze which chunks are useful for answering the user's question, not by ranking the returned chunks (as in the case of RankGPT) but by directly evaluating all the available chunks.

In summary, as shown in Figure 4, the system receives a list of documents to analyze, which can come from any data source, such as file storage, relational databases, or vector databases.
The chunks are divided into groups and processed in parallel by a number of threads proportional to the total number of chunks.
The logic for each thread includes a loop that iterates through the input chunks, calling an OpenAI prompt for each one to check its relevance to the user's question.
The prompt returns the chunk along with a boolean value: true if it is relevant and false if it is not.
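Before diving into the full implementation, here is a minimal sketch of that per-chunk relevance check, simplified from the approach used in the repository linked above. The prompt wording, the use of the gpt-4 model name, and the thread count are illustrative choices of mine, not the exact values used in the repository.

from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()

def is_relevant(chunk: str, question: str) -> bool:
    # Ask the model for a strict true/false relevance judgment on a single chunk.
    prompt = (
        "Decide whether the following chunk is useful for answering the question.\n"
        f"Question: {question}\n"
        f"Chunk: {chunk}\n"
        "Answer only with 'true' or 'false'."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("true")

question = "I need to know something about topic B"
# Evaluate every chunk in parallel; the worker count here is arbitrary.
with ThreadPoolExecutor(max_workers=5) as pool:
    flags = list(pool.map(lambda chunk: is_relevant(chunk, question), documents))

relevant_chunks = [chunk for chunk, keep in zip(documents, flags) if keep]
print(relevant_chunks)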
Let's go coding