How to Improve LLMs with RAG


This article is part of a larger series on using large language models in practice. In the previous post, we fine-tuned Mistral-7B-Instruct to respond to YouTube comments using QLoRA. Although the fine-tuned model successfully captured my style when responding to viewer feedback, its responses to technical questions didn't match my explanations. Here, I'll discuss how we can improve LLM performance using retrieval augmented generation (i.e., RAG).

The original RAG system. Image from Canva.

Large language models (LLMs) have demonstrated an impressive ability to store and deploy vast knowledge in response to user queries. While this has enabled the creation of powerful AI systems like ChatGPT, compressing world knowledge in this way has two key limitations.

First, an LLM's knowledge is static, i.e., not updated as new information becomes available. Second, LLMs may have an insufficient "understanding" of niche and specialized information that was not prominent in their training data. These limitations can result in undesirable (and even fictional) model responses to user queries.

One way we can mitigate these limitations is to augment a model via a specialized and mutable knowledge base, e.g., customer FAQs, software documentation, or product catalogs. This enables the creation of more robust and adaptable AI systems.

Retrieval augmented generation, or RAG, is one such approach. Here, I provide a high-level introduction to RAG and share example Python code for implementing a RAG system using LlamaIndex.

What is RAG?

The basic usage of an LLM consists of giving it a prompt and getting back a response.

Basic usage of an LLM i.e. prompt in, response out. Image by author.

RAG works by adding a step to this basic process. Namely, a retrieval step is performed: based on the user's prompt, relevant information is extracted from an external knowledge base and injected into the prompt before it is passed to the LLM.

Overview of RAG system. Image by author.

Why we care

Notice that RAG does not fundamentally change how we use an LLM; it's still prompt-in and response-out. RAG simply augments this process (hence the name).

This makes RAG a flexible and (relatively) straightforward way to improve LLM-based systems. Additionally, since knowledge is stored in an external database, updating system knowledge is as simple as adding or removing records from a table.

Why not fine-tune?

Previous articles in this series discussed fine-tuning, which adapts an existing model for a particular use case. While this is an alternative way to endow an LLM with specialized knowledge, empirically, fine-tuning seems to be less effective than RAG at doing this [1].

How it works

There are 2 key elements of a RAG system: a retriever and a knowledge base.

Retriever

A retriever takes a user prompt and returns relevant items from a knowledge base. This typically works using so-called text embeddings, numerical representations of text in concept space. In other words, these are numbers that represent the meaning of a given text.

Text embeddings can be used to compute a similarity score between the user's query and each item in the knowledge base. The result of this process is a ranking of each item's relevance to the input query.

The retriever can then take the top k (say k=3) most relevant items and inject them into the user prompt. This augmented prompt is then passed into the LLM for generation.

Overview of retrieval step. Image by author.
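
To make the retrieval step concrete, here is a minimal, self-contained sketch (not the article's code) that scores a toy knowledge base against a query using the sentence-transformers library. The model name and example chunks are illustrative assumptions.

```python
# Minimal retrieval sketch (illustrative, not the article's exact code).
# Assumes: pip install sentence-transformers
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy knowledge base of text chunks (placeholders for real document chunks)
chunks = [
    "Fat-tailed distributions assign non-negligible probability to extreme events.",
    "QLoRA fine-tunes large models by training low-rank adapters on quantized weights.",
    "Text embeddings map text to vectors that capture semantic meaning.",
]

# Embed the query and the chunks (normalized so a dot product = cosine similarity)
model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model
query = "What is fat-tailedness?"
query_emb = model.encode(query, normalize_embeddings=True)
chunk_embs = model.encode(chunks, normalize_embeddings=True)

# Rank chunks by similarity to the query and keep the top k
similarities = chunk_embs @ query_emb
top_k = 2
top_idx = np.argsort(similarities)[::-1][:top_k]

# Inject the most relevant chunks into the prompt
context = "\n".join(chunks[i] for i in top_idx)
augmented_prompt = f"Context:\n{context}\n\nQuestion: {query}"
print(augmented_prompt)
```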

Knowledge Base

The next key element of a RAG system is a knowledge base. This houses all the information you want to make available to the LLM. While there are countless ways to construct a knowledge base for RAG, here I'll focus on building one from a set of documents.

The process can be broken down into 4 key steps [2,3]; a code sketch of these steps follows the figure below.

  1. Load docs – This consists of gathering a collection of documents and ensuring they are in a ready-to-parse format (more on this later).
  2. Chunk docs – Since LLMs have limited context windows, documents must be split into smaller chunks (e.g., 256 or 512 characters long).
  3. Embed chunks – Translate each chunk into numbers using a text embedding model.
  4. Load into Vector DB – Load the text embeddings into a database (aka a vector database).
Overview of knowledge base creation. Image by author.
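
Here is what these four steps can look like with LlamaIndex. This is a sketch rather than the notebook's exact code: the folder path, chunk size, and embedding model are assumptions chosen for illustration.

```python
# Knowledge base construction with LlamaIndex (a sketch; names are illustrative).
# Assumes: pip install llama-index llama-index-embeddings-huggingface
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Use a local, open-source embedding model instead of the default one
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.llm = None  # no LLM is needed just to build the index

# 1) Load docs from a local folder (path is an example)
documents = SimpleDirectoryReader("./articles").load_data()

# 2-4) Chunk the docs, embed each chunk, and store the vectors in an index
index = VectorStoreIndex.from_documents(
    documents,
    transformations=[SentenceSplitter(chunk_size=512, chunk_overlap=64)],
)
```

Under the hood, SentenceSplitter handles the chunking (step 2), the embedding model converts each chunk into a vector (step 3), and VectorStoreIndex stores those vectors for retrieval (step 4).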

Some Nuances

While the steps for building a RAG system are conceptually simple, several nuances can make building one (in the real world) more complicated.

Document preparation – The quality of a RAG system is driven by how well useful information can be extracted from source documents. For example, if a document is unformatted and full of images and tables, it will be more difficult to parse than a well-formatted text file.

Choosing the right chunk size – We already mentioned the need for chunking due to LLM context windows. However, there are 2 additional motivations for chunking.

First, it keeps (compute) costs down. The more text you inject into the prompt, the more compute required to generate a completion. The second is performance. Relevant information for a particular query tends to be localized in source documents (often, just 1 sentence can answer a question). Chunking helps minimize the amount of irrelevant information passed into the model [4].

Improving search – While text embeddings enable a powerful and fast way to search, they don't always work as one might hope. In other words, retrieval may return results that are "similar" to the user query yet unhelpful for answering it, e.g., "How's the weather in LA?" may return "How's the weather in NYC?".

The simplest way to mitigate this is through good document preparation and chunking. However, for some use cases, additional strategies for improving search may be necessary, such as adding meta-tags to each chunk, employing hybrid search (which combines keyword- and embedding-based search), or using a reranker, a specialized model that computes the similarity of 2 input pieces of text.
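
To illustrate the reranking idea, the sketch below re-scores a handful of candidate chunks against a query using a cross-encoder from the sentence-transformers library. The model name and candidate texts are placeholders, and this is not tied to any particular RAG framework.

```python
# Reranking sketch with a cross-encoder (model and data are placeholders).
from sentence_transformers import CrossEncoder

query = "How's the weather in LA?"
candidates = [  # e.g., top results returned by embedding-based search
    "How's the weather in NYC?",
    "Los Angeles weather is sunny and mild for most of the year.",
    "LA has a large film industry.",
]

# The cross-encoder scores each (query, candidate) pair jointly, which is
# slower than embedding search but typically more accurate.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, c) for c in candidates])

# Keep the highest-scoring candidates
reranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])
```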

Example code: Improving YouTube Comment Responder with RAG

With a basic understanding of how RAG works, let's see how to use it in practice. I will build upon the example from the previous article, where I fine-tuned Mistral-7B-Instruct to respond to YouTube comments using QLoRA. We will use LlamaIndex to add a RAG system to the fine-tuned model from before.

The example code is freely available in a Colab notebook, which can be run on the free T4 GPU Colab provides. The source files for this example are available in the GitHub repository.
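
To give a flavor of what the notebook does, the sketch below continues from the index built in the knowledge base sketch above: it retrieves the top 3 chunks for a comment and assembles an augmented prompt for the fine-tuned model. The comment and prompt template here are illustrative, not the notebook's exact ones.

```python
# RAG query sketch (continues from `index` above; comment and template are illustrative).
retriever = index.as_retriever(similarity_top_k=3)

comment = "What is fat-tailedness?"
results = retriever.retrieve(comment)

# Concatenate the retrieved chunks into a context block
context = "\n\n".join(r.node.get_content() for r in results)

# Augmented prompt for the fine-tuned Mistral-7B-Instruct model
prompt = f"""[INST] Please respond to the following YouTube comment.
Use the context below if it is relevant.

Context:
{context}

Comment: {comment} [/INST]"""
```

The augmented prompt can then be passed to the fine-tuned model exactly as in the previous article; the only change RAG introduces is the retrieval and prompt-assembly step shown here.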
