Beyond Naive RAG: Advanced Techniques for Building Smarter and Reliable AI Systems

Beyond Naïve RAG: Advanced RAG Techniques (Source: Image by Author)

Have you ever asked a generative AI app, like ChatGPT, a question and found the answer incomplete, outdated, or just plain wrong? What if there was a way to fix this and make AI more accurate? There is! It's called Retrieval Augmented Generation, or RAG. Introduced by Lewis et al. in their seminal paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, RAG has swiftly emerged as a cornerstone technique for enhancing the reliability and trustworthiness of outputs from Large Language Models (LLMs). LLMs store factual knowledge in their parameters, also referred to as parametric memory, and this knowledge is rooted in the data the LLM has been trained on. RAG enhances the knowledge of the LLM by giving it access to an external information store, or knowledge base. This knowledge base is also referred to as non-parametric memory (because it is not stored in the model parameters). In 2024, RAG is one of the most widely used techniques in generative AI applications.

60% of LLM applications utilize some form of RAG

– Databricks

RAG's acceptance is also propelled by the simplicity of the concept. Simply put, a RAG system searches for information from a knowledge base and sends it along with the query to the LLM for the response.

Retrieval Augmented Generation (Source: Image by Author)

While RAG improves the reliability of LLM responses, implementing RAG comes with its own set of challenges. RAG failures are observed at the retrieval stage where the retrieved information can be incomplete or incorrect, and at the generation stage where the LLM may fail to generate the expected response. To address these failures, several techniques and strategies have been discovered and demonstrated.


In this blog post, we will focus on some of the most useful RAG techniques that propel the performance of a RAG system. We will start with a quick refresher on the basics of RAG and RAG evaluations to set the stage. We will cover –

  • What is RAG and why do we need it? – LLM limitations and how RAG helps.
  • Why do RAG failures occur and how to evaluate RAG? – Points of failure in RAG pipelines and evaluation frameworks
  • Naive RAG and its shortcomings – Basic RAG pipelines and why they're not suitable for production
  • Pre-retrieval techniques – Techniques focused on setting up the knowledge base and the user query for better retrieval
  • Advanced Retrieval strategies – Strategies that combine retrieval methods and use repetitive retrieval techniques
  • Post-retrieval techniques – Techniques that optimise the retrieved information and the LLM generation for better outcomes
  • Trade-offs and other considerations – What to bear in mind when designing advanced RAG systems

By the end of this blog, you'll have refreshed your understanding of RAG basics and explored advanced techniques, complete with code snippets, to build production-grade RAG pipelines.


Refresher: What is RAG and why do we need it?

LLMs can generate incomplete, outdated and inaccurate responses (Source: Image by Author)

LLM Limitations

We expect Large Language Models, or LLMs, to know everything, to be up-to-date, and to generate factually accurate responses every time. But the reality is that LLMs fall short due to three main limitations:

  1. Knowledge Cut-off Date : Training an LLM is an expensive and time-consuming process. It takes massive volumes of data and several weeks, or even months, to train an LLM. The data that LLMs are trained on is therefore not current. For example, GPT-4o has knowledge only up to October 2023. Any event that happened after this knowledge cut-off date is not available to the model.
  2. Training Data Limitation : LLMs are trained on large volumes of data from a variety of public sources – like Llama 3 has been trained on a whopping 15 trillion tokens (about 7 times more than Llama 2) – but they do not have any knowledge of information that is not public. Publicly available LLMs have not been trained on information like internal company documents, customer information, product documents, etc. So, LLMs cannot be expected to respond to queries about such information.
  3. Hallucinations : And finally, by design, LLMs are next-word predictors. They are not trained to verify the facts in their responses. Thus, it is observed that LLMs sometimes provide responses that are factually incorrect, and despite being incorrect, these responses sound extremely confident and legitimate. This characteristic of "lying with confidence," called hallucination, has proved to be one of the biggest criticisms of LLMs.

What are AI Hallucinations and Why Are They a Problem? TechTarget

Expectations vs Practical Limitations of Generative AI (Source: Image by Author)

Does that mean this technology is not useful? Absolutely not. LLMs can consume and process information very efficiently. If you can point an LLM to a source of information, it can process that information to generate accurate results. This source of information can be your company documents, third-party databases, or even the internet.

Now when I ask an LLM a question like "Who won the 2024 T20 World Cup?" and provide a source of information, the LLM processes it and responds with the correct answer. This is the main idea behind Retrieval Augmented Generation.

Using external information the LLM generates correct answers (Source: Image by Author)

To read more about the significance of RAG in LLMs, you can also read the following blog –

Context is Key: The Significance of RAG in Language Models

Simple RAG Pipelines

Let's look at a simple definition of RAG.

Retrieval Augmented Generation is the technique of enhancing the parametric memory of an LLM by giving it access to an explicit non-parametric knowledge base. A retriever fetches relevant information from this knowledge base, that information is augmented to the prompt, and the prompt is passed to the LLM so that it can generate a response that is contextual, reliable, and factually accurate.

There are two processes that can be inferred from this definition –

  • A pipeline that accepts a user query, retrieves information relevant to that query and then passes the query along with the retrieved information to the LLM. This we will call the generation pipeline. This pipeline allows for a real-time interaction with the system.
  • The other critical aspect of a RAG system is the creation of the non-parametric knowledge base. It is this knowledge base that the retriever in the generation pipeline fetches information from. To create and maintain the knowledge base, an indexing pipeline has to be set up. This need not be a real-time process. Once the knowledge base is set up, it only needs periodic updates as and when the data in the source systems refreshes.

Getting the Most from LLMs: Building a Knowledge Brain for Retrieval Augmented Generation

A simple indexing pipeline comprises four steps –

  1. Data loading from source systems
  2. Splitting large documents into smaller chunks
  3. Converting chunks into dense embeddings
  4. Storing the embeddings into vector databases

In these four steps, a knowledge base for the generation pipeline is created.

The generation pipeline has three components –

  1. The retriever which searches for information in the knowledge base
  2. Augmentation which is done via prompt engineering techniques like few-shot prompting, chain-of-thought, etc.
  3. The LLM which is responsible for the generation of the final response.
Indexing and generation pipelines (Source: Image by Author)

For a deeper understanding of different components of the basic RAG pipeline, you can refer to the following blogs –

Embeddings – The Blueprint of Contextual AI

Breaking It Down : Chunking Techniques for Better RAG

RAG Value Chain: Retrieval Strategies in Information Augmentation for Large Language Models


If you're interested in setting up a basic RAG system with the two pipelines, you can refer to the code below. In this code, we create a question answering system out of the Wikipedia page for the 2024 T20 Men's Cricket World Cup. We use OpenAI for embeddings & LLMs, and FAISS as the vector index for storing the embeddings. We use LangChain as the orchestration framework here.

#########################
### INDEXING PIPELINE ###
#########################

## Data Loading ##

from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.document_transformers import Html2TextTransformer

#This is the url of the Wikipedia page on the 2024 Men's Cricket World Cup
url="https://en.wikipedia.org/wiki/2024_ICC_Men%27s_T20_World_Cup"

#Instantiating the AsyncHtmlLoader
loader = AsyncHtmlLoader(url)
#Loading the extracted information
data = loader.load()
#Instantiate the Html2TextTransformer function
html2text = Html2TextTransformer()
#Call transform_documents
data_transformed = html2text.transform_documents(data)

## Chunking ##

from langchain_text_splitters import CharacterTextSplitter
#Set the CharacterTextSplitter parameters
text_splitter = CharacterTextSplitter(
    separator="\n", #The character that should be used to split
    chunk_size=1000, #Number of characters in each chunk
    chunk_overlap=200, #Number of overlapping characters between chunks
)
#Create Chunks
chunks=text_splitter.create_documents([data_transformed[0].page_content])

## Embeddings & Vector Storage

# Import OpenAIEmbeddings from the library
from langchain_openai import OpenAIEmbeddings
# Import FAISS class from vectorstore library
from langchain_community.vectorstores import FAISS
# Instantiate the embeddings object
embeddings=OpenAIEmbeddings(model="text-embedding-3-large")
# Create the knowledge base
db=FAISS.from_documents(chunks,embeddings)

## 'db' is the knowledge base and we can store it in persistent memory ##
db.save_local("../../Assets/Data")

##############################INDEXING PIPELINE ENDS ####################
#########################################################################
###########################
### GENERATION PIPELINE ###
###########################

## Retrieval function ##

def retrieve_context(query, db_path):
    embeddings=OpenAIEmbeddings(model="text-embedding-3-large")

    # Load the database stored in the local directory
    db=FAISS.load_local(db_path, embeddings, allow_dangerous_deserialization=True)

    # Ranking the chunks in descending order of similarity
    docs = db.similarity_search(query)
    # Selecting first chunk as the retrieved information
    retrieved_context=docs[0].page_content

    return str(retrieved_context)

## Augmentation Function ##
def create_augmented(query, db_path):

    retrieved_context=retrieve_context(query,db_path)

    # Creating the prompt
    augmented_prompt=f"""
    Given the context below answer the question.

    Question: {query}

    Context : {retrieved_context}

    Remember to answer only based on the context provided and not from
    any other source.

    If the question cannot be answered based on the provided context,
    say I don't know.
    """

    return str(augmented_prompt)

## Generation Function ##
def create_rag(query, db_path):

    augmented_prompt=create_augmented(query,db_path)

    # Importing the OpenAI library
    from openai import OpenAI

    # Instantiate the OpenAI client
    client = OpenAI()

    # Make the API call passing the augmented prompt to the LLM
    response = client.chat.completions.create(
    model="gpt-4o",
    messages= [
        {"role": "user", "content": augmented_prompt}
    ]
    )

    # Extract the answer from the response object
    answer=response.choices[0].message.content

    return answer
############################
## Question Answer System ##
############################

create_rag("What was Virat Kohli's achievement in the Cup?",
              "../../Assets/Data" )

The source code can be found in the GitHub Repository of A Simple Guide to Retrieval Augmented Generation

https://github.com/abhinav-kimothi/A-Simple-Guide-to-RAG



Refresher: Why do RAG failures occur and how to evaluate RAG?

We have been discussing a very basic implementation of RAG, commonly termed Naive RAG. Naive RAG can be marred by inaccuracies. It can be inefficient in retrieving and ranking information correctly. The LLM can ignore the retrieved information and still hallucinate. Building a PoC RAG pipeline is not overly complex. It is achievable through brief training and verification on a limited set of examples. However, to enhance its robustness, thorough testing on a dataset that accurately mirrors the production use case is imperative. RAG pipelines can suffer from hallucinations of their own. This can be because –

  • The retriever fails to retrieve the entire context or, worse, retrieves irrelevant context
  • The LLM, despite being provided the context, does not consider it
  • The LLM instead of answering the query picks irrelevant information from the context
RAG systems can suffer hallucinations of their own (Source: Image by Author)

Before addressing these points of failure, it is important to assess them. After all, you can't improve what you don't measure. There are three critical enablers of RAG evaluation – Frameworks, Benchmarks & Metrics.

The most commonly used metrics are –

  1. Context Relevance: This dimension evaluates how relevant the retrieved information or context is to the user query. It calculates metrics like the precision and recall with which context is retrieved from the knowledge base.
  2. Answer Faithfulness (also called groundedness): This dimension evaluates if the answer generated by the system is using the retrieved information or not.
  3. Answer Relevance: This dimension evaluates how relevant the answer generated by the system is to the original user query.
The triad of RAG evaluation (Source: Inspired from https://truera.com/ai-quality-education/generative-ai-rags/what-is-the-rag-triad/)

Retrieval Augmented Generation Assessment, or RAGAs, is a framework developed by Exploding Gradients that assesses the retrieval and generation components of RAG systems without relying on extensive human annotations. A minimal evaluation sketch follows the list below. RAGAs helps to –

  • Synthetically generate a test dataset that can be used to evaluate a RAG pipeline
  • Use metrics to measure the performance of the pipeline
  • Monitor the quality of the application in production
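
As a quick illustration, here is a minimal evaluation sketch using RAGAs on a tiny hand-built test row. It assumes a ragas 0.1-style API; metric and column names and the evaluate() signature may differ across versions, so treat it as a starting point rather than a definitive recipe.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
)

# A tiny hand-built test set: question, generated answer, retrieved contexts
# and a reference (ground truth) answer
eval_data = {
    "question": ["Who won the 2024 T20 World Cup?"],
    "answer": ["India won the 2024 ICC Men's T20 World Cup."],
    "contexts": [[
        "India beat South Africa in the final to win the 2024 ICC Men's T20 World Cup."
    ]],
    "ground_truth": ["India won the 2024 ICC Men's T20 World Cup."],
}
dataset = Dataset.from_dict(eval_data)

# Compute retrieval and generation metrics over the test set
results = evaluate(
    dataset,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)
print(results)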

To dive deeper into RAG evaluations, please take a look at this exhaustive blog –

Stop Guessing and Measure Your RAG System to Drive Real Improvements


Naive RAG and its shortcomings

Basic or Naive RAG is a retrieve-then-read framework, i.e. the retriever first searches and fetches information and then the LLM reads this information. This implies a linear process of indexing, retrieval, augmentation and generation. Each of these stages is plagued with inefficiencies.

  • Retrieval – Low precision and low recall are observed when the retriever either fails to fetch all the relevant information and/or fetches incorrect information.
Naive RAG can result in an undesirable Low Precision Low Recall scenario (Source: Image by Author)
  • Augmentation – When information is sourced from multiple chunks, it can be disjointed thus losing meaning and confusing the LLM. Also, the possibility of redundancy and repetition in the retrieved information is quite high
  • Generation – As a result of sub-optimal retrieval and augmentation, the LLM can get overwhelmed and confused by the information. It is also observed that LLMs can become over-reliant on the retrieved information and forget to use their inherent parametric knowledge.
Naive RAG is a Retrieve then Read framework (Source: Image by Author)

Now that we have an overview of what RAG is, what its limitations are and how one evaluates RAG, let's move on to the core focus of this blog – how can RAG pipelines be improved?


Advanced RAG Techniques

In the last few years, a lot of research and experimentation has been done to address these drawbacks. Early approaches involved pre-training language models. Techniques involving fine-tuning of the LLMs, embeddings models and retrievers have also been tried. These techniques require training data and re-computation of model weights, generally, using supervised learning techniques. While these approaches are beneficial, they require significant effort and resources. In this blog, we will focus on techniques that can be applied at different stages of existing Naive RAG pipelines. We'll discuss them in three stages –

  1. Pre-retrieval Stage – Before the retriever comes into play, the knowledge base and the user query can be optimised to set the system up for a more effective retrieval
  2. Retrieval Stage – Hybrid and repetitive retrieval strategies improve the recall and precision of the system.
  3. Post-retrieval Stage – The retrieved information is further optimised so that the LLM can consume it to generate better quality outputs.

Advanced RAG improves upon the 'Retrieve-Read' framework of Naive RAG by introducing two more steps – rewrite and rerank. Rewrite implies rewriting the user query or the information in the knowledge base. Rerank implies optimisation of the retrieved information by further checking its similarity to the user query. Advanced RAG thus becomes a Rewrite-Retrieve-Rerank-Read framework.

Rewrite-Retrieve-Rerank-Read framework of Advanced RAG (Source: Image by Author)

Pre-retrieval techniques

The key outcome we are driving in the pre-retrieval stage is to set the system up for better retrieval. This is before the retriever is invoked. Let us understand why this is necessary. There are two reasons for this –

  1. The knowledge base is not correctly indexed – meaning that the retriever is not able to correctly search the information because of the way the information is stored in the knowledge base. To address this, a few index optimisation techniques are employed.
  2. The retriever doesn't understand the user query – this can either be because the query is too vague, too short or structured in a way that is not easy for the retriever to comprehend. This is addressed using some query optimisation techniques

Index Optimisation

There are three broad themes under which the knowledge base can be optimised –

a) Chunk Optimisation – where the chunks are tailored to the context

b) Metadata Enhancements – focusses on adding a metadata dimension to retrieval

c) Structural Improvements – optimises the way documents are stored in the knowledge base.

a) Chunk Optimisation

The role of chunking in RAG is somewhat similar to how we process information in real life. Once you've extracted and parsed text from the source, instead of committing it all to memory as a single element, you break it down into smaller chunks. This leads to better retrieval of information: if a chunk represents a single idea (or fact), it can be retrieved with more confidence than if there are multiple ideas (or facts) within the same chunk. It also leads to better generation. The retrieved chunk contains information focused on the user query and does not carry other text that may confuse the LLM. Therefore, the generation is more accurate and coherent.

Chunking Methods (Source: Image by Author)

Chunking therefore plays a crucial role in how the RAG system will perform. There are certain techniques that can optimise chunks. Some of them are –

  • Chunk size optimisation – The size of the chunks can have a significant impact on the quality of the RAG system. While large sized chunks provide better context, they also carry a lot of noise. Smaller chunks, on the other hand, have precise information but they might miss important information. For instance, consider a legal document that's 10,000 words long. If we chunk it into 1,000-word segments, each chunk might contain multiple legal clauses, making it hard to retrieve specific information. Conversely, chunking it into 200-word segments allows for more precise retrieval of individual clauses, but may lose the context provided by surrounding clauses. Experimenting with chunk sizes can help find the optimal balance for accurate retrieval.
  • Context-enriched chunking – This method adds a summary of the larger document to each chunk to enrich the context of the smaller chunk. This makes more context available to the LLM without adding too much noise. It also improves retrieval accuracy and maintains semantic coherence across chunks. This is particularly useful in scenarios where a more holistic view of the information is crucial. While this approach enhances the understanding of the broader context, it adds a level of complexity and comes at the cost of higher computational requirements, increased storage needs and possible latency in retrieval. A brief sketch follows after this list.
  • Agentic Chunking – In agentic chunking, chunks from the text are created based on a goal or a task. Consider an e-commerce platform wanting to analyse customer reviews. The best way for the reviews to be chunked is if the reviews pertaining to a particular topic are put in the same chunk. Similarly, the critical reviews and positive reviews may be put in different chunks. To achieve this kind of chunking, we will need to do sentiment analysis, entity extraction and some kind of clustering. This can be achieved by a multi-agent system. Agentic chunking is still an active area of research and improvement.
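
As promised above, here is a brief sketch of context-enriched chunking: an LLM-generated summary of the whole document is prepended to every chunk before embedding. The document_text and chunks variables are placeholders standing in for the output of the loading and splitting steps shown earlier, and the summarisation prompt is illustrative.

from openai import OpenAI
from langchain_core.documents import Document

client = OpenAI()

def summarise_document(document_text):
    # Ask the LLM for a short summary that will act as shared context
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Summarise the following document in 3-4 sentences:\n\n{document_text}"
        }],
    )
    return response.choices[0].message.content

def enrich_chunks(document_text, chunks):
    summary = summarise_document(document_text)
    # Prepend the document-level summary to every chunk before embedding
    return [
        Document(page_content=f"Document summary: {summary}\n\nChunk: {chunk}")
        for chunk in chunks
    ]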

More ideas like semantic chunking, small-to-big, sliding windows, etc., along with code, can be explored in depth in the blog below

Breaking It Down : Chunking Techniques for Better RAG


b) Metadata Enhancements

A common way of defining metadata is "data about data". Metadata describes other data. It can provide information like a description of the data, time of creation, author, etc. While metadata is useful for managing and organising data, in the context of RAG, metadata enhances the search-ability of data. A few ways in which metadata is crucial in improving RAG systems are –

  • Metadata filtering – Adding metadata like timestamp, author, category, etc. can enhance the chunks. While retrieving, chunks can first be filtered by relevant metadata before doing a similarity search. This improves retrieval efficiency and reduces noise in the system. For example, if a user searches for 'latest travel guidelines', filtering by timestamp ensures that only the most recent guidelines are retrieved, avoiding outdated information in the knowledge base.
  • Metadata enrichment – Timestamp, author, category, chapter, page number, etc. are common metadata elements that can be extracted from documents. However, even more valuable metadata items can be constructed, such as a summary of the chunk or tags extracted from the chunk. One particularly useful technique is Reverse Hypothetical Document Embeddings. It involves using a language model to generate potential queries that could be answered by each document or chunk. These synthetic queries are then added to the metadata. During retrieval, the system compares the user's query with these synthetic queries to find the most relevant chunks.

The code below uses GPT-4o-mini to extract metadata from the chunks and adds that to the embeddings. These metadata tags are then used for filtering.

import faiss
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document
from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.document_transformers import Html2TextTransformer
from langchain_text_splitters import RecursiveCharacterTextSplitter
from openai import OpenAI
import json

# Initialize the OpenAI client
client = OpenAI()

# Function to extract fixed metadata using GPT-4o-mini with JSON response
def extract_fixed_metadata_from_chunk(chunk_text):
    prompt = f"""
    Extract the following fixed metadata in JSON format from the given text:
    {{
      "player_1": "",
      "player_2": "",
      "player_3": "",
      "player_4": "",
      "player_5": "",
      "team_1": "",
      "team_2": "",
      "team_3": "",
      "team_4": "",
      "team_5": "",
      "keyword_1": "",
      "keyword_2": "",
      "keyword_3": "",
      "keyword_4": "",
      "keyword_5": ""
    }}
    Here's the text:
    {chunk_text}
    """

    # Call GPT-4o-mini to extract structured metadata in JSON format
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={ "type": "json_object" }
    )

    # Extract the response in JSON format
    metadata_response = response.choices[0].message.content
    print(metadata_response)
    try:
        # Parse the JSON response into a dictionary
        metadata = json.loads(metadata_response)
    except Exception as e:
        print(f"Error parsing metadata: {e}")
        metadata = {
            "player_1": "", "player_2": "", "player_3": "", "player_4": "", "player_5": "",
            "team_1": "", "team_2": "", "team_3": "", "team_4": "", "team_5": "",
            "keyword_1": "", "keyword_2": "", "keyword_3": "", "keyword_4": "", "keyword_5": ""
        }
    return metadata

# Step 1: Load data from a URL (Wikipedia page)
url = "https://en.wikipedia.org/wiki/2024_ICC_Men%27s_T20_World_Cup"
loader = AsyncHtmlLoader(url)
data = loader.load()

# Step 2: Transform the HTML content to plain text
html2text = Html2TextTransformer()
data_transformed = html2text.transform_documents(data)

# Step 3: Split the text into smaller chunks using RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=10000,  # Number of characters in each chunk
    chunk_overlap=200  # Number of overlapping characters between chunks
)
chunks = text_splitter.split_text(data_transformed[0].page_content)

# Step 4: Initialize OpenAI Embeddings model
embedding_model = OpenAIEmbeddings(model="text-embedding-ada-002")

# Step 5: Initialize FAISS index for L2 (Euclidean) distance
embedding_dim = len(embedding_model.embed_query("hello world"))
index = faiss.IndexFlatL2(embedding_dim)

# Step 6: Initialize the InMemoryDocstore to store documents and metadata in memory
docstore = InMemoryDocstore()

# Step 7: Create FAISS vector store using the embedding function, FAISS index, and docstore
vector_store = FAISS(
    embedding_function=embedding_model,
    index=index,
    docstore=docstore,
    index_to_docstore_id={}
)

# Step 8: Add chunks (documents) with extracted metadata and embeddings to FAISS vector store
documents = []
for i, chunk in enumerate(chunks):
    # Extract fixed metadata using the LLM
    extracted_metadata = extract_fixed_metadata_from_chunk(chunk)

    # Create a document object with both the chunk content and the extracted metadata
    document = Document(
        page_content=chunk, 
        metadata={
            "source": url, 
            "category": "cricket world cup",
            "extracted_metadata": extracted_metadata  # Store the structured metadata
        }
    )

    # Append the document to the list
    documents.append(document)

# Create unique IDs for each chunk
ids = [f"chunk_{i}" for i in range(len(chunks))]

# Add the documents and their embeddings to the FAISS vector store
vector_store.add_documents(documents=documents, ids=ids)

# Step 9: Define a function to extract metadata from a query
def extract_fixed_metadata_from_query(query_text):
    prompt = f"""
    Extract the following fixed metadata in JSON format from the query:
    {{
      "player_1": "",
      "player_2": "",
      "player_3": "",
      "player_4": "",
      "player_5": "",
      "team_1": "",
      "team_2": "",
      "team_3": "",
      "team_4": "",
      "team_5": "",
      "keyword_1": "",
      "keyword_2": "",
      "keyword_3": "",
      "keyword_4": "",
      "keyword_5": ""
    }}
    Here's the query:
    {query_text}
    """

    # Call GPT-4o-mini to extract structured metadata from the query
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={ "type": "json_object" }
    )

    # Extract the response in JSON format
    metadata_response = response.choices[0].message.content
    try:
        # Parse the JSON response into a dictionary
        metadata = json.loads(metadata_response)
    except Exception as e:
        print(f"Error parsing metadata: {e}")
        metadata = {
            "player_1": "", "player_2": "", "player_3": "", "player_4": "", "player_5": "",
            "team_1": "", "team_2": "", "team_3": "", "team_4": "", "team_5": "",
            "keyword_1": "", "keyword_2": "", "keyword_3": "", "keyword_4": "", "keyword_5": ""
        }
    return metadata

# Step 10: Extract metadata from the query
query = "Virat Kohli records in 2024 World Cup"
query_metadata = extract_fixed_metadata_from_query(query)

# Step 11: Define a metadata filter based on the query's extracted metadata
def metadata_filter(doc_metadata):
    query_players = {query_metadata[f"player_{i}"] for i in range(1, 6) if query_metadata[f"player_{i}"]}
    query_teams = {query_metadata[f"team_{i}"] for i in range(1, 6) if query_metadata[f"team_{i}"]}
    query_keywords = {query_metadata[f"keyword_{i}"] for i in range(1, 6) if query_metadata[f"keyword_{i}"]}
    doc_players = {doc_metadata["extracted_metadata"][f"player_{i}"] for i in range(1, 6) if doc_metadata["extracted_metadata"][f"player_{i}"]}
    doc_teams = {doc_metadata["extracted_metadata"][f"team_{i}"] for i in range(1, 6) if doc_metadata["extracted_metadata"][f"team_{i}"]}
    doc_keywords = {doc_metadata["extracted_metadata"][f"keyword_{i}"] for i in range(1, 6) if doc_metadata["extracted_metadata"][f"keyword_{i}"]}

    # Check if there's any overlap between the query metadata and document metadata
    return bool(query_players & doc_players or query_teams & doc_teams or query_keywords & doc_keywords)

# Step 12: Perform a similarity search on the stored chunks with the metadata filter
results = vector_store.similarity_search(query=query, k=3, filter=metadata_filter)

# Step 13: Display the results with metadata
for doc in results:
    print(f"Document: {doc.page_content}")
    print(f"Metadata: {doc.metadata}")

Metadata is a great tool in your repertoire for improving the accuracy of the retrieval system. However, a degree of caution must be exercised while adding metadata to the chunks. Designing the metadata schema carefully is important to avoid redundancies and to manage processing and storage costs. By improving relevance and accuracy, metadata enhancement has become extremely popular in contemporary RAG systems.

c) Structural Improvements

The way information is structured is another important aspect of the knowledge base. In a naive approach, there's no particular structure to the documents stored in the vector store. To improve the search-ability of the knowledge base, structures are introduced. These can be:

  • Parent-child Indexing – In a parent-child document structure, documents are organised hierarchically. The parent document contains overarching themes or summaries, while child documents delve into specific details. During retrieval, the system can first locate the most relevant child documents and then refer to the parent documents for additional context if needed. This approach enhances the precision of retrieval while maintaining the broader context. At the same time, this hierarchical structure can present challenges in terms of memory requirements and computational load. A sketch using LangChain's ParentDocumentRetriever follows below.
  • Knowledge Graphs – Knowledge graphs, or KGs, have revolutionised the storage of interconnected data in a structured format. KGs store data as entities and the relationships amongst those entities. A graph index structure increases contextual understanding and enhances reasoning by establishing second- and third-degree connections. Microsoft GraphRAG is an open-source framework that facilitates the creation of KGs and graph communities, which enhances contextual retrieval.
Advanced Indexing Pipeline (Source: Image by Author)
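
Below is a minimal sketch of parent-child indexing using LangChain's ParentDocumentRetriever. The chunk sizes are illustrative, documents is assumed to be the list of loaded Document objects from the indexing pipeline earlier, and the retriever API (invoke) assumes a recent LangChain version.

import faiss
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Small child chunks are embedded for precise retrieval...
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
# ...while larger parent chunks preserve the surrounding context
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)

# Empty FAISS index to hold the child-chunk embeddings
index = faiss.IndexFlatL2(len(embeddings.embed_query("hello world")))
vectorstore = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)
parent_store = InMemoryStore()  # parent documents live here

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=parent_store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Index the documents: children go to the vector store, parents to the docstore
retriever.add_documents(documents)

# Retrieval matches on child chunks but returns the parent documents
parent_docs = retriever.invoke("What was Virat Kohli's achievement in the Cup?")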

Query Optimisation

Now that the index is optimised, let's shift focus to the other input to the generation pipeline – the user query. Query optimisation can also be looked at in three ways –

  1. Query expansion which expands the original query to cover broader context
  2. Query transformation which changes the user query to make it more meaningful
  3. Query routing which chooses different retrieval methods depending on the type of query being asked – a minimal routing sketch follows right after this list
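
Since query routing does not get a dedicated subsection below, here is a minimal sketch of the idea: an LLM classifies the query into one of a few route labels and the pipeline dispatches it to a matching retrieval method. The route labels and the vector_search, fetch_full_documents and hybrid_search functions are hypothetical placeholders.

from openai import OpenAI

client = OpenAI()

ROUTES = ["factual_lookup", "summarisation", "comparison"]

def route_query(query):
    prompt = (
        f"Classify the following query into exactly one of these categories: {ROUTES}.\n"
        f"Query: {query}\n"
        "Respond with the category name only."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    label = response.choices[0].message.content.strip()
    return label if label in ROUTES else "factual_lookup"  # fall back to a default route

def retrieve_with_routing(query):
    route = route_query(query)
    if route == "factual_lookup":
        return vector_search(query)        # e.g. dense similarity search (placeholder)
    elif route == "summarisation":
        return fetch_full_documents(query) # e.g. return whole documents (placeholder)
    else:
        return hybrid_search(query)        # e.g. combine sparse and dense retrieval (placeholder)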

Query Expansion

Sometimes the response users are looking for is beyond what is being directly asked in the query – especially when looking for long-form generations like blogs and articles. It is also possible that the users themselves don't have clarity on the details they need. In such cases, for a comprehensive response from a RAG system, we can benefit by expanding the query. Though there are many ways of doing this, we will illustrate three popular ones –

  • Multi-query expansion: In this approach, an LLM is asked to generate variations of the original query and each variant is used to search and retrieve chunks.
  • Sub-query expansion: Almost like multi-query expansion, but instead of generating variations, this approach asks an LLM to break a complex original query into simpler queries. This approach is inspired by the least-to-most prompting technique, where complex problems are broken down into simpler subproblems and solved one by one.
  • Step-back expansion: This approach draws inspiration from the step-back prompting technique. Here, the original query is supplemented with an abstracted, higher-level conceptual query.
Query Expansion Techniques (Source: Image by Author)

The code example below illustrates expansion of the query – "How does climate change affect polar bears?"

original_query="How does climate change affect polar bears?"
num=5

response_structure='''
{
    "queries": [
        {
            "query": "query"
        },
        ...
    ]
}
'''

expansion_prompt = (
    f"Generate {num} variations of the following query: {original_query}. "
    f"Respond in JSON format. Stick to this structure:\n{response_structure}"
)

step_back_expansion_prompt = (
    f"Given the query: '{original_query}', generate a more abstract, "
    "higher-level conceptual query."
)

sub_query_expansion_prompt = (
    f"Break down the following query into {num} sub-queries targeting "
    f"different aspects of the query: '{original_query}'. Respond in JSON format."
)

# Importing the OpenAI library
from openai import OpenAI

# Instantiate the OpenAI client
client = OpenAI()

# Make the API call passing the multi-query expansion prompt to the LLM
response = client.chat.completions.create(
  model="gpt-4o-mini",
  messages= [
    {"role": "user", "content": expansion_prompt}
    ],
          response_format={ "type": "json_object" }
)

# Extract the answer from the response object
multi_queries=response.choices[0].message.content

# Make the API call passing the step-back expansion prompt to the LLM
# (no JSON response format here, since a plain query string is expected)
response = client.chat.completions.create(
  model="gpt-4o-mini",
  messages= [
    {"role": "user", "content": step_back_expansion_prompt}
    ]
)

# Extract the answer from the response object
step_back_query=response.choices[0].message.content

# Make the API call passing the sub-query expansion prompt to the LLM
response = client.chat.completions.create(
  model="gpt-4o-mini",
  messages= [
    {"role": "user", "content": sub_query_expansion_prompt}
    ],
          response_format={ "type": "json_object" }
)

# Extract the answer from the response object
sub_query=response.choices[0].message.content

Query Transformation

Query transformation approaches differ from query expansion in that they create an entirely new query instead of just expanding the original one. The original query may be discarded in the process. Query transformation aligns the query with the retriever. Two such approaches are –

  • Query rewriting: Sometimes user queries in consumer-facing apps may not directly be searchable in the documents. For example, a query like "I can't send emails from my phone" may need to be rewritten as "Show me troubleshooting steps for resolving email sending issues on smartphones" to align it better with the retriever.
  • Hypothetical Document Embeddings (HyDE): At the end of 2022, prominent NLP researcher Luyu Gao and others proposed HyDE in their paper Precise Zero-Shot Dense Retrieval without Relevance Labels. In this technique, the LLM first generates an answer without retrieving from the knowledge base (hence the term hypothetical). The generated answer is then used for retrieval instead of the original query, the idea being that a hypothetical answer will have more relevant terms to search with than the query alone. The code below generates the hypothetical answer that is further used for retrieval.
# Importing the OpenAI library
from openai import OpenAI

# Instantiate the OpenAI client
client = OpenAI()

original_query="How does climate change affect polar bears?"
system_prompt="You are an expert in climate change and arctic life."
hyde_prompt=f"Generate an answer to the question: {original_query}"
# Make the API call passing the augmented prompt to the LLM
response = client.chat.completions.create(
  model="gpt-4o-mini",
  messages= [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": hyde_prompt}
  ]
)

# Extract the answer from the response object
answer=response.choices[0].message.content
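
To complete the HyDE loop, the hypothetical answer (rather than the original query) is then embedded and used to search the knowledge base. A two-line sketch, assuming db is a FAISS store like the one built in the indexing pipeline earlier:

# Search the knowledge base with the hypothetical answer instead of the query
hyde_docs = db.similarity_search(answer, k=4)
retrieved_context = "\n\n".join(doc.page_content for doc in hyde_docs)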

There are other query optimisation techniques. The applicability of any technique is dependent on the nature of content and the use case.


Advanced Retrieval Strategies

Optimisation of the query and the knowledge base can achieve significant gains in RAG performance because of the alignment it brings with the retriever. A variety of retrievers can be used, with dense contextual embeddings being the most prevalent in RAG applications. However, multiple other techniques are employed depending on the use case at hand.

Various Retrieval Techniques (Source: Image by Author)

Beyond the choice of retrieval technique, various strategies can be employed at the retrieval stage to improve performance.

Hybrid Retrieval

Hybrid retrieval has almost become the minimum requirement for production RAG systems. The idea is simple – it combines various retrieval techniques like BM25, TF-IDF, dense embeddings and graph search, and then, through a union or an intersection of the results, determines the final list of retrieved documents. A sketch using LangChain's EnsembleRetriever follows the figure below.

Hybrid retrieval combining the results of sparse, dense and graph search (Source: Image by Author)
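
As mentioned above, here is a minimal hybrid retrieval sketch that fuses BM25 (sparse) results with the FAISS (dense) index using LangChain's EnsembleRetriever. chunks and db are assumed to be the chunk list and FAISS store from the indexing pipeline, the weights are illustrative, and the retriever API assumes a recent LangChain version.

from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Sparse, keyword-based retriever over the same chunks
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 4

# Dense retriever backed by the FAISS knowledge base
dense_retriever = db.as_retriever(search_kwargs={"k": 4})

# Fuse the two ranked lists with a weighted reciprocal-rank combination
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.4, 0.6],
)

hybrid_docs = hybrid_retriever.invoke("Who won the 2024 T20 World Cup final?")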

Repeated Retrieval

So far, we've discussed retrieval as a linear process without any evaluation or analysis of the retrieved documents. In a repeated retrieval strategy, retrieval happens over multiple cycles. Three popular flavours are iterative retrieval, where retrieval and generation alternate until enough information has been gathered; recursive retrieval, where the query is progressively refined based on previous results; and adaptive retrieval, where the model itself decides when and what to retrieve.

Iterative, Recursive and Adaptive retrieval incorporate repeated retrieval cycles. (Source – Adapted from Retrieval-Augmented Generation for Large Language Models: A Survey, Gao et al)
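
As a rough sketch of the iterative flavour, the loop below retrieves, asks the LLM whether the accumulated context is sufficient, and retrieves again with a refined query if it is not. The prompts, the loop budget and the reuse of the FAISS store db from earlier are illustrative assumptions.

from openai import OpenAI

client = OpenAI()

def iterative_rag(query, max_rounds=3):
    search_query = query
    context = ""
    for _ in range(max_rounds):
        # Retrieve with the current (possibly refined) search query
        docs = db.similarity_search(search_query, k=3)
        context += "\n\n".join(doc.page_content for doc in docs)

        check_prompt = (
            f"Question: {query}\n\nContext so far:\n{context}\n\n"
            "If the context is sufficient to answer the question, reply with 'ENOUGH'. "
            "Otherwise reply with a single refined search query that would fill the gap."
        )
        verdict = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": check_prompt}],
        ).choices[0].message.content.strip()

        if verdict.upper().startswith("ENOUGH"):
            break
        search_query = verdict  # retrieve again with the refined query

    final_prompt = (
        f"Answer the question using only this context.\n\n"
        f"Question: {query}\n\nContext:\n{context}"
    )
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": final_prompt}],
    ).choices[0].message.content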

Post-retrieval techniques

Once the retrieval is complete, the final step is to present the augmented prompt to the LLM for generation. However, as we discussed above, a naive approach may lead to the model becoming overwhelmed and confused. Therefore, advanced post-retrieval techniques are employed to remove redundancy and repetition in the retrieved information and to direct focus to its most relevant parts.

Contextual Compression

An intuitive approach to reducing the noise in the retrieved information is to compress it. Compress here means to reduce the length of the retrieved information by extracting only the parts that are relevant and important to the query. This also has a positive effect on the cost and efficiency of the system. COCOM is a context compression method which compresses contexts into a small number of context embeddings. Similarly, xRAG is a method that uses document embeddings as features. Below is a very basic approach to compression using an LLM.

compress_prompt = (
    f"We have to answer the following question:\n Question: {query}\n"
    "Compress the following document into very short sentences, "
    "retaining only the extremely essential information:"
    f"\n\n{document_to_compress}"
)
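
LangChain packages the same idea as a ContextualCompressionRetriever, where an LLM-based extractor trims each retrieved document down to the passages relevant to the query. A sketch, assuming the FAISS store db built in the indexing pipeline and a recent LangChain version:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=db.as_retriever(search_kwargs={"k": 5}),
)

# Each returned document is compressed to only the query-relevant passages
compressed_docs = compression_retriever.invoke(
    "What was Virat Kohli's achievement in the Cup?"
)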

Reranking

Retrieved information from different sources and techniques can further be ranked to determine the most relevant documents. Reranking, like hybrid retrieval, is increasingly becoming a necessity in production RAG systems. To this end, commonly available rerankers – multi-vector, Learning to Rank (LTR), BERT-based and even hybrid rerankers – are gaining prominence. Specialised APIs like Cohere Rerank offer pre-trained models for efficient reranking integration.
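
As an illustration, the sketch below reranks first-stage results with a BERT-based cross-encoder from the sentence-transformers library. retrieved_docs is assumed to be the candidate list returned by any of the retrievers above, and the checkpoint name is a commonly used public model.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Who won the 2024 T20 World Cup final?"
pairs = [(query, doc.page_content) for doc in retrieved_docs]

# Higher score means the chunk is more relevant to the query
scores = reranker.predict(pairs)
reranked = [doc for _, doc in sorted(zip(scores, retrieved_docs),
                                     key=lambda pair: pair[0], reverse=True)]

top_docs = reranked[:3]  # keep only the best few chunks for the prompt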



An advanced RAG pipeline (Source: Image by Author)

The list of advanced optimisations and improvements is endless. Depending on the requirements of the use case, the nature of the content and the user queries, several custom methods are also employed. You should be open to experimenting with your RAG pipelines, as the results of some of these techniques can pleasantly surprise you. But it is also important to understand the implications of these advanced techniques and the costs associated with them.


Trade-offs and other considerations

There are a few important factors to consider while bringing in these advanced interventions:

  • Performance vs. Complexity Trade-off – While improving performance, each of these advanced techniques adds a layer of complexity to the system.
  • Customisation for Specific Use Cases – One system does not fit all use cases and queries, and hence, it is important to engineer the pipeline to the expected queries and outcomes.
  • Balancing Recall and Precision – While retrieving more relevant information is a common theme of these approaches, the choice between being precise with the retrieved information and being comprehensive always needs to be made.
  • Scalability and Modular Approaches – Modularisation of the architecture should always be employed so that components can be scaled, updated or swapped with the evolving nature of the requirements.
  • Reducing Noise and Managing Information Overload – Retrieval of more information also adds noise to the system, and post-retrieval techniques manage this noise. However, you run the risk of over-compression, which can undo the improvements of the previous steps.
  • Costs of Enhanced Accuracy – Each additional step will introduce latency into the system, and steps where LLMs or other models have to be employed will increase the computational load on the system.

The deployment of advanced techniques should not be a carpet-bombing exercise; experimentation should dictate which techniques are put into production.


We've discussed quite a lot of things in this blog.

  • LLMs have limitations and RAG is a novel idea that addresses these limitations to make LLMs usable in many real world scenarios.
  • To create RAG systems we need a pipeline to manage the knowledge base and another to interact with the user.
  • A basic RAG system, also called Naive RAG, is marred by inefficiencies.
  • The advanced RAG approach improves this basic Retrieve-Read framework into a Rewrite-Retrieve-Rerank-Read approach.
  • Pre-retrieval interventions make the system align better with the retriever and involve optimisation of the knowledge base and the user query.
  • Retrieval strategies include hybrid and repeated retrieval approaches.
  • In the post retrieval stage, a reduction of noise and better alignment of the retrieved information to the LLM is achieved.
  • Advanced techniques bring an improvement in the RAG performance but come with trade-offs. Which techniques to employ depends on the use case and needs experimentation to understand the gains and the costs.

I hope you had fun reading this blog and found it useful. This blog is inspired by chapter 6 of my book A Simple Guide to Retrieval Augmented Generation.

A Simple Guide to Retrieval Augmented Generation

What do you think about improving RAG performance? Is RAG a technique that you use in your work? What do you find challenging? Are there any other techniques that you have found useful? I will be highly obliged if you let me know in the comments.

If you like this story, please clap, comment and share with your network


My name is Abhinav, and I talk about Data Science, Machine Learning, and AI. If our interests align, I'd love to stay connected on LinkedIn, Twitter, Instagram, and Medium.


Read my other blogs –

Stop Guessing and Measure Your RAG System to Drive Real Improvements

Breaking It Down : Chunking Techniques for Better RAG

Generative AI Terminology – An evolving taxonomy to get you started

Getting the Most from LLMs: Building a Knowledge Brain for Retrieval Augmented Generation
