Build More Capable LLMs with Retrieval Augmented Generation

The Limitations of ChatGPT
ChatGPT is limited for many practical business use cases outside of code generation. These limitations arise from the cut-off of its training data and the model's propensity to hallucinate. At the time of writing, if you ask ChatGPT questions about events occurring after September 2021, you will probably receive a response like this:

This isn't helpful, so how can we go about rectifying it?
Option 1 – Train or fine-tune the model on up-to-date data.
Fine-tuning or training a model can be impractical and expensive. Putting the costs aside, the effort required to prepare the datasets is enough to rule this option out.
Option 2 – Use retrieval augmented generation (RAG) methods.
RAG methods enable us to give the large language model access to an up-to-date knowledge base. This is much cheaper than training a model from scratch or fine-tuning, and much easier to implement. In this article, I show you how to leverage RAG with your OpenAI model. We will put the model to the test by conducting a short analysis of its ability to answer questions about the Russia-Ukraine conflict of 2022 from a Wikipedia knowledge base.
Note: This topic, although sensitive, was chosen for the obvious reason that the current ChatGPT model has no knowledge of it.
Libraries & Pre-requisites
You will require an OpenAI API key; you can grab one directly from the OpenAI website or follow this tutorial. The framework used for RAG is Haystack by Deepset, which is open source and provides APIs for building applications on top of large language models. We also leverage the sentence transformers and transformers libraries from Hugging Face.
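As a rough guide, the environment can be set up along the following lines; the package names and extras here are assumptions, so check the Haystack documentation for your version:

```python
# Assumed installation (Haystack 1.x is published on PyPI as farm-haystack):
#   pip install "farm-haystack[faiss]" sentence-transformers transformers
import os

# The agent we build later calls the OpenAI API; reading the key from an
# environment variable keeps it out of the source code.
openai_api_key = os.environ["OPENAI_API_KEY"]
```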
Sentence Embeddings Help Models Interpret Text
Before getting into the technical details around building, we should briefly cover sentence embedding. Understanding this concept is key to gaining an intuition into how RAG methods work.
This may be a cliché, especially for those from a data science background, but models don't actually understand text; they only understand numbers. Much of language modelling is about formulating ways to effectively encode text as numbers, and currently, we do this with sentence embeddings.
Sentence embeddings are a way for us to represent sentences as dense vectors while preserving their semantic structure. Embeddings are learnt from a dense layer in a deep neural network, the structure of which can vary from network to network.
In much simpler terms, sentence embeddings can be thought of as numeric representations of our sentences that preserve information about their meaning. We can get our sentence embeddings from pre-trained models. Hugging Face provides open-source models for this via their sentence transformers library.
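To make this concrete, here is a minimal sketch using the sentence transformers library; the model, all-mpnet-base-v2, is the same one we use later for retrieval, and the example sentences are purely illustrative:

```python
from sentence_transformers import SentenceTransformer, util

# Load a pre-trained sentence embedding model from Hugging Face.
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

sentences = [
    "The international community imposed new sanctions.",
    "Several countries announced fresh sanctions.",
    "I had pasta for dinner last night.",
]

# Each sentence is encoded into a 768-dimensional dense vector.
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)

# Semantically similar sentences sit close together in the vector space.
print(util.cos_sim(embeddings[0], embeddings[1]))  # relatively high
print(util.cos_sim(embeddings[0], embeddings[2]))  # relatively low
```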
Pre-processing and Storage
Before we can build our RAG-enabled model, we need to pre-process and store our documents. Let's explore how we do this, but first take note of the architecture of this process to help with your understanding.
Note: the architectural diagram also applies to the extractive question answer pipeline we define in the next section.

Pre-processing our Documents
Haystack provides tools for easily pre-processing most types of text documents (.pdf, .txt, and .docx included). The pre-processing steps are simple: we read in our knowledge base using the convert_files_to_docs() function, which can automatically detect the file type and convert it to the format we need to work with downstream.
Haystack also provides a PreProcessor class that enables us to apply preprocessing steps to our documents. The steps you apply will depend very much on your specific application.
Note: the pre-processor will enable you to split a long document into a list of smaller documents, with the split length defined in sentences. For question-and-answer applications, a common approach is to have some overlap between consecutive splits; I have set this at roughly 30%.
Vector Store
We leverage FAISS, a library by Meta that enables efficient similarity search over our sentence embeddings. The importance of this will become more obvious in the coming sections.
The full pre-processing script, covering the document conversion, pre-processing, and FAISS vector store setup, is outlined below.
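Paths, split sizes, and index settings in this sketch are illustrative rather than prescriptive; it assumes Haystack 1.x with the FAISS extra installed and a local data/ folder containing the Wikipedia articles:

```python
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import PreProcessor
from haystack.utils import convert_files_to_docs

# Read in the knowledge base; file types (.txt, .pdf, .docx) are detected automatically.
docs = convert_files_to_docs(dir_path="data")

# Split long documents into smaller documents of 10 sentences each,
# with roughly 30% overlap between consecutive splits.
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    split_by="sentence",
    split_length=10,
    split_overlap=3,
    split_respect_sentence_boundary=False,  # only applies when splitting by word
)
processed_docs = preprocessor.process(docs)

# FAISS vector store; 768 matches the dimensionality of all-mpnet-base-v2 embeddings.
document_store = FAISSDocumentStore(
    faiss_index_factory_str="Flat",
    embedding_dim=768,
)
document_store.write_documents(processed_docs)
```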
Defining an Extractive Question-Answer Pipeline
The next step is to construct our extractive question-answer pipeline. The pipeline is defined by nodes that are connected as a directed acyclic graph (DAG) and run in sequence. In this case, the pipeline consists of two nodes: a retriever and a reader.
Retriever
Retrieval is the method used to find relevant information in a knowledge base based on the user's query. When defining our retriever node, we specify a sentence embedding model from the sentence transformers library; in this case, we use all-mpnet-base-v2, which produces 768-dimensional embeddings.
Once we have fully defined our retrieval node, we can compute and store the sentence embeddings in our FAISS vector store. The same sentence embedding model is used to generate sentence embeddings for the user's query.
Note: when selecting your sentence embedding model, there is a trade-off between computational efficiency and information loss. In general, sentence embedding models with higher dimensions capture more information but are less computationally efficient.
Remember, the ultimate purpose of the retriever node is to find the information that relates semantically to the query. This is done by performing a similarity search between the sentence embedding of the query and those of the documents in our vector store. The top-k most relevant documents are returned as the output of this node.
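A sketch of the retriever definition, reusing the document store from the pre-processing step (the top_k value is illustrative):

```python
from haystack.nodes import EmbeddingRetriever

# The same model embeds both the stored documents and, later, the user's query.
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-mpnet-base-v2",
    top_k=5,
)

# Compute and store sentence embeddings for every document in the FAISS vector store.
document_store.update_embeddings(retriever)
```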
Reader
The reader node is a transformer language model that has been fine-tuned on question-answering tasks. For our use case, we leverage roberta-base-squad2 as our reader model. The reader operates on the output from the retriever and the initial query from the user, returning the span of text that answers the query. The reader does this for each document in the retriever's output, allocating a confidence score in each case. The answers are ranked by their score, and the top-k answers are returned.
The full script for the pipeline is outlined below:
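This is a minimal sketch assuming the retriever defined above; the top_k values and example query are illustrative:

```python
from haystack.nodes import FARMReader
from haystack.pipelines import ExtractiveQAPipeline

# Reader: a model fine-tuned for extractive question answering (SQuAD 2.0).
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")

# Chain the retriever and reader into an extractive QA pipeline.
pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)

# Example query; top_k controls how many documents/answers each node returns.
result = pipeline.run(
    query="Who led the IAEA team that investigated the plant on 29 August 2022?",
    params={"Retriever": {"top_k": 5}, "Reader": {"top_k": 3}},
)
for answer in result["answers"]:
    print(answer.answer, answer.score)
```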
Leveraging LLM-Powered Agents
Now that we have pre-processed our documents and defined our pipeline, we can construct our agent. An agent is powered by a large language model, in our case, OpenAI's gpt-4 (or, alternatively, gpt-3.5-turbo).
The agent we are using works on the basis of zero-shot ReAct (Reason + Act) prompting. We prompt the large language model to return verbal reasoning traces and actions for a task. The agent can "act" on those reasoning traces using a set of tools we give it access to, and can then observe and reason about the tools' output, which informs its next action.
The animation below gives a simplified view of our ReAct agent at work.

Note: Research suggests that ReAct prompting can effectively reduce hallucinations by LLMs.
The script below shows how we construct our agent:
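This is a sketch assuming Haystack 1.x agents and the extractive QA pipeline defined earlier; the tool name document_qa matches the traces shown later, while the description wording and generation settings are illustrative:

```python
from haystack.agents import Agent, Tool
from haystack.nodes import PromptNode

# The agent is powered by an OpenAI model prompted in zero-shot ReAct style.
prompt_node = PromptNode(
    model_name_or_path="gpt-4",  # or "gpt-3.5-turbo"
    api_key=openai_api_key,
    max_length=512,
    stop_words=["Observation:"],
)

agent = Agent(prompt_node=prompt_node)

# Wrap the extractive QA pipeline as a tool. The description tells the
# LLM-powered agent how and when the tool should be used.
agent.add_tool(
    Tool(
        name="document_qa",
        pipeline_or_node=pipeline,
        description="Useful for answering questions about the 2022 Russian "
        "invasion of Ukraine using the Wikipedia knowledge base.",
        output_variable="answers",
    )
)

result = agent.run(query="How did the international community respond to "
                         "Russia's invasion of Ukraine in 2022?")
print(result)
```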
In this use case, the tool we have provided the agent is the extractive question-answer pipeline we defined earlier. Essentially, the agent interprets our query, uses our pipeline as a tool to find relevant responses in our knowledge base, reasons about whether these answer the query, and then either responds or performs the action again to get a better answer.
A Brief Note on Tools
It's important that we define our tools clearly. For our particular use case, we only have one tool, but we still need to provide a description of that tool so that the LLM-powered agent knows how and when to use it. This becomes even more important when an agent has access to more than one tool.
Putting it to the Test
So we have defined our RAG-enabled model; let's put it to the test by asking it trivia about the Russia-Ukraine conflict of 2022. I have chosen two questions, each of which can be answered with information from the Wikipedia knowledge base. As well as the responses, I provide the workflow trace for each agent; inspect these carefully to gain better insight into how ReAct prompting works with the agent.
How did the international community respond to Russia's invasion of Ukraine in 2022?
Response from the RAG enabled model:
The international community responded to Russia's invasion of Ukraine in 2022 with condemnation and the imposition of new sanctions. These sanctions were comprehensive, affecting individuals, banks, businesses, and various types of economic activity including monetary exchanges, bank transfers, exports, and imports. Some of the specific measures included prison sentences of up to 15 years.
Note: The prison sentences of up to 15 years appear to be a hallucination. The model has mistakenly presented this as a sanction measure.

Response from Google search:

Here's the trace tracking the agent's workflow:
Agent custom-at-query-time started with
{'query':
'How did the international community respond to
Russia's invasion of Ukraine in 2022?', 'params': None}
The 'transcript' parameter is missing from
the Agent's prompt template. All ReAct agents
that go through multiple steps to reach
a goal require this parameter. Please append
{transcript} to the end of the Agent's
prompt template to ensure its proper functioning.
A temporary prompt template with {transcript} appended
will be used for this run.
find out the details about the international
community's response to Russia's invasion of Ukraine
in 2022.
I don't have this information readily available,
so I will use the document_qa tool to
find it.
Tool: document_qa
Tool Input: "International community response to Russia's invasion of Ukraine
in 2022"
Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 13.05it/s]
Inferencing Samples: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:09<00:00, 9.98s/ Batches]
Observation: 308
Thought: The document_qa tool provided a response
code of 308, which is not informative.
I need to try again to get
the required information.
Tool: document_qa
Tool Input: "How did the international community respond to Russia's invasion
of Ukraine in 2022?"
Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 9.72it/s]
Inferencing Samples: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:07<00:00, 7.78s/ Batches]
Observation: condemnation
Thought: The document_qa tool responded with "condemnation".
This suggests that the international community condemned
Russia's invasion of Ukraine in 2022. However,
this is a very broad response and
doesn't provide specific details about the actions
taken by the international community. I need
to find more detailed information.
Tool: document_qa
Tool Input: "What specific actions did the international community take in response to Russia's invasion of Ukraine in 2022?"
Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 13.41it/s]
Inferencing Samples: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:07<00:00, 7.95s/ Batches]
Observation: new sanctions
Thought: The document_qa tool responded with "new
sanctions". This suggests that the international community
imposed new sanctions on Russia in response
to their invasion of Ukraine in 2022.
However, this still doesn't provide a comprehensive
picture of the international community's response. I
need to find more detailed information.
Tool: document_qa
Tool Input: "What were the new sanctions imposed by the international community on Russia in 2022?"
Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 12.66it/s]
Inferencing Samples: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.77s/ Batches]
Observation: prison sentences of up to 15 years
Thought: The document_qa tool responded with "prison
sentences of up to 15 years". This
suggests that part of the sanctions imposed
by the international community on Russia in
2022 included prison sentences of up to
15 years. However, this seems to be
a specific measure and may not represent
the full range of sanctions. I need
to find more detailed information.
Tool: document_qa
Tool Input: "What other sanctions were imposed by the international community on Russia in 2022?"
Observation: individuals, banks, businesses, monetary exchanges,
bank transfers, exports, and imports
Thought: The document_qa tool responded
with "individuals, banks, businesses, monetary exchanges, bank
transfers, exports, and imports". This suggests that
the sanctions imposed by the international community
on Russia in 2022 were comprehensive, affecting
individuals, banks, businesses, and various types of
economic activity including monetary exchanges, bank transfers,
exports, and imports.
Final Answer: The international community responded
to Russia's invasion of Ukraine in 2022
with condemnation and the imposition of new
sanctions. These sanctions were comprehensive, affecting individuals,
banks, businesses, and various types of economic
activity including monetary exchanges, bank transfers, exports,
and imports. Some of the specific measures
included prison sentences of up to 15 years.
Who led the IAEA team that investigated the plant on 29 August 2022?
Response from the RAG enabled model:
Rafael Grossi led the IAEA team that investigated the plant on 29 August 2022.

Response from Google search:

Here's the trace tracking the agent's workflow:
Agent custom-at-query-time started with {'query': 'Who
led the IAEA team that investigated the
plant on 29 August 2022?', 'params': None}
The 'transcript'
parameter is missing from the Agent's prompt
template. All ReAct agents that go through
multiple steps to reach a goal require
this parameter. Please append {transcript} to the
end of the Agent's prompt template to
ensure its proper functioning.
A temporary prompt template with {transcript} appended will
be used for this run.
find out what the IAEA is and
what plant was investigated on 29 August 2022.
Then, I need to find out who led
the IAEA team during this investigation.
I don't have this information at hand,
so I will use the document_qa tool
to find it.
Tool: document_qa
Tool Input: Who led the IAEA team that
investigated the plant on 29 August 2022?
Observation: Rafael Grossi
Thought: The document_qa tool has provided the name
Rafael Grossi in response to the query.
This suggests that Rafael Grossi led the
IAEA team that investigated the plant on
29 August 2022.
Final Answer: Rafael Grossi led the IAEA team
that investigated the plant on 29 August 2022.
For your own curiosity, have a look at the responses from ChatGPT to the same questions. You can even ask the questions yourself to confirm the responses.
Conclusion
Retrieval augmented generation (RAG) enables a large language model to connect with an existing knowledge base. RAG-enabled language models have access to up-to-date information, making them more useful across a variety of use cases.
The retriever and reader methods enable the models to query large corpora of text, overcoming the limited context windows of large language models on their own. Open-source frameworks like Haystack make it easy to quickly build a RAG-enabled LLM prototype.
Some key points to note here are that the performance of this method is only as good as the knowledge base provided. Also, the inference time can be sped up dramatically by deploying the model on suitable infrastructure.
For more complex queries, agents may get into situations where they cannot respond within the allotted number of steps. It would be interesting to see how the quality of responses varies as the number of steps is increased, or when a memory component is added to make the pipeline more conversational. Leveraging a more powerful sentence embedding model could also improve overall performance.
The code base is in the GitHub repo (including the front-end) and is available for you to experiment with.
There is also a YouTube demo of the project.
Thanks for reading.
If you're keen to enhance your skills in Artificial Intelligence, join the waiting list for my course, where I will guide you through the process of developing large language model powered applications.
If you're seeking AI-transformation for your business, book a discovery call today.
For more insights on artificial intelligence, data science, and large language models, you can subscribe to the YouTube channel.