The Research Agent: Addressing the Challenge of Answering Questions Based on a Large Text Corpus

Introduction to the problem
In 2021, I started working on the challenge of answering questions over a large corpus of text. In the era before pre-trained transformers, this was a tough problem to crack.
And, to my own frustration, I started my experiments with one of the most complex and intricate stories ever written, The Mahabharata. For those unfamiliar with the work, the Mahabharata is a collection of 18 books totalling about 1.8 million words. It is the longest poem ever written, with about 90,000 verses, roughly ten times the length of the Iliad and the Odyssey combined. But it is not only the length of the Mahabharata that is staggering; so is its breadth. Highly nonlinear and complex in its causes and effects, it has thousands of characters spanning seven generations, and not a single one of them is completely good or evil. It offers profound philosophical commentary on duty (Dharma), choices, and human existence, especially on conflicts between duties and choices between multiple wrongs. The Bhagavad Gita, a key philosophical text of Hinduism, is also part of the 6th book of the Mahabharata.
I compiled the Mahabharata text from multiple online sources into a clean dataset. However, I could not find a method to implement meaningful QA over the text.
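For context, compiling scattered online sources into one clean corpus mostly comes down to normalizing and concatenating text files. The sketch below is a minimal, hypothetical version of that cleanup step (the directory layout and file names are my own assumptions, not the author's actual pipeline):

```python
from pathlib import Path
import re


def clean_text(raw: str) -> str:
    """Normalize one source file: line endings, whitespace, blank-line runs."""
    text = re.sub(r"\r\n?", "\n", raw)      # unify Windows/Mac line endings
    text = re.sub(r"[ \t]+", " ", text)     # collapse horizontal whitespace
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse runs of blank lines
    return text.strip()


def compile_corpus(source_dir: str, out_file: str) -> None:
    """Concatenate all .txt sources in a directory into one cleaned corpus."""
    parts = [clean_text(p.read_text(encoding="utf-8"))
             for p in sorted(Path(source_dir).glob("*.txt"))]
    Path(out_file).write_text("\n\n".join(parts), encoding="utf-8")
```

In practice, each source also needs its own header/footer stripping (licence notices, page markers, and so on) before this generic pass.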
In less than two years, all that changed.
The rapid advancement of AI and large pre-trained transformers is changing the world of technology profoundly and fundamentally. And, like most techies these days, I am fascinated by it.
So, a few months ago, I returned to the problem with a naive grasp of the newborn art of prompt engineering, but this time with a general idea: to build an Autonomous Research Agent that can work with any complex knowledge base.
The Mahabharata is one of the most complex use cases. However, in every domain of knowledge, be it law, scientific research, education, or medicine, every project starts with deep research into the prior art. So the problem is worth solving.
The Research Agent
Here, I will discuss the design and implementation of an Autonomous AI Research Agent that solves the problem of multi-hop KBQA with deep reasoning capability. I will share the Git repo with an initial implementation of the research agent in a Python notebook. If you are interested only in that part, feel free to skip ahead to the Implementation section later in this article.
If you are interested in knowing more about AI Agents, 'Knowledge-Based Question Answering' (KBQA), the 'why', the 'what', and the design evolution of the AI Research Agent, then please read on.
Why?
The first question one may ask is: why not just use the ChatGPT interface and ask it questions directly? It has been trained on a humongous volume of Internet data generated up to 2021, so a text corpus like the Mahabharata is known to it.
That was my first approach. I asked ChatGPT several questions about the Mahabharata. I got good answers to some of them. However, most lacked rigour. And that is expected: GPT is trained on general datasets. It understands and interprets natural language very well, and it can reason well enough. However, it is not an expert in any specific domain. So, while it has some knowledge of the Mahabharata, it may not respond with deeply researched answers. At times, GPT may have no answer at all. In those cases, it either humbly refuses to answer the question or confidently makes one up (hallucination).
The second most obvious way to achieve KBQA is to use a Retrieval QA prompt. This is where LangChain starts being extremely useful.
Retrieval QA
For those unfamiliar with the LangChain library, it is one of the best ways to use LLMs like GPT in your code. Here is an implementation of KBQA using LangChain.
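Before the LangChain version, it helps to see the bare retrieve-then-read loop that Retrieval QA implements: split the corpus into chunks, retrieve the chunks most relevant to the question, and stuff them into a prompt for the LLM. The sketch below is a deliberately toy, framework-free version with a keyword-overlap retriever (all function names are my own, and a real system would use embeddings and a vector store instead):

```python
import re
from collections import Counter


def split_into_chunks(text: str, chunk_size: int = 50) -> list[str]:
    """Split a corpus into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


def score(question: str, chunk: str) -> int:
    """Toy relevance score: how often question words appear in the chunk."""
    q_words = set(re.findall(r"\w+", question.lower()))
    c_words = Counter(re.findall(r"\w+", chunk.lower()))
    return sum(c_words[w] for w in q_words)


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the top-k chunks ranked by the toy score."""
    return sorted(chunks, key=lambda c: score(question, c), reverse=True)[:k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Stuff the retrieved context into a QA prompt; an LLM answers this."""
    context = "\n---\n".join(context_chunks)
    return (f"Answer the question using only this context:\n{context}\n\n"
            f"Question: {question}\nAnswer:")
```

LangChain packages exactly this pattern (with proper embeddings, a vector store, and an LLM call) behind its retrieval QA chains, which is why the following code is so short.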