Code understanding on your own hardware


Among the various tasks Large Language Models (LLMs) can perform today, code understanding may be of particular interest to you if you work with source code as a software developer or a data scientist. Wouldn't it be great to have a chatbot you can ask questions about your code? Where is the data preprocessing implemented? Is there already a function for verifying the user's authentication? What is the difference between the calculate_vector_dim and the calculate_vector_dimension function? Instead of searching for the correct file yourself, you just ask the bot and it gives you an answer, together with a pointer to the files that contain the relevant code snippets. This mechanism is called semantic search, and you can imagine how useful it is.

In this tutorial, I will show you how to implement a LangChain bot that does exactly that. In addition, I will focus on the data-privacy issue of not handing your code over to a third party. The code you or your company produces is private property and may contain sensitive information or valuable knowledge. You may not want to send it to an LLM hosted by another company, possibly located in a foreign country, or your company's policies may not allow you to. Hence, in this tutorial I will show you how to set up a code understanding bot that runs on your local hardware, so your code never leaves your infrastructure.

Let's get started! First, I will give you a brief introduction to the general process of semantic search before we implement a bot for code understanding.

Introduction to semantic search


First of all, let me briefly explain the general idea of semantic search. This approach consists of two main steps: the retrieval and the answer generation by the LLM itself. In the retrieval step, documents containing relevant information are selected, and these are fed into the LLM to create a natural language answer. For example, if you ask a question about a function called transform_vectors, the retrieval will select those files that are relevant for answering that question. That may include the file where the transform_vectors function is implemented, but also files using it or parts of the documentation mentioning it. In the second step, those files' content is given to the LLM in a prompt that may look somewhat like this:

"""Answer the question below given the context. 


...


Question: 
Answer:
"""

The LLM creates a natural language answer to the question using information from the documents given to it.

That is the main idea of semantic search. Now let's start implementing! First of all, we have to install our requirements and read in our data.

Install requirements

Before we can start, make sure you have a Python environment set up and install the following packages:

pip install langchain==0.0.191
pip install transformers
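
Depending on your environment, you may also need the packages backing the vector store, the embeddings, and the MPT model used later in this tutorial; this is an assumption about a fresh setup, so install only what you are missing:

pip install chromadb
pip install sentence-transformers
pip install torch einops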

Read in the documents

Now we need to read in the data and convert it into a format LangChain can work with. For this demonstration, I will download the code of LangChain itself, but you can use your own code base, of course:

import os

folder_name = "sample_code"
os.system(f"git clone https://github.com/hwchase17/langchain {folder_name}")

We load all files and convert each of them into a Document, i.e. each Document will contain exactly one file of the code base.

from langchain.docstore.document import Document

documents = []
for root, dirs, files in os.walk(folder_name):
    for file in files:
        try:
            with open(os.path.join(root, file), "r", encoding="utf-8") as o:
                code = o.readlines()
                d = Document(page_content="\n".join(code), metadata={"source": os.path.join(root, file)})
                documents.append(d)
        except UnicodeDecodeError:
            # some files are not utf-8 encoded; let's ignore them for now.
            pass
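
If you want a quick sanity check that the loading worked, you can print how many Documents were created; the exact number depends on the current state of the repository:

print(f"Loaded {len(documents)} documents")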

Retrieval


Now that we have created our Documents, we need to index them to make them searchable. To index a Document means to calculate a numerical vector that captures the most relevant information of the Document. Unlike plain text, a vector of numbers can be used for numerical calculations. That means we can easily compute a similarity between vectors, which is then used to determine which Documents are relevant for answering a given question.
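To make this idea concrete, here is a minimal sketch with made-up three-dimensional vectors; real embeddings have hundreds of dimensions and one common similarity measure is the cosine similarity, which works the same way regardless of dimension:

import numpy as np

def cosine_similarity(a, b):
    # 1 means the vectors point in the same direction, values near 0 mean they are unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

question = np.array([0.9, 0.1, 0.3])        # made-up vector for the question
relevant_doc = np.array([0.8, 0.2, 0.25])   # made-up vector for a relevant file
unrelated_doc = np.array([0.05, 0.9, 0.7])  # made-up vector for an unrelated file

print(cosine_similarity(question, relevant_doc))   # high score -> likely relevant
print(cosine_similarity(question, unrelated_doc))  # low score  -> likely irrelevant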

On a technical level, we will create this index with the help of an embedding and store it in a VectorStore. There are VectorStores available as a service (e.g. DeepLake), which come with some handy advantages, but in our scenario we don't want to hand the code over, so we create a VectorStore locally on our machine. The easiest way to do that is to use Chroma, which creates a VectorStore in memory and allows us to persist it.

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

hfemb = HuggingFaceEmbeddings(model_name="krlvi/sentence-t5-base-nlpl-code-x-glue")
persist_directory = "db"
db = Chroma.from_documents(documents, hfemb, persist_directory=persist_directory)
db.persist()

Within the from_documents function, the indices are calculated and stored in the Chroma database. Next time, instead of calling the from_documents function again, we can load the persisted Chroma database directly:

db = Chroma(persist_directory=persist_directory, embedding_function=hfemb)

As you saw above, as an embedding I used krlvi/sentence-t5-base-nlpl-code-x-glue, which was trained on code from open-source GitHub libraries. As you can imagine, it is crucial that the embedding we use has been trained on code (among other data), so it can make sense of the data we feed it. An embedding that was trained on natural language only will most likely perform worse.
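If you are curious what such an embedding looks like, you can embed a short snippet directly; embed_query returns a plain list of floats whose length is the embedding dimension (the snippet below is just an arbitrary example):

vector = hfemb.embed_query("def create_index(contexts): ...")
print(len(vector))  # dimensionality of the embedding vector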

Now that we have our VectorStore and our embedding, we can create the retriever from the Chroma database directly:

retriever = db.as_retriever()
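
Before wiring the retriever into a chain, you can also query it directly to see which files it would hand to the LLM; this is a handy way to debug the retrieval step (the question below is just an example):

docs = retriever.get_relevant_documents("Where is the KNNRetriever implemented?")
for doc in docs:
    print(doc.metadata["source"])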

LLM


The last component we need is an LLM. The easiest solution would be to use a hosted LLM, e.g. via the OpenAI interface. However, we don't want to send our code to such a hosted service. Instead, we will run an LLM on our own hardware. To do that, we use the HuggingFacePipeline, which allows us to use a model from HuggingFace within the LangChain framework.

from langchain import HuggingFacePipeline
import transformers

model_id = "mosaicml/mpt-7b-instruct"
# the model ships its own code on the HuggingFace Hub, hence trust_remote_code=True
config = transformers.AutoConfig.from_pretrained(model_id, trust_remote_code=True)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForCausalLM.from_pretrained(model_id, config=config, trust_remote_code=True)
# wrap model and tokenizer in a text-generation pipeline and expose it as a LangChain LLM
pipe = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=100)
llm = HuggingFacePipeline(pipeline=pipe)

As you can see, I used MosaicML's mpt-7b-instruct model, which only needs ~16 GB of GPU memory. I created an AutoModelForCausalLM, passed it into transformers.pipeline, and finally wrapped that pipeline in a HuggingFacePipeline. The HuggingFacePipeline implements the same interface as the typical LLM objects in LangChain, so you can use it exactly as you would use the OpenAI LLM interface, for example.
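
If you want to verify that the model runs on your hardware before plugging it into a chain, you can call the wrapped LLM directly with a simple prompt (any prompt will do, this one is just an example):

print(llm("What is semantic search? Answer in one sentence."))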

If you have multiple GPUs on your machine, you have to specify which one to use. In this case, I want to use the GPU with index 0:

config.init_device = "cuda:0"  # set this before calling from_pretrained so the weights are initialized on the GPU
model.to(device="cuda:0")      # move the model to the first GPU
pipe = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=100, device=0)

Some additional parameters I have set above can be explained as follows:

  • trust_remote_code: This has to be set to true to allow running a model that ships its own custom code on the HuggingFace Hub instead of relying solely on code built into the transformers library.
  • max_new_tokens: This defines the maximum number of tokens the model may produce in its answer. If this value is too low, the model's response may be cut off before it has actually answered the question.

Connect everything together


Now we have all the components we need and can combine them in a ConversationalRetrievalChain.

from langchain.chains import ConversationalRetrievalChain

qa_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, return_source_documents=True)

Eventually, we can query the chain to answer our questions. The result object will include a natural language answer and a list of source_documents that were consulted to arrive at that answer.

result = qa_chain({"question":"What is the return type of the create_index function in the KNNRetriever?", "chat_history":[]})
print(f"Answer: {result['answer']}")
print(f"Sources: {[x.metadata['source'] for x in result['source_documents']]}")

Here is the answer:

Answer:  The return type of the create_index function in the KNNRetriever is np.ndarray.
Sources: ['sample_code/langchain/retrievers/knn.py', 'sample_code/langchain/vectorstores/elastic_vector_search.py', 'sample_code/langchain/vectorstores/elastic_vector_search.py', 'sample_code/langchain/vectorstores/opensearch_vector_search.py']
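
Since the chain is conversational, you can also pass previous turns via chat_history, which in this LangChain version is a list of (question, answer) tuples. Here is a minimal sketch of a follow-up question (the wording of the follow-up is just an example):

chat_history = [(
    "What is the return type of the create_index function in the KNNRetriever?",
    result["answer"],
)]
followup = qa_chain({
    "question": "In which file is that function implemented?",
    "chat_history": chat_history,
})
print(f"Answer: {followup['answer']}")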

Summary

We're done! Well, kind of. With the code above, we are now able to ask questions about our source code. However, there are some steps you may want to adapt to your needs:

  • Use your own source code as Documents instead of LangChain's code.
  • Try a different embedding. If the embedding doesn't fit, the retriever cannot find the right documents, and in the end, the questions cannot be answered precisely.
  • Try a different model. There are bigger, more powerful models out there, but some may be too big to run on your hardware. You have to find the sweet spot where the model performs well enough and still runs on your machine in a satisfying way.
  • Try different ways of preprocessing the Documents to facilitate the retrieval step. A common example is to split them into chunks of equal length, as sketched right after this list.
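
As a sketch of that last point, you could split each file into overlapping chunks with LangChain's RecursiveCharacterTextSplitter before indexing. The chunk sizes and the separate persist directory below are just placeholder choices, not tuned values:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# split each file into overlapping chunks so the retriever can return
# smaller, more focused pieces of code instead of whole files
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunked_documents = splitter.split_documents(documents)

# index the chunks in a separate store so they don't mix with the full-file index
db = Chroma.from_documents(chunked_documents, hfemb, persist_directory="db_chunks")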

I'm sure there is much more to try out to obtain better performance. Just play around and adapt the bot to your needs.


Further reading

For more examples of code understanding with LangChain, take a look at their documentation.

On HuggingFace you can find models and embeddings that you can easily use in LangChain.

Like this article? Follow me to be notified of my future posts.
