How to Talk to a PDF File Without Using Proprietary Models: CLI + Streamlit + Ollama

I have already read various articles on the internet about how the open source framework Streamlit can be used in combination with machine learning to quickly and easily create interesting interactive web applications. This is very useful for developing experimental applications without extensive front-end development. One article showed how to create a conversation chain using an OpenAI language model and then execute it. An instance of the chat model "gpt-3.5-turbo" was created, the parameter "temperature" was set to 0 so that the model responds deterministically, and finally a placeholder for the API key was added. The key is required to authenticate requests to the OpenAI API.
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, api_key="")
In the comments, I often saw readers asking how to deal with a particular error message and how it can be resolved.
RateLimitError: Error code: 429 – {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}
Error 429 indicates that the request sent to the OpenAI API has exceeded the current usage quota: the available API calls in a certain period or the general usage limit of the subscription have been reached. The error is easy to solve: you take out a paid subscription with the provider and thus increase your usage limit. This raised the question of why you couldn't simply use open source models that run locally and thus bypass the limitation without paying anything.
In this article, I will show how to use Streamlit to create an application to which PDF files can be uploaded in order to ask questions based on their content, which are answered by a locally integrated LLM. No limits apply and no costs are incurred when using the app. The response time (from input to output) is somewhat longer, depending on the system, but remains within reasonable limits. First, we take care of the LLM that we will use.
A KINGDOM FOR A LLAMA
We will use the open source language model Llama from Meta AI. As a representative of recent developments in the field of large language models, it is used within the app to understand and generate natural language. In order to use the LLM locally, we first need to install Ollama on our system. To do this, we go to the official Ollama website and download the open source platform. The system may need to be restarted afterward.

Once Ollama has been installed, we click on "Models" and select the "llama3.1" model in the overview that opens. Llama is based on the Transformer architecture, has been trained on large and diverse data sets, is available in different sizes and is ideally suited for the development of practical applications due to its openness and accessibility. In this article, the smallest size "8B" is used to ensure that the app also works on less powerful systems. Once the correct model has been selected, copy the command shown and execute it in the terminal.

ollama run llama3.1
Once the model has been downloaded, you can communicate with it via the terminal. Next, let's move on to setting up the app. In a nutshell, the process is as follows. The PDF file is uploaded and the text it contains is extracted. The extracted text is divided into smaller chunks that are stored in a vector store. The user enters a question. The question, i.e. the input, is prepared for the model by combining the question and the context. The LLM is queried and generates the answer.

PDF CHAT APP [REQUIRED LIBRARIES]
Various libraries are required for the application to function correctly, which are briefly described below. The execution of system commands from Python and communication with them is made possible by "subprocess". We need "streamlit" to create the web application. "PyPDF2" is used to read PDF documents. The splitting of texts into smaller sections is done by "langchain.text_splitter.RecursiveCharacterTextSplitter". The class "langchain_community.embeddings.SpacyEmbeddings" is used to generate text embeddings with the Spacy model. The vector store "langchain_community.vectorstores.FAISS" enables the efficient saving and retrieval of embeddings. For the definition of prompt templates for chat interactions, "langchain_core.prompts.ChatPromptTemplate" is used. Access to operating system functions is obtained via "os", "re" is used to recognize patterns in character strings, and "psutil" is used to monitor the app's memory consumption. Python itself must also be installed on the system. Depending on the operating system, the required installer can be downloaded from the official site. After installation, you can check whether it was successful using the terminal with the following command.
python --version
The required libraries are installed via the terminal with the following command:
pip install streamlit PyPDF2 langchain langchain-community spacy faiss-cpu psutil
"subprocess", "os" and "re" are built-in Python libraries and do not need to be installed separately. The Spacy language model, however, must be downloaded separately with the following command.
python -m spacy download en_core_web_sm
The list of dependencies for the app is as follows.
import subprocess
import streamlit as st
from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings.spacy_embeddings import SpacyEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.tools.retriever import create_retriever_tool
from langchain_core.prompts import ChatPromptTemplate
import os
import re
import psutil
Now that everything you need is in place, let's move on to setting up the app. The various components of the script are described below.
PDF CHAT APP [ENVIRONMENT CONFIGURATION]
To avoid problems when loading libraries, especially during parallel processing, it is necessary to set the environment variable "KMP_DUPLICATE_LIB_OK" to "TRUE". In the context of this article, this configuration is due to the use of FAISS [Facebook AI Similarity Search], which uses parallel computing operations when searching data sets.
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
PDF CHAT APP [PDF READING FUNCTION]
The _"pdfread()" function reads the entire text from a PDF file. Specifically, "PyPDF2" is used to extract the text. The text is then combined into a single character string "text", which is returned. The function is important in order to make the content of the PDF file available for further processing steps.
def pdf_read(pdf_doc):
    """Read the text from the uploaded PDF documents."""
    text = ""
    for pdf in pdf_doc:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            page_text = page.extract_text()
            if page_text:
                text += page_text
    return text
In principle, several PDF files can be uploaded at the same time, in which case they together form one context. If you want to analyze files individually, upload only one file at a time, then remove it and upload the next one. Uploading multiple files that are treated as separate contexts is implemented in a customized version of the app (see the customization section at the end of the article).
PDF CHAT APP [TEXT-CHUNKS FUNCTION]
The combined character string from the previous function is split into smaller text chunks in the next step using the "create_text_chunks()" function. The maximum number of characters per chunk ("chunk_size") is 1000 and the number of characters that may overlap between adjacent chunks ("chunk_overlap") is 200. This enables the app to query and process larger amounts of text more efficiently and prevents the input size of the model from being exceeded. The split also optimizes the search, as smaller, contextualized sections can be queried more accurately, which improves retrieval overall as well as the accuracy and processing speed of the model.
def create_text_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split a large block of text into smaller chunks."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    text_chunks = text_splitter.split_text(text)
    return text_chunks
PDF CHAT APP [TEXT-EMBEDDING]
An object for text embeddings (numerical representations of text) based on the Spacy model must be created to capture the meaning of the text uploaded as a PDF file. The embeddings are then used to vectorize the text chunks so that they can be stored in a vector store and used for semantic search.
embeddings = SpacyEmbeddings(model_name="en_core_web_sm")
PDF CHAT APP [VECTOR-STORE-FUNCTION]
The _"vectorstore()" function uses the aforementioned FAISS to store the embeddings of the text chunks. The vector store enables faster retrieval and searching of texts based on the existing embeddings. The vector store is saved locally so that it can be accessed later.

def vector_store(text_chunks):
    """Create a vector store for the text chunks and save it locally."""
    vector_store = FAISS.from_texts(text_chunks, embedding=embeddings)
    vector_store.save_local("faiss_db")
Vectors capture the meaning and context of text by translating sentences and words into a mathematically interpretable space. For example, the sentence "The weather is nice today" might be converted into a vector such as [0.25, -0.47, 0.19, …]. This makes it easier to carry out similarity searches.
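As a brief, hedged illustration of how these embeddings and the locally saved index can be used, the following sketch queries an embedding directly and reloads the FAISS index created by "vector_store()". The "allow_dangerous_deserialization" flag is required by newer LangChain versions when loading a locally pickled index; adjust this to your installed version.

from langchain_community.vectorstores import FAISS  # already imported above

# Inspect the embedding of a single sentence
vector = embeddings.embed_query("The weather is nice today")
print(len(vector), vector[:3])  # dimensionality and the first few components

# Reload the locally saved index and run a similarity search
db = FAISS.load_local("faiss_db", embeddings, allow_dangerous_deserialization=True)
results = db.similarity_search("What is the document about?", k=3)
for doc in results:
    print(doc.page_content[:100])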
PDF CHAT APP [CLI BASED LLAMA REQUEST]
The function _"query_llama_viacli()" enables communication with an external LLaMA model process via the command line. Input data is sent, the response is received, processed and any errors that occur are handled errors='ignore'. This function allows the LLM to be implemented in the application workflow, although it runs in a separate environment that is controlled via CLI (Command Line Interface). The advantage of CLIs as a command line interface is that they are platform-independent, which makes the app available on almost any operating system.
def query_llama_via_cli(input_text):
    """Query the Llama model via the CLI."""
    try:
        # Start the interactive process
        process = subprocess.Popen(
            ["ollama", "run", "llama3.1"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,          # Ensure that communication takes place as text (UTF-8)
            encoding='utf-8',   # Set UTF-8 encoding explicitly
            errors='ignore',    # Ignore invalid characters
            bufsize=1
        )
        # Send the input to the process
        stdout, stderr = process.communicate(input=f"{input_text}\n", timeout=30)
        # Check the error output
        if process.returncode != 0:
            return f"Error in the model request: {stderr.strip()}"
        # Filter the response and remove control characters
        response = re.sub(r'\x1b\[.*?m', '', stdout)  # Remove ANSI escape codes
        # Extract the relevant answer
        return extract_relevant_answer(response)
    except subprocess.TimeoutExpired:
        process.kill()
        return "Timeout for the model request"
    except Exception as e:
        return f"An unexpected error has occurred: {str(e)}"
A more detailed explanation of the function follows. The process is started by "subprocess.Popen()", which executes the command that launches the LLM: ["ollama", "run", "llama3.1"]. Access to the input and output streams of the process (sending data and receiving results) is made possible by the parameters "stdin", "stdout" and "stderr". Communication takes place as UTF-8 encoded text (encoding='utf-8'). To improve interactivity, the buffer size for the I/O operations is set to line buffering (bufsize=1).
The input "input_text" is transmitted to the process, i.e. to the LLM, whereupon a response is generated on "stdout". The maximum time the process is given to deliver a response is 30 seconds (timeout=30). If it takes longer, a "subprocess.TimeoutExpired" exception is raised and the process is terminated with "process.kill()". The return code indicates whether the process was successful (returncode == 0); if not, an error message including "stderr" is returned. In practice, the app usually responds well before this limit. Finally, the response on "stdout" is cleaned up: ANSI color and formatting codes are removed with "re.sub(r'\x1b\[.*?m', '', stdout)". To extract and format the relevant part of the complete model response, "extract_relevant_answer()" is called. Errors that occur during communication are caught and returned as a general error message.
PDF CHAT APP [EXTRACTION OF RELEVANT ANSWERS]
The relevant response is extracted from the entire model response using the "extract_relevant_answer()" function. At the same time, the function removes simple formatting problems, specifically whitespace at the beginning and end of the combined character string that forms the response ("strip()"). Depending on the specific requirements of the app, the function can be extended to return only certain keywords or sentences (markers). The integration of additional rules for cleansing and formatting is also possible.
def extract_relevant_answer(full_response):
    """Extract the relevant response from the full model response."""
    response_lines = full_response.splitlines()
    # Search for the relevant answer; if there is a marker, it can be used here
    if response_lines:
        # Assume that the answer is returned as a complete block to be filtered
        return "\n".join(response_lines).strip()
    return "No answer received"
PDF CHAT APP [THE CONVERSATION CHAIN]
The conversation chain is created by the function "get_conversational_chain()". The function prepares the input for the LLM by combining a specific prompt and the context with the user's question. The model should receive a clear and structured input in order to deliver the best possible answer. A multi-level prompt schema (system message, human message "{input}" and a placeholder) is defined by "ChatPromptTemplate.from_messages()". The role of the model is defined by the system message ("system"). The human message then contains the user's question. The prompt (behavior of the model), the context (content of the PDF file) and the question (input of the app user) are combined in "input_text". The prepared input is sent to the LLM via the CLI using the function "query_llama_via_cli(input_text)". The output is saved as "response" and displayed in the Streamlit app with "st.write("PDF: ", response)".
def get_conversational_chain(context, ques):
    """Create the input for the model based on the prompt and context."""
    # Define the prompt behavior
    prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                """You are an intelligent and helpful assistant. Your goal is to provide the most accurate and detailed answers
                possible to any question you receive. Use all available context to enhance your answers, and explain complex
                concepts in a simple manner. If additional information might help, suggest further areas for exploration. If the
                answer is not available in the provided context, state this clearly and offer related insights when possible.""",
            ),
            ("human", "{input}"),
            ("placeholder", "{agent_scratchpad}"),
        ]
    )
    # Combine the context and the question
    input_text = f"Prompt: {prompt.format(input=ques)}\nContext: {context}\nQuestion: {ques}"
    # Request to the model
    response = query_llama_via_cli(input_text)
    st.write("PDF: ", response)  # The answer is displayed here
PDF CHAT APP [USER INPUT PROCESSING]
The user input is processed and forwarded to the LLM using the "user_input()" function. Specifically, the entire text of the uploaded PDF file is used as the context. The conversation chain function "get_conversational_chain()" is called to answer the user's question "user_question" using the context "context". In other words, this function enables the interaction between user and model.
def user_input(user_question, pdf_text):
    """Process the user input and call the model."""
    # Use the entire text of the PDF as context
    context = pdf_text
    # Configure and send the request
    get_conversational_chain(context, user_question)
PDF CHAT APP [MAIN SCRIPT]
The main logic of the Streamlit web application is defined by the "main()" function. Specifically, the Streamlit page is set up and the layout is defined. Currently, the layout only consists of the option to upload a PDF file ("st.file_uploader") and an input field ("st.text_input") in which the user's questions are entered. The user interface enables interaction with the model. If a PDF file is uploaded, it is read ("pdf_text = pdf_read(pdf_doc)"). If a question has been entered and confirmed, the request is processed ("user_input(user_question, pdf_text)"). In addition, the app's current memory consumption is displayed in the sidebar using "psutil".
def main():
    """Main function of the Streamlit application."""
    st.set_page_config(page_title="CHAT WITH YOUR PDF")
    st.header("PDF CHAT APP")
    pdf_text = ""
    pdf_doc = st.file_uploader("Upload your PDF Files and confirm your question", accept_multiple_files=True)
    if pdf_doc:
        pdf_text = pdf_read(pdf_doc)  # Read the entire PDF text
    user_question = st.text_input("Ask a Question from the PDF Files")
    if user_question and pdf_text:
        user_input(user_question, pdf_text)
    # Monitor RAM consumption
    process = psutil.Process(os.getpid())
    memory_usage = process.memory_info().rss / (1024 ** 2)  # Conversion to megabytes
    st.sidebar.write(f"Memory usage: {memory_usage:.2f} MB")

if __name__ == "__main__":
    main()
The app described is started with the following command.
streamlit run pca1.py

INDIVIDUAL CUSTOMIZATION OPTIONS
There are individual customization options in various areas. To improve the quality and relevance of the LLM's responses, the system prompt (the system message defined in the conversation chain) can be adapted. The behavior of the model can be directly controlled by specific instructions and context. You can experiment with different prompts to test how the model's responses change. The current system message still offers a lot of potential for customization; one possible variation is sketched after the following listing.
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """You are an intelligent and helpful assistant. Your goal is to provide the most accurate and detailed answers
            possible to any question you receive. Use all available context to enhance your answers, and explain complex
            concepts in a simple manner. If additional information might help, suggest further areas for exploration. If the
            answer is not available in the provided context, state this clearly and offer related insights when possible.""",
        ),
        ("human", "{input}"),
        ("placeholder", "{agent_scratchpad}"),
    ]
)
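One possible variation, shown here purely as an illustration, is a system message that restricts the model to the supplied PDF context and keeps answers short.

# Illustrative alternative system message (not part of the original app)
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """You are a precise assistant for question answering over PDF documents.
            Answer only on the basis of the provided context. If the context does not
            contain the answer, say so explicitly. Keep your answers short and refer to
            the passage you used.""",
        ),
        ("human", "{input}"),
        ("placeholder", "{agent_scratchpad}"),
    ]
)

Such a restriction tends to reduce answers that go beyond the document, at the cost of shorter and more conservative responses.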
Another customization option is to replace the currently used LLM Llama3.1. Various models (e.g. gemma2, mistral, phi3, qwen2 etc.) in different sizes (2B, 8B, 70B etc.) are available on Ollama. To be able to use a different model, it must first be downloaded and then the function "query_llama_via_cli()" in the Python script must be adapted. Specifically, the command ["ollama", "run", "llama3.1"], which starts the LLM, must be changed.
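A minimal sketch of this adjustment, assuming the desired model (here "mistral", purely as an example) has already been downloaded with "ollama run mistral":

import subprocess

MODEL_NAME = "mistral"  # example value; any model pulled via Ollama can be used here

process = subprocess.Popen(
    ["ollama", "run", MODEL_NAME],  # previously: ["ollama", "run", "llama3.1"]
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    text=True,
    encoding="utf-8",
    errors="ignore",
    bufsize=1,
)

The rest of "query_llama_via_cli()" remains unchanged.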

When using other models, it must be ensured that the available computing power is sufficient for local use. You also need to consider whether the model will be executed on GPUs or CPUs. The following example can be used as a rule of thumb to assess whether a model will work on your own computer. A model with 1 billion parameters (1B) requires around 2 to 3 GB RAM (2–3 bytes per parameter). The Task Manager (Windows) can be used to check to what extent the execution of the application affects the performance of the system.
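Based on this rule of thumb, a rough estimate can be calculated directly; the factor of 2 to 3 bytes per parameter is only an approximation, and actual memory usage depends heavily on how the model files are quantized.

def estimate_ram_gb(params_in_billions, bytes_per_param=2.5):
    """Rough RAM estimate based on the 2-3 bytes per parameter rule of thumb."""
    return params_in_billions * bytes_per_param

print(estimate_ram_gb(8))  # roughly 20 GB for an 8B model by this rule of thumb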

Alternatively, you can also integrate the application's memory usage directly into Streamlit. This is done using the "psutil" library, which must be installed (it is included in the pip command above) and imported in the Python script.
# Monitor RAM consumption
process = psutil.Process(os.getpid())
memory_usage = process.memory_info().rss / (1024 ** 2) # Conversion to megabytes
st.sidebar.write(f"Memory usage: {memory_usage:.2f} MB")
It is also possible to customize the layout and functionality of the application. For example, the main logic, specifically the "main()" function, can be extended so that several PDF files uploaded at once ("accept_multiple_files=True") are treated as separate contexts. The files can then be displayed in a list and selected by the user ("selected_pdf_file"). Processing then takes place as usual: depending on the selected file, the extracted context is forwarded to the LLM together with the user's question. The customized code can be found in the file "pca2.py"; a sketch of this adaptation follows below.
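A hedged sketch of how the "main()" function could be adapted for this; the use of "st.selectbox" and the variable names are illustrative assumptions and may differ from the actual "pca2.py". It relies on the "pdf_read()" and "user_input()" functions defined earlier in this article.

import streamlit as st

def main():
    """Sketch of a main() that treats each uploaded PDF as its own context."""
    st.set_page_config(page_title="CHAT WITH YOUR PDF")
    st.header("PDF CHAT APP")
    pdf_docs = st.file_uploader(
        "Upload your PDF Files and confirm your question",
        accept_multiple_files=True,
    )
    if pdf_docs:
        # Let the user pick which uploaded file should serve as the context
        file_names = [pdf.name for pdf in pdf_docs]
        selected_name = st.selectbox("Select a PDF file", file_names)
        selected_pdf_file = next(pdf for pdf in pdf_docs if pdf.name == selected_name)
        pdf_text = pdf_read([selected_pdf_file])  # pdf_read expects a list of files
        user_question = st.text_input("Ask a Question from the selected PDF file")
        if user_question and pdf_text:
            user_input(user_question, pdf_text)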

PCA PYTHON SCRIPT [DOWNLOAD]
Click on the folder to download the zip file with the two apps!

CONCLUSION
This article showed how Python in combination with tools such as Streamlit, FAISS, Spacy, CLI, OLLAMA and the LLM Llama3.1 can be used to create a web application that allows users to extract text from PDF files locally, save it in the form of embeddings and ask questions about the content of the file using an AI model. By further optimizing the scripts (e.g. prompt), using other models and adapting the layout to include additional functionalities, the app can offer added value in everyday life without incurring additional costs. Have fun customizing and using the app.

