How to Achieve Near Human-Level Performance in Chunking for RAGs


Good chunks make good RAGs.

Chunking, embedding, and indexing are critical aspects of RAGs. A RAG app that uses the appropriate chunking technique performs well in terms of output quality and speed.

When engineering an LLM pipeline, we use different strategies to split the text. Recursive character splitting is the most popular technique. It uses a sliding window with a fixed token length. However, a fixed window does not guarantee that a whole theme fits inside it, and part of the context can spill into different chunks.
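For reference, here's a minimal sketch of recursive character splitting with LangChain's RecursiveCharacterTextSplitter; the chunk_size and chunk_overlap values below are arbitrary choices for illustration, not recommendations.

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Fixed-size sliding window: 500 characters per chunk with 50 characters of overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
position_based_chunks = splitter.split_text(text)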

The other technique I love is semantic splitting. Semantic splitting breaks the text wherever there's a significant shift in meaning between two consecutive sentences. It has no length constraints, so a chunk can contain many sentences or very few. But it's more likely to capture the different themes accurately.
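As a point of reference, here's a minimal sketch using LangChain's experimental SemanticChunker; the OpenAI embedding model is an assumption, and the breakpoint threshold is left at its default.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Split wherever the embedding distance between consecutive sentences jumps sharply
semantic_splitter = SemanticChunker(OpenAIEmbeddings())
semantic_chunks = semantic_splitter.split_text(text)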

Even the semantic splitting approach has a problem.

What if sentences far from each other are closer in their meaning?

Semantic splitting fails to address this issue. If someone talks about politics, suddenly talks about climate change, and then comes back to talk about politics again, the semantic split may create three chunks. If you or I manually chunk it, we'd create only two – one for politics and the other for climate change.

If you'd like to learn how we implement semantic splitting in Python, here's a post I recently wrote on the topic.

Why Does Position-Based Chunking Lead to Poor Performance in RAGs?

Most of our spontaneous thinking is like that. We pivot in different directions and return to the topic again. Thus, chunking such documents inevitably requires a smarter approach.

If you're trying to embed transcripts of conversations, podcasts, or any content you don't know well enough beforehand to make educated assumptions about, neither recursive character splitting nor semantic splitting will help.

This is why we need an agentic approach to chunking. An agent that could act like us and create chunks more actively.

How Does Agentic Chunking Work?

In agentic chunking, an LLM processes every sentence in a passage and allocates it to a chunk of similar sentences, or creates a new chunk if none matches.

We use agentic chunking to address the limitations of recursive character and semantic splitting. It works because, instead of relying on a fixed token length or a change in semantic meaning, it actively evaluates every sentence and assigns it to a chunk. Because of this, agentic chunking can group two related sentences even when they are far apart in the document.

But for this to work, each sentence must be complete and self-contained on its own. Consider the following example.

On July 20, 1969, astronaut Neil Armstrong walked on the moon. He was leading NASA's Apollo 11 mission. Armstrong famously said, "That's one small step for man, one giant leap for mankind" as he stepped onto the lunar surface.

In agentic chunking, sentences are passed to the agent without their surrounding context. In the above passage, the second sentence, "He was leading NASA's Apollo 11 mission," carries no reference to the sentence before it. Thus, the LLM can't figure out who "He" is. Therefore, the passage should be converted to something like the following.

On July 20, 1969, astronaut Neil Armstrong walked on the moon.

Neil Armstrong was leading NASA's Apollo 11 mission.

Neil Armstrong famously said, "That's one small step for man, one giant leap for mankind" as he stepped onto the lunar surface.

This process is popularly known as propositioning.

Now, the LLM can check every sentence individually and allocate it to an existing chunk, or create a new one if no chunk is relevant. This is possible because every sentence now carries its own subject.

Implementing agentic chunking

Now, we have a rough idea of how agentic chunking works. We also know that the sentences need to be propositioned for it to work.

However, there are many different ways to implement this; no single package does it for us.

I frequently use Greg Kamradt's GitHub repo. But Greg's code is one of a million ways to implement agentic splitting. If you're interested in learning more about chunking techniques, Greg has created an excellent tutorial. I urge you to check it out here.

Before we dive into the implementation steps, here's our initial setup.
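The setup isn't shown here, so the following is a minimal sketch of the imports and configuration the later snippets rely on; it assumes the langchain, langchainhub, and langchain-openai packages are installed and that an OpenAI API key is available.

import os
from typing import List

from langchain import hub
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import ChatOpenAI

# Assumes OPENAI_API_KEY is already set in the environment,
# e.g., os.environ["OPENAI_API_KEY"] = "..." or loaded from a .env file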

Let's start with propositioning.

Propositioning the text

As we now understand propositioning, we can write our own prompt to have an LLM do this for us. Fortunately, an excellent prompt is already hosted in the LangChain Hub.

Let's pull the prompt template, create an LLM chain, and test it.

obj = hub.pull("wfh/proposal-indexing")

# You can explore the prompt template behind this by running the following:
# obj.get_prompts()[0].messages[0].prompt.template

llm = ChatOpenAI(model="gpt-4o")

# A Pydantic model to extract sentences from the passage
class Sentences(BaseModel):
    sentences: List[str]

extraction_llm = llm.with_structured_output(Sentences)

# Create the sentence extraction chain
extraction_chain = obj | extraction_llm

# Test it out
sentences = extraction_chain.invoke(
    """
    On July 20, 1969, astronaut Neil Armstrong walked on the moon.
    He was leading NASA's Apollo 11 mission.
    Armstrong famously said, "That's one small step for man, one giant leap for mankind" as he stepped onto the lunar surface.
    """
).sentences

>> ['On July 20, 1969, astronaut Neil Armstrong walked on the moon.',
 "Neil Armstrong was leading NASA's Apollo 11 mission.",
 'Neil Armstrong famously said, "That\'s one small step for man, one giant leap for mankind" as he stepped onto the lunar surface.']

The above code uses a Pydantic model to extract the sentences. This is the recommended way to get structured output from an LLM.

But on a large text, we can't do this effectively in one pass. The "he" in one sentence may refer to Neil Armstrong, while in a different paragraph "he" may refer to Alexander Graham Bell. It depends on what the document is about.

Thus, a good idea is to split the text into paragraphs and do propositioning within each paragraph.

paragraphs = text.split("\n\n")

propositions = []

for i, p in enumerate(paragraphs):
    result = extraction_chain.invoke(p)
    propositions.extend(result.sentences)

The above code snippet will create a list of propositions within each paragraph's context.

Create chunks using an LLM agent

Now, with propositioning done, we have individual sentences that speak for themselves. The document is ready for an agent to process.

The agent does a few things here.

The agent begins with an empty dictionary called chunks, where it stores all the chunks it creates. Each chunk contains propositions that share a similar theme. The goal of the agent is to group these propositions into chunks in the following format:

{
    "12345": {
        "chunk_id": "12345",
        "propositions": [
            "The month is October.",
            "The year is 2023."
        ],
        "title": "Date & Time",
        "summary": "This chunk contains information about dates and times, including the current month and year.",
    },
    "67890": {
        "chunk_id": "67890",
        "propositions": [
            "One of the most important things that I didn't understand about the world as a child was the degree to which the returns for performance are superlinear.",
            "Teachers and coaches implicitly told us that the returns were linear.",
            "I heard a thousand times that 'You get out what you put in.'"
        ],
        "title": "Performance Returns",
        "summary": "This chunk contains information about performance returns and how they are perceived differently from reality.",
    }
}

As it encounters a new proposition, the agent either adds it to an existing chunk or creates a new chunk if no suitable one is found. The decision on whether an existing chunk matches is based on the incoming proposition and the chunk's current summary.

Additionally, if new propositions are added to a chunk, the agent can update the chunk's summary and title to reflect the new information. This ensures that the metadata stays relevant as the chunk evolves.

Let's code them step by step.

Step I: A function to create chunks

When we start for the first time, there aren't any chunks, so we must create one to store our first proposition. Beyond that, we need a function to create a new chunk whenever the agent decides an incoming proposition needs one. Let's define a function to do this.

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)

chunks = {}

def create_new_chunk(chunk_id, proposition):
    summary_llm = llm.with_structured_output(ChunkMeta)

    summary_prompt_template = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "Generate a new summary and a title based on the propositions.",
            ),
            (
                "user",
                "propositions:{propositions}",
            ),
        ]
    )

    summary_chain = summary_prompt_template | summary_llm

    chunk_meta = summary_chain.invoke(
        {
            "propositions": [proposition],
        }
    )

    chunks[chunk_id] = {
        "summary": chunk_meta.summary,
        "title": chunk_meta.title,
        "propositions": [proposition],
    }

We store chunks outside the function because it is updated many times by this and the other functions.

In the above code, we used an LLM to generate a title and a summary for our chunk, based here on the single proposition the chunk starts with. Note that ChunkMeta, the Pydantic model that structures this output, is defined in the next step.
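As a quick, purely illustrative usage example (assuming the ChunkMeta model from Step II is already defined and propositions is the list we built earlier; the chunk id is arbitrary):

# Seed the very first chunk with the first proposition
create_new_chunk("00001", propositions[0])

print(chunks["00001"]["title"])
print(chunks["00001"]["summary"])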

Step II: A function to add a proposition to a chunk

As we scan through the document, every subsequent proposition needs to be added to a chunk. When we add a proposition, the chunk's existing title and summary may no longer accurately reflect its content. Thus, we re-evaluate them and rewrite them if necessary.

from langchain_core.pydantic_v1 import BaseModel, Field

class ChunkMeta(BaseModel):
    title: str = Field(description="The title of the chunk.")
    summary: str = Field(description="The summary of the chunk.")

def add_proposition(chunk_id, proposition):
    summary_llm = llm.with_structured_output(ChunkMeta)

    summary_prompt_template = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "If the current_summary and title is still valid for the propositions return them."
                "If not generate a new summary and a title based on the propositions.",
            ),
            (
                "user",
                "current_summary:{current_summary}nncurrent_title:{current_title}nnpropositions:{propositions}",
            ),
        ]
    )

    summary_chain = summary_prompt_template | summary_llm

    chunk = chunks[chunk_id]

    current_summary = chunk["summary"]
    current_title = chunk["title"]
    current_propositions = chunk["propositions"]

    all_propositions = current_propositions + [proposition]

    chunk_meta = summary_chain.invoke(
        {
            "current_summary": current_summary,
            "current_title": current_title,
            "propositions": all_propositions,
        }
    )

    chunk["summary"] = chunk_meta.summary
    chunk["title"] = chunk_meta.title
    chunk["propositions"] = all_propositions

The above function will add the proposition to an existing chunk. Again, we use another LLM call to decide whether to change the title and the summary. To make life easier, we configure the LLM with a Pydantic model so that the output is a structured object rather than free-form text.

Step III: An agent that pushes the proposition into the relevant chunk

Although the two functions we defined above do the work, they don't know whether an existing chunk is a good fit for the incoming proposition. If one is, we must call the add_proposition function with the respective chunk_id and the proposition. If not, we need to call the create_new_chunk function to make one.

The following function does it.

def find_chunk_and_push_proposition(proposition):

    class ChunkID(BaseModel):
        chunk_id: int = Field(description="The chunk id.")

    allocation_llm = llm.with_structured_output(ChunkID)

    allocation_prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "You have the chunk ids and the summaries"
                "Find the chunk that best matches the proposition."
                "If no chunk matches, return a new chunk id."
                "Return only the chunk id.",
            ),
            (
                "user",
                "proposition:{proposition}" "chunks_summaries:{chunks_summaries}",
            ),
        ]
    )

    allocation_chain = allocation_prompt | allocation_llm

    chunks_summaries = {
        chunk_id: chunk["summary"] for chunk_id, chunk in chunks.items()
    }

    best_chunk_id = allocation_chain.invoke(
        {"proposition": proposition, "chunks_summaries": chunks_summaries}
    ).chunk_id

    if best_chunk_id not in chunks:
        create_new_chunk(best_chunk_id, proposition)
        return

    add_proposition(best_chunk_id, proposition)

The above function uses an LLM to decide whether to create a new chunk or push the proposition to an existing one. It also calls the respective functions to do this.
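To tie everything together, a hypothetical driver loop simply feeds every proposition through this function and then inspects the resulting chunks (assuming propositions is the list built during propositioning):

for proposition in propositions:
    find_chunk_and_push_proposition(proposition)

for chunk_id, chunk in chunks.items():
    print(chunk_id, chunk["title"], len(chunk["propositions"]))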

This is a simplified implementation and explanation of agentic chunking. You need to make many design decisions along the way. For instance, you could run a similarity search instead of letting an LLM compare the summaries with the incoming proposition.
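Here's a rough sketch of that embedding-based alternative; the OpenAI embedding model, the cosine-similarity computation, and the 0.8 threshold are all assumptions for illustration, not part of the approach above.

import numpy as np
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

def find_chunk_by_similarity(proposition, threshold=0.8):
    # Allocate by embedding similarity instead of an LLM call
    prop_vec = np.array(embeddings.embed_query(proposition))

    best_id, best_score = None, -1.0
    for chunk_id, chunk in chunks.items():
        summary_vec = np.array(embeddings.embed_query(chunk["summary"]))
        score = float(
            prop_vec @ summary_vec
            / (np.linalg.norm(prop_vec) * np.linalg.norm(summary_vec))
        )
        if score > best_score:
            best_id, best_score = chunk_id, score

    # Create a new chunk when nothing is similar enough
    if best_id is None or best_score < threshold:
        create_new_chunk(str(len(chunks) + 1), proposition)
    else:
        add_proposition(best_id, proposition)

This swaps the allocation LLM call for one embedding call per existing chunk; caching the summary embeddings would keep the cost down.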

Final thoughts

Agentic chunking is undoubtedly a powerful technique for chunking documents. Like semantic chunking, it creates meaningful chunks. But it goes a step further by grouping sentences under the same theme even when they are far apart in the document.

The primary consideration is the number of LLM calls it makes: every call costs money and adds latency to the process. Agentic chunking is known to be a slow and costly chunking strategy. Thus, you must be mindful of your budget for large projects.
