Why Does Position-Based Chunking Lead to Poor Performance in RAGs?


Neighboring sentences could still be about different things.

Language models come with a context limit. For newer OpenAI models, this is around 128k tokens, roughly 80k English words. This may sound big enough for most use cases. Still, large production-grade applications often need to refer to more than 80k words, not to mention images, tables, and other unstructured information.

Even if we manage to pack everything within the context window, stuffing it with irrelevant information makes LLM performance drop significantly.

This is where RAG helps. RAG retrieves the relevant information from an embedded source and passes it as context to the LLM. To retrieve that 'relevant information,' we must first divide the documents into chunks. Thus, chunking plays a vital role in a RAG pipeline.

Chunking helps the RAG retrieve specific pieces of a large document. However, small changes in the chunking strategy can significantly impact the responses the LLM generates.

How to Build Helpful RAGs with Query Routing.

There are many different ways to chunk documents. For an in-depth understanding of the different chunking techniques, I suggest reading Han HELOIR, Ph.D. ☕️‘s latest post.

In this post, we'll talk mainly about two specific techniques: the popular Recursive Character Splitting and Semantic Text Splitting. I believe semantic splitting is quite underrated, and we'll find out why.

The widely used chunking technique

Recursive character splitting is a prevalent chunking technique. It's a one-liner that offers decent results. Why would anyone complain about it?

from langchain.text_splitter import RecursiveCharacterTextSplitter

text = """
Hydroponics is an intelligent way to grow veggies indoors or in small spaces. In hydroponics, plants are grown without soil, using only a substrate and nutrient solution. The global population is rising fast, and there needs to be more space to produce food for everyone. Besides, transporting food for long distances involves lots of issues. You can grow leafy greens, herbs, tomatoes, and cucumbers with hydroponics. 
"""

rc_splits = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=20, chunk_overlap=2
).split_text(text)

for split in rc_splits:
    print(split)
    print("n"*2 + "="*80 + "n"*2)

The above code prints the splits on the screen as follows:

================================================================================

In hydroponics, plants are grown without soil, using only a substrate and nutrient solution. The

================================================================================

The global population is rising fast, and there needs to be more space to produce food for everyone.

================================================================================

everyone. Besides, transporting food for long distances involves lots of issues. You can grow leafy

================================================================================

leafy greens, herbs, tomatoes, and cucumbers with hydroponics.

================================================================================

However, recursive character splitting's ability to accurately represent information sometimes falls short. To understand this, first, let's talk about how recursive character split works.

The recursive character splitter splits the text at roughly equal token lengths. It prefers to break the text at a logical boundary, such as a period (.), a question mark (?), or an exclamation mark (!). Because of the fixed-length splitting, a chunk may or may not capture the entire theme it's supposed to cover. Splitting with a window helps: slide the starting point of every chunk not to the end of the previous split but a little earlier. This creates an overlap between the chunks.

Here's how a recursive character splitter with a chunk size of 20 and an overlap window of 2 would split a text:

Recursive character splitting – Image by the author.

Note that the chunks do not necessarily have a meaning on their own. Take the last chunk, for instance: "leafy greens, herbs, tomatoes, and cucumbers with hydroponics." doesn't mean anything by itself.

In real apps, you'd choose large chunk sizes, like 1000. You can also pick a large overlap size, like 100, to include as many sentences as possible. Still, you may not be able to capture every related sentence together.
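For instance, a configuration with those values might look like this (a minimal sketch; the numbers are just the ones mentioned above, not a recommendation):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Larger chunks with a modest overlap, as discussed above (assumed values)
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000, chunk_overlap=100
)
rc_splits = splitter.split_text(text)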

Besides, there's no guarantee that two neighboring sentences are related at all. They could be about two different things, even if they appear in the same paragraph.

Thus, to arrive at a perfect chunk size and overlap size, you must understand the linguistic properties of the text you're splitting. That's not always possible. And even then, you'll only be optimizing the average length. Your splits might still contain partial sentences, sentences that don't relate to the same topic, or chunks that don't fully capture a topic.

This is where semantic splitting comes in handy.

A quick note: although I'm advocating for semantic splitting in this post, I use Recursive Character Splitting extensively in my projects. It's easy, cost-effective, and sufficiently accurate if you spend a little time tweaking the parameters.

5 Proven Query Translation Techniques To Boost Your RAG Performance

Semantic Text Splitting

Two consecutive sentences may not be related. It makes more sense to split the text at the point where semantic meaning changes significantly.

Semantic text splitting does not depend on a fixed token size. It's a more systematic approach that chunks the text based on its meaning. Chunks can have varying lengths unless we specify a maximum, and each chunk contains complete sentences.

Since semantic splitting produces chunks with a distinct meaning and whole sentences, a sliding window is often unnecessary.

Here's how the same example would look if we used semantic splitting.

Semantic text splitting – Image by the author.

As far as I know, there's no package that does semantic splitting, so we need to implement it ourselves. But it's also an excellent opportunity to grasp the concepts and make changes accordingly.

Here are the steps involved in splitting documents semantically. You can, of course, tweak them and do it in many other ways.

Step I: Split the text into sentences / start with some initial chunks

Semantic splitting works by adding subsequent sentences until the following sentence's meaning changes significantly. For this, we need to get the text broken into sentences.

You can do this in many ways. A simple regex split could do the job. If not, you can use something like NLTK or a Hugging Face transformer model for advanced use cases (a sketch follows the snippet below).

import re

# Split the text into sentences at sentence-ending punctuation
sentences = re.split(r"(?<=[.?!])\s+", text)
initial_chunks = [
    {"chunk": str(sentence), "index": i} for i, sentence in enumerate(sentences)
]

Also, to store extra information about each sentence, we make it a list of dictionaries instead of a plain list of strings.
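If the regex isn't robust enough for your documents, you can swap in a proper sentence tokenizer. Here's a minimal sketch using NLTK's sent_tokenize (my choice for illustration; any sentence tokenizer works):

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # one-time download of the sentence tokenizer model

# Same structure as before; only the sentence splitting changes
sentences = sent_tokenize(text)
initial_chunks = [
    {"chunk": str(sentence), "index": i} for i, sentence in enumerate(sentences)
]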

Step II: Combining Chunks with Overlapping Sentences

This is an optional but helpful step. Semantic splitting can work better when the neighboring sentences are also considered for similarity. However, this overlap is used only to group sentences by their semantic meaning. Unlike recursive character splitting, the final chunks won't contain the overlapping segments.

# Function to combine each chunk with its neighboring sentences
def combine_chunks(chunks):
    for i in range(len(chunks)):
        combined_chunk = ""

        # Include the previous sentence, if any
        if i > 0:
            combined_chunk += chunks[i - 1]["chunk"] + " "

        combined_chunk += chunks[i]["chunk"]

        # Include the next sentence, if any
        if i < len(chunks) - 1:
            combined_chunk += " " + chunks[i + 1]["chunk"]

        chunks[i]["combined_chunk"] = combined_chunk

    return chunks

# Combine chunks
combined_chunks = combine_chunks(initial_chunks)

Step III: Create embeddings of each sentence / combined sentence

The next step is to create embeddings for our sentences. If you did the optional Step II, the sentence combination, you can embed the combined_chunk field instead.

For embedding, I'm using OpenAI's embedding model, but you can use any embedding model of your choice.

from langchain.embeddings import OpenAIEmbeddings

# Instantiate the embedding model (OpenAI here; any embedding model works)
embeddings = OpenAIEmbeddings()

# Embed the combined chunks
chunk_embeddings = embeddings.embed_documents(
    [chunk["combined_chunk"] for chunk in combined_chunks]

    # If you haven't created combined_chunk, use the following instead:
    # [chunk["chunk"] for chunk in combined_chunks]
)

# Add embeddings to chunks
for i, chunk in enumerate(combined_chunks):
    chunk["embedding"] = chunk_embeddings[i]

We've attached the embedding to each chunk. Since we made initial_chunks a list of dictionaries, this is easy.

Step IV: Calculate semantic distances between the chunks

The next and most crucial step is to calculate the distances. We've created vector representations of our sentences in the previous step, so we can now compute the cosine similarity between them. Cosine similarity ranges from -1 to 1, though for text embeddings it's typically positive. We subtract this number from one to get a distance.

We also store the distance to the following sentence in every chunk in our initial_chunks list.

from sklearn.metrics.pairwise import cosine_similarity

def calculate_cosine_distances(chunks):
    distances = []
    for i in range(len(chunks) - 1):
        current_embedding = chunks[i]["embedding"]
        next_embedding = chunks[i + 1]["embedding"]

        similarity = cosine_similarity([current_embedding], [next_embedding])[0][0]
        distance = 1 - similarity

        distances.append(distance)
        chunks[i]["distance_to_next"] = distance

    return distances

# Calculate cosine distances
distances = calculate_cosine_distances(combined_chunks)  

The above code calculates the distances between every chunk and its next one. We can now split the chunks wherever there's a spike in the distance.

Step V: Find the chunk indices where the distance to the next has a spike

Finding the spikes is easy. We can use NumPy's percentile function to find a threshold_value and then collect the indices where the distance exceeds it.

import numpy as np

threshold_percentile = 90

threshold_value = np.percentile(distances, threshold_percentile)

crossing_points = [
    i for i, distance in enumerate(distances) if distance > threshold_value
]

len(crossing_points)

>> 5

A good idea is to visualize the segments. The code below uses a seaborn chart to show the chunks.

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

def visualize_cosine_distances_with_thresholds_multicolored(
    cosine_distances, threshold_percentile=90
):
    # Calculate the threshold value based on the percentile
    threshold_value = np.percentile(cosine_distances, threshold_percentile)

    # Identify the points where the cosine distance crosses the threshold
    crossing_points = [0]  # Start with the first segment beginning at index 0

    crossing_points += [
        i
        for i, distance in enumerate(cosine_distances)
        if distance > threshold_value
    ]

    crossing_points.append(
        len(cosine_distances)
    )  # Ensure the last segment goes to the end

    # Set up the plot
    plt.figure(figsize=(14, 6))
    sns.set(style="white")  # Change to white to turn off gridlines

    # Plot the cosine distances
    sns.lineplot(
        x=range(len(cosine_distances)),
        y=cosine_distances,
        color="blue",
        label="Cosine Distance",
    )

    # Plot the threshold line
    plt.axhline(
        y=threshold_value,
        color="red",
        linestyle="--",
        label=f"{threshold_percentile}th Percentile Threshold",
    )

    # Highlight segments between threshold crossings with different colors
    colors = sns.color_palette(
        "hsv", len(crossing_points) - 1
    )  # Use a color palette for segments
    for i in range(len(crossing_points) - 1):
        plt.axvspan(
            crossing_points[i], crossing_points[i + 1], color=colors[i], alpha=0.3
        )

    # Add labels and title
    plt.title(
        "Cosine Distances Between Segments with Multicolored Threshold Highlighting"
    )
    plt.xlabel("Segment Index")
    plt.ylabel("Cosine Distance")
    plt.legend()

    # Adjust the x-axis limits to remove extra space
    plt.xlim(0, len(cosine_distances) - 1)

    # Display the plot
    plt.show()

    return crossing_points

# Example usage with the distances list and a 90th-percentile threshold
crossing_points = visualize_cosine_distances_with_thresholds_multicolored(
    distances, threshold_percentile=90
)

Running the above produces an illustrative chart like the one below:

Seaborn chart to illustrate semantic chunking – Image by the author.

Advanced Recursive and Follow-Up Retrieval Techniques For Better RAGs

Putting it all together

We have created semantic chunks of text in five steps. This is undoubtedly much more code than Recursive character splitting. But it comes with its benefits.
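The five steps end at the crossing points, so one last bit of wiring is needed to turn them into final chunks: join the sentences between consecutive spikes. Here's a minimal sketch of how I'd do it (my own helper, not a fixed recipe):

# Group the original sentences into final chunks at the crossing points
def build_final_chunks(chunks, crossing_points):
    final_chunks = []
    start = 0
    # Treat the end of the list as one last boundary
    for boundary in crossing_points + [len(chunks) - 1]:
        sentences = [c["chunk"] for c in chunks[start : boundary + 1]]
        final_chunks.append(" ".join(sentences))
        start = boundary + 1
    return final_chunks

final_chunks = build_final_chunks(initial_chunks, crossing_points)
for chunk in final_chunks:
    print(chunk)
    print("\n" * 2 + "=" * 80 + "\n" * 2)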

Let's see how the two methods compare. I've created a synthetic paragraph about AI's benefits to illustrate my point. Here's how each method splits the text.

Original text:


Artificial Intelligence is transforming various fields, from healthcare, where it enables early diagnosis and personalized treatments, to education, where AI tailors learning experiences to individual needs. In sustainability, AI helps optimize farming, reduce waste, and combat climate change through advanced predictions and resource management. Businesses are leveraging AI not just for automation but also to enhance decision-making and free up human creativity. AI is even making a mark in the creative arts, blending technology with artistic expression in new ways. While challenges exist, by focusing on ethical use, AI can amplify human abilities, improve our lives, and shape a smarter, more connected future.

Recursive Character splitting:

Artificial intelligence is transforming various fields, from healthcare, where it enables early diagnosis and personalized treatments, to education, where AI tailors learning experiences to individual needs. In sustainability, AI helps optimize farming, reduce waste, and combat climate change through

================================================================================

and combat climate change through advanced predictions and resource management. Businesses are leveraging AI not just for automation but also to enhance decision-making and free up human creativity. AI is even making a mark in the creative arts, blending technology with artistic expression in

================================================================================

technology with artistic expression in new ways. While challenges exist, by focusing on ethical use, AI can amplify human abilities, improve our lives, and shape a smarter, more connected future.

================================================================================

As you can see, recursive splitting breaks sentences in the middle, and often there's no single theme behind each chunk. Let's see how semantic splitting does it.

Semantic splitting:

Artificial intelligence is transforming various fields, from healthcare, where it enables early diagnosis and personalized treatments, to education, where AI tailors learning experiences to individual needs. In sustainability, AI helps optimize farming, reduce waste, and combat climate change through advanced predictions and resource management.

================================================================================

Businesses are leveraging AI not just for automation but also to enhance decision-making and free up human creativity. AI is even making a mark in the creative arts, blending technology with artistic expression in new ways. While challenges exist, by focusing on ethical use, AI can amplify human abilities, improve our lives, and shape a smarter, more connected future.

================================================================================

Semantic splitting produces only two chunks, and every sentence is complete. The first chunk discusses the various fields AI is transforming and its role in sustainability. The second discusses how AI transforms businesses and the creative arts.

This is preferable to meaningless, position-based chunking. One caveat, however, is that semantic splitting needs a larger text to split: calculating the n-th percentile is more meaningful when there are more data points. RAG applications usually deal with large amounts of text anyway, so semantic splitting remains appropriate.

Final thoughts

Recursive character splitting is the go-to strategy for many. It's easy to understand and implement. After all, it's just one line of code.

I must admit that recursive character splitting works well on most occasions. However, its ability to create meaningful chunks is limited. You must play around with the chunk size and overlap parameters to find the sweet spot, and you may have to repeat that tuning for every source if you work with multiple sources.

Semantic chunking systematically breaks the text, so you don't have to worry about the parameters. Of course, you need to set the threshold percentile, but you rarely have problems with this. Semantic chunking will break your text into meaningful segments.

One consideration with semantic chunking is its heavy use of the embedding API. Be mindful of the associated cost.

With recursive character splitting, you chunk mechanically and then embed. With semantic splitting, you embed each initial chunk to calculate distances and create the actual chunks, then embed them again when you store them in the vector store.
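To make that second pass concrete, here's a minimal sketch of indexing the final chunks with LangChain's FAISS wrapper (my choice for illustration; any vector store works, and final_chunks is the list built earlier):

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

embeddings = OpenAIEmbeddings()

# The final semantic chunks are embedded a second time when they're indexed
vector_store = FAISS.from_texts(final_chunks, embeddings)
retriever = vector_store.as_retriever()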


Thanks for reading, friend! Say Hi to me on LinkedIn, Twitter, and Medium.
