Monitoring unstructured data for LLM and NLP

Once you deploy an NLP or LLM-based solution, you need a way to keep tabs on it. But how do you monitor unstructured data to make sense of the pile of texts?

There are a few approaches here, from detecting drift in raw text data and embedding drift to using regular expressions to run rule-based checks.

In this tutorial, we'll dive into one particular approach – tracking interpretable text descriptors that help assign specific properties to every text.

First, we'll cover some theory:

  • What text descriptors are and when to use them.
  • Examples of text descriptors.
  • How to select custom descriptors.

Next, we'll get to the code! You will work with e-commerce review data and go through the following steps:

  • Get an overview of the text data.
  • Evaluate text data drift using standard descriptors.
  • Add a custom text descriptor using an external pre-trained model.
  • Implement pipeline tests to monitor data changes.

We will use the Evidently open-source Python library to generate text descriptors and evaluate changes in the data.

_Code example: If you prefer to go straight to the code, here is the example notebook._
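
If you want to follow along outside the ready-made notebook, a minimal setup might look like this. The file name and column name are placeholders for your own review data, and the NLTK downloads are an assumption about the resources some of the built-in descriptors rely on:

```python
# pip install evidently nltk pandas

import nltk
import pandas as pd

# Dictionaries some text descriptors rely on (vocabulary list, sentiment lexicon).
nltk.download("words")
nltk.download("vader_lexicon")

# Placeholder: load your own review data; we assume a text column "Review_Text".
reviews = pd.read_csv("reviews.csv")
```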

What is a text descriptor?

A text descriptor is any feature or property that describes objects in the text dataset. For example, the length of texts or the number of symbols in them.

You might already have helpful metadata to accompany your texts that will serve as descriptors. For example, e-commerce user reviews might come with user-assigned ratings or topic labels.

Otherwise, you can generate your own descriptors! You do this by adding "virtual features" to your text data. Each helps describe or classify your texts using some meaningful criteria.

By creating these descriptors, you basically come up with your own simple "embedding" and map each text to several interpretable dimensions. This helps make sense of the otherwise unstructured data.
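
As a minimal plain-pandas illustration (the texts and column names are made up), each descriptor is simply a new column computed from the raw text:

```python
import pandas as pd

sample_reviews = pd.DataFrame({"Review_Text": [
    "Great dress, fits perfectly!",
    "ggg",
    "The fabric ripped after one wash. Very disappointed.",
]})

# Each "virtual feature" maps a text to one interpretable dimension.
sample_reviews["length_chars"] = sample_reviews["Review_Text"].str.len()
sample_reviews["length_words"] = sample_reviews["Review_Text"].str.split().str.len()
sample_reviews["mentions_dress"] = sample_reviews["Review_Text"].str.contains("dress", case=False)

print(sample_reviews)
```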

You can then use these text descriptors:

  • To monitor production NLP models. You can track the properties of your data over time and detect when they change. For example, descriptors help detect text length spikes or drift in sentiment (see the sketch after this list).
  • To test models during updates. When you iterate on models, you can compare the properties of the evaluation datasets and model responses. For example, you can check that the lengths of the LLM-generated answers remain similar, and they consistently include words you expect to see.
  • To debug data drift or model decay. If you detect embedding drift or directly observe a drop in the model quality, you can use text descriptors to explore where it comes from.
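
With Evidently, one way to do this is to compare a reference batch against the current production batch at the descriptor level. A sketch, assuming two pandas DataFrames `reference` and `current` that both contain a `Review_Text` column, and an Evidently version where descriptors attach via `.for_column()`:

```python
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metrics import ColumnDriftMetric
from evidently.descriptors import TextLength, Sentiment

# Tell Evidently which columns contain raw text.
column_mapping = ColumnMapping(text_features=["Review_Text"])

# Check whether the descriptor distributions drifted between the two batches.
report = Report(metrics=[
    ColumnDriftMetric(column_name=TextLength().for_column("Review_Text")),
    ColumnDriftMetric(column_name=Sentiment().for_column("Review_Text")),
])
report.run(reference_data=reference, current_data=current,
           column_mapping=column_mapping)
report.save_html("descriptor_drift.html")
```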

Examples of text descriptors

Here are a few text descriptors we consider good defaults:

Text length

An excellent place to start is simple text statistics. For example, you can look at the length of texts measured in words, symbols, or sentences. You can evaluate average and min-max length and look at distributions.

You can set expectations based on your use case. Say, product reviews tend to be between 5 and 100 words. If they are shorter or longer, this might signal a change in context. If there is a spike in fixed-length reviews, this might signal a spam attack. If you know that negative reviews are often longer, you can track the share of reviews above a certain length.

There are also quick sanity checks: if you run a chatbot, you might expect responses to be non-empty or to meet some minimum length for meaningful output.
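
As a quick pandas check against the example bounds above (5 to 100 words, assuming the `reviews` DataFrame from the setup):

```python
# Word count per review, plus the share outside the "normal" 5-100 word range.
reviews["length_words"] = reviews["Review_Text"].str.split().str.len()

too_short = (reviews["length_words"] < 5).mean()
too_long = (reviews["length_words"] > 100).mean()

print(f"Share of suspiciously short reviews: {too_short:.1%}")
print(f"Share of suspiciously long reviews:  {too_long:.1%}")
```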

Out-of-vocabulary words

Evaluating the share of words outside the defined vocabulary is a good "crude" measure of data quality. Did your users start writing reviews in a new language? Are users talking to your chatbot in Python, not English? Are users filling the responses with "ggg" instead of actual words?

This is a single practical measure to detect all sorts of changes. Once you catch a shift, you can then debug deeper.

You can shape expectations about the share of OOV words based on the examples from "good" production data accumulated over time. For example, if you look at the corpus of previous product reviews, you might expect OOV to be under 10% and monitor if the value goes above this threshold.
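
Evidently ships an OOV descriptor for exactly this; to make the measure concrete, here is a rough do-it-yourself version against NLTK's English word list (assuming the `reviews` DataFrame from the setup and `nltk.download("words")`):

```python
import re
from nltk.corpus import words

# English vocabulary from NLTK (requires: nltk.download("words")).
VOCAB = {w.lower() for w in words.words()}

def oov_share(text: str) -> float:
    """Share of tokens that are not in the reference vocabulary."""
    tokens = [t.lower() for t in re.findall(r"[a-zA-Z']+", text)]
    if not tokens:
        return 0.0
    return sum(t not in VOCAB for t in tokens) / len(tokens)

reviews["oov_share"] = reviews["Review_Text"].apply(oov_share)

# Example expectation from above: flag the batch if OOV goes above 10%.
print("Mean OOV share:", reviews["oov_share"].mean())
```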

Non-letter characters

Related, but with a twist: this descriptor will count all sorts of special symbols that are not letters or numbers, including commas, brackets, hashes, etc.

Sometimes you expect a fair share of special symbols: your texts might contain code or be structured as JSON. Sometimes, you only expect punctuation marks in human-readable text.

Detecting a shift in non-letter characters can expose data quality issues, like HTML codes leaking into the texts of the reviews, spam attacks, unexpected use cases, etc.
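
Evidently's NonLetterCharacterPercentage descriptor covers this; a library-free approximation simply counts characters that are neither letters, digits, nor whitespace (again assuming the `reviews` DataFrame):

```python
import re

def non_letter_share(text: str) -> float:
    """Share of characters that are not letters, digits, or whitespace."""
    if not text:
        return 0.0
    special = re.findall(r"[^A-Za-z0-9\s]", text)
    return len(special) / len(text)

reviews["non_letter_share"] = reviews["Review_Text"].apply(non_letter_share)

# A sudden jump here may mean HTML tags, JSON payloads, or spam in the reviews.
print("Mean non-letter share:", reviews["non_letter_share"].mean())
```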

Sentiment

Text sentiment is another indicator. It is helpful in various scenarios: from chatbot conversations to user reviews and writing marketing copy. You can typically set an expectation about the sentiment of the texts you deal with.

Even if the sentiment "does not apply," this might translate to the expectation of a primarily neutral tone. The potential appearance of either a negative or positive tone is worth tracking and looking into. It might indicate unexpected usage scenarios: is the user using your virtual mortgage advisor as a complaint channel?

You might also expect a certain balance: for example, there is always a share of conversations or reviews with a negative tone, but you'd expect it not to exceed a certain threshold or the overall distribution of review sentiment to remain stable.
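
Evidently has a built-in Sentiment descriptor; if you prefer to compute a score yourself, one lightweight option is NLTK's VADER analyzer (assuming the `reviews` DataFrame from the setup and the `vader_lexicon` resource downloaded):

```python
from nltk.sentiment import SentimentIntensityAnalyzer

# Requires: nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()

# Compound score ranges from -1 (most negative) to +1 (most positive).
reviews["sentiment"] = reviews["Review_Text"].apply(
    lambda text: sia.polarity_scores(text)["compound"]
)

# Example expectation: the share of clearly negative reviews stays bounded.
negative_share = (reviews["sentiment"] < -0.5).mean()
print(f"Share of negative reviews: {negative_share:.1%}")
```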

Trigger words

You can also check whether the texts contain words from a specific list or lists and treat this as a binary feature.

This is a powerful way to encode multiple expectations about your texts. It takes some effort to curate the lists manually, but you can design many handy checks this way. For example, you can create lists of trigger words like:

  • Mentions of products or brands.
  • Mentions of competitors.
  • Mentions of locations, cities, places, etc.
  • Mentions of words that represent particular topics.

You can curate (and continuously extend) lists like this that are specific to your use case.

For example, if an advisor chatbot helps choose between products offered by the company, you might expect most of the responses to contain the names of one of the products from the list.
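
With Evidently, this idea maps to the TriggerWordsPresence descriptor, which produces a binary flag per text. A sketch, reusing the `reference`, `current`, and `column_mapping` objects from the drift example, with a hypothetical product-term list and the `words_list` argument as I understand the descriptor's API:

```python
from evidently.report import Report
from evidently.metrics import ColumnSummaryMetric
from evidently.descriptors import TriggerWordsPresence

# Binary descriptor: does the review mention any word from the curated list?
product_terms = ["dress", "gown", "skirt"]  # replace with your own list

report = Report(metrics=[
    ColumnSummaryMetric(
        column_name=TriggerWordsPresence(words_list=product_terms)
        .for_column("Review_Text")
    ),
])
report.run(reference_data=reference, current_data=current,
           column_mapping=column_mapping)
report.save_html("trigger_words.html")
```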

RegExp matches

The inclusion of specific words from the list is one example of a pattern you can formulate as a regular expression. You can come up with others: do you expect your texts to start with "hello" and end with "thank you"? Include emails? Contain known named elements?

If you expect the model inputs or outputs to match a specific format, you can use regular expression match as another descriptor.
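
Evidently also has a RegExp descriptor for pattern checks; here is a plain-Python sketch with the standard `re` module, using the "hello ... thank you" and email examples above (the patterns are illustrative):

```python
import re

# Pattern 1: chatbot-style responses that start with "hello" and end with "thank you".
polite_pattern = re.compile(r"^hello\b.*\bthank you\W*$",
                            flags=re.IGNORECASE | re.DOTALL)
print(bool(polite_pattern.match("Hello! Here is your order status. Thank you.")))

# Pattern 2: texts that contain something resembling an email address.
email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
reviews["contains_email"] = reviews["Review_Text"].apply(
    lambda text: bool(email_pattern.search(text))
)
```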

Custom descriptors

You can extend this idea further. For example:

  • Evaluate other text properties: toxicity, subjectivity, the formality of the tone, readability score, etc. You can often find open pre-trained models to do the trick (see the sketch after this list).
  • Count specific components: emails, URLs, emojis, dates, and parts of speech. You can use external models or even simple regular expressions.
  • Get granular with stats: you can track very detailed text statistics if they are meaningful to your use case, e.g., track average lengths of words, whether they are upper or lower case, the ratio of unique words, etc.
  • Monitor personally identifiable information: for example, when you do not expect it to come up in chatbot conversations.
  • Use named entity recognition: to extract specific entities and treat them as tags.
  • Use topic modeling to build a topic monitoring system. This is the most laborious approach but powerful when done right. It is useful when you expect the texts to stay mostly on-topic and have the corpus of previous examples to train the model. You can use unsupervised topic clustering and create a model to assign new texts to known clusters. You can then treat assigned classes as descriptors to monitor the changes in the distribution of topics in the new data.
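
For example, here is how you could score texts with an external pre-trained model from Hugging Face and store the result as a new column (assuming the `reviews` DataFrame from the setup; the model name is only an illustration):

```python
from transformers import pipeline

# Any pre-trained classifier can act as a descriptor "scorer".
# The model below is illustrative - swap in one that scores the property
# you care about (emotion, toxicity, formality, ...).
scorer = pipeline("text-classification",
                  model="distilbert-base-uncased-finetuned-sst-2-english")

def model_score(text: str) -> float:
    """Map the classifier output to a single signed score."""
    result = scorer(text, truncation=True)[0]
    return result["score"] if result["label"] == "POSITIVE" else -result["score"]

reviews["model_sentiment"] = reviews["Review_Text"].apply(model_score)
```

Because the score lands in a regular numeric column, you can track it with the same drift metrics and tests as the built-in descriptors.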

Here are a few things to keep in mind when designing descriptors to monitor:

  • It is best to stay focused and try to find a small number of suitable quality indicators that match the use case rather than monitor all possible dimensions. Think of descriptors as model features. You want to find a few strong ones rather than generate a lot of weak or unhelpful features. Many of them are bound to be correlated: language and share of OOV words, length in sentences and symbols, etc. Pick your favorite!
  • Use exploratory data analysis to evaluate text properties in existing data (for example, logs of previous conversations) to test your assumptions before adding them to model monitoring.
  • Learn from model failures. Whenever you face an issue with production model quality that you expect to reappear (e.g., texts in a foreign language), consider how to develop a test case or a descriptor to add to detect it in the future.
  • Mind the computation cost. Using external models to score your texts by every possible dimension is tempting, but this comes at a cost. Consider it when working with larger datasets: every external classifier is an extra model to run. You can often get away with fewer or simpler checks.

Step-by-step tutorial

To illustrate the idea, let's walk through the following scenario: you are building a classifier model to score reviews that users leave on an e-commerce website and tag them by topic. Once it is in production, you want to detect changes in the data and model environment, but you do not have the true labels. You need to run a separate labeling process to get them.

How can you keep tabs on the changes without the labels?

Let's take an example dataset and go through the steps outlined above.

_Code example: head to the example notebook to follow all the steps._
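
To preview the pipeline-test step, here is a sketch of an Evidently TestSuite that turns two descriptor expectations into pass/fail checks (assuming the `reference`, `current`, and `column_mapping` objects from the earlier snippets; exact test names and arguments may vary across versions):

```python
from evidently.test_suite import TestSuite
from evidently.tests import TestColumnDrift, TestColumnValueMin
from evidently.descriptors import TextLength, Sentiment

tests = TestSuite(tests=[
    # Sanity check: no empty reviews should reach the model.
    TestColumnValueMin(column_name=TextLength().for_column("Review_Text"), gt=0),
    # The sentiment distribution should stay close to the reference batch.
    TestColumnDrift(column_name=Sentiment().for_column("Review_Text")),
])
tests.run(reference_data=reference, current_data=current,
          column_mapping=column_mapping)
tests.save_html("descriptor_tests.html")  # or parse tests.as_dict() in a pipeline
```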
