From Chaos to Clarity: Streamlining Data Cleansing Using Large Language Models

In the digital age, accurate and reliable data is paramount for businesses striving to deliver personalized experiences and make informed decisions[1]. However, the sheer volume and complexity of data often pose significant challenges requiring many hours of tedious and manual work. Enter the game-changing technology of Large Language Models (LLMs). These advanced AI tools, with their natural language processing capabilities and pattern recognition, have the potential to revolutionize the process of cleansing data to make it more usable.
Among the wrenches and screwdrivers in the data scientist's tool chest, LLMs are the newest addition, ready to be wielded to enhance data quality. A proverbial whack of that hammer can unlock actionable insights and ultimately pave the way for better customer experiences.
That said, let's drill right into the use case that I will be using as an example today.

The Use Case
The worst thing you could do when conducting a survey among students is to leave a factual field as free-form text! You can imagine some of the responses we got.
Jokes aside, one of our clients, Study Fetch, an AI-powered platform that uses course material to create personalized all-in-one study sets for students, conducted a survey among university students. After receiving a whopping 10K+ responses, their CEO and Co-Founder, Esan Durrani, stumbled upon a little hiccup. Turns out, the "major" field in the survey was a free-form text box, meaning respondents could type in whatever they pleased. Now, as data scientists, we know that's not the brightest move if you want to crunch some statistical numbers. So, the raw data from the survey ended up looking like this…
Anthropology
Chem E
Computer Science
Business and Law
Drama
cs
IMB
Oh my! Get ready to grab your Excel and embark on a sorting adventure that might take you a mere hour or, who knows, maybe even three. Only then will this data heresy be thoroughly purged.
Yet, fear not, as we have the hammer of the Large Language Model (LLM).
As a wise man once said, if a hammer is all you have, everything looks like a nail. And boy, doesn't the data cleansing job look like the most perfect nail?
We can simply ask our friendly neighborhood LLM to classify these into known majors. Specifically, OpenAI's Generative Pre-trained Transformer (GPT), the LLM that powers the popular chatbot app ChatGPT, will work for this case. GPT models use upwards of 175 billion parameters and have been trained on 2.6 billion stored web pages scraped from Common Crawl, an open dataset. Additionally, through a technique known as reinforcement learning from human feedback (RLHF), trainers can nudge and prod the model into providing more accurate and useful responses. [2]
I think for our purpose, 175 billion+ parameters should do just fine, as long as we are able to come up with the right prompt.

It's all in the Prompt
Ryan and Esan, from the AI company whose bread and butter is writing great prompts, proposed the first version of our prompt. It was a great one and did work very well using language inference [3], but there were two things that could be improved:
- It was written to work for one record
- It was written as a 'Completion' using the Da Vinci model (my bank account recoiled in fear at the mere mention of it)
It would cost us too much and that simply wasn't going to do. So, Ryan and I independently rewrote the prompt as a chat prompt using 'gpt-3.5-turbo' to perform the classification in bulk. OpenAI's prompt best practices and the course ChatGPT Prompt Engineering for Developers came in handy for me. We went through a few more iterations of ideate, implement, analyze, and reform, and we had a good working version.
Without further ado, here is the prompt after revision 2:
1. Arts and Humanities: Literature and arts.
2. Social Sciences: psychology, sociology, and political science
3. Business and Management: marketing, finance, economics, MBA
4. Life Sciences: Biology, environmental, agriculture-related
5. Engineering and Technology: Electrical, mechanical, etc.
6. Medicine Health and Allied Sciences: Nursing, medical school
7. All Other Programs: Everything else.
The above information pertains to categories of undergraduate programs.
Categorize the below data into one of the program categories above.
Desired Format: CSV of data row, program name
Data
###
Drama
IMB
Health sciences
MBA
Mechanical Engineering
###
The response from the LLM for this prompt was:
Drama, Arts and Humanities: Literature and arts
IMB, All Other Programs: Everything else
Health sciences, Medicine Health and Allied Sciences: Nursing, medical school
MBA, Business and Management: marketing, finance, economics, MBA
Mechanical Engineering, Engineering and Technology: Electrical, mechanical, etc.
This will work, sort of. But I didn't quite like the redundant text with the long program names repeating. With LLMs, text is tokens and tokens cost money. You see, my programming skills were forged in the fiery depths of the Dot Com Bust. And let me tell you, I never pass up an opportunity for some cost-saving optimizations.
So, I changed the prompt slightly in the 'Desired Format' section. I asked the model to output just the ordinal number of the survey response (e.g. 1 for Drama above) and the ordinal number of the program (e.g. 1 for Arts and Humanities). Then Ryan suggested I should request a JSON output instead of CSV to make it simpler to parse. He also recommended I add an 'example output' section, an excellent suggestion.
The final prompt is as follows (simplified for clarity):
1. Arts and Humanities: Literature and arts.
2. Social Sciences: psychology, sociology, and political science
3. Business and Management: marketing, finance, economics, MBA
4. Life Sciences: Biology, environmental, agriculture-related
5. Engineering and Technology: Electrical, mechanical, etc.
6. Medicine Health and Allied Sciences: Nursing, medical school
7. All Other Programs: Everything else.
The above information pertains to categories of undergraduate programs.
Categorize the below data into one of the program categories above.
Desired Format: JSON format of an array with response number, program number
Example output: [[1,7],[2,1],[3,5],[4,8],[5,7]]
Data
###
1. Drama
2. IMB
3. Health sciences
4. MBA
5. Mechanical Engineering
###
The full prompt we used can be viewed on the GitHub link here.
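As an aside, here is a minimal sketch in Python of how a prompt like this could be assembled programmatically from a batch of raw survey responses (the helper name build_prompt and the sample batch are just illustrative, not part of the actual pipeline):

CATEGORIES = """1. Arts and Humanities: Literature and arts.
2. Social Sciences: psychology, sociology, and political science
3. Business and Management: marketing, finance, economics, MBA
4. Life Sciences: Biology, environmental, agriculture-related
5. Engineering and Technology: Electrical, mechanical, etc.
6. Medicine Health and Allied Sciences: Nursing, medical school
7. All Other Programs: Everything else."""

def build_prompt(responses):
    # Number each survey response so the model can refer to it by its ordinal
    numbered = "\n".join(f"{i}. {r}" for i, r in enumerate(responses, start=1))
    return (
        CATEGORIES + "\n"
        "The above information pertains to categories of undergraduate programs.\n"
        "Categorize the below data into one of the program categories above.\n"
        "Desired Format: JSON format of an array with response number, program number\n"
        "Example output: [[1,7],[2,1],[3,5],[4,8],[5,7]]\n"
        "Data\n###\n" + numbered + "\n###"
    )

print(build_prompt(["Drama", "IMB", "Health sciences", "MBA", "Mechanical Engineering"]))

With 10K+ responses, you would call something like this once per batch, keeping each batch small enough to fit within the model's token limits.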
The output from the model:
[
[1, 1],
[2, 3],
[3, 6],
[4, 3],
[5, 5]
]
So, as discussed earlier, the output from the model is a mapping between the ordinal numbers of survey responses and the categories we defined. Take the first line for example: 1,1. That means 1 is the response number and 1 is the corresponding mapped program number. Survey response 1 is "Drama" and the mapped program 1 is "Arts and Humanities". This seems right! Drama in its proper #1 place, all eyes on it.
While at first glance the output may look like the output of embeddings (the numeric vectors used in clustering and dimensionality reduction), it is simply the same mapped information expressed as ordinal positions. In addition to providing some cost benefits on token usage, the numbers are easier to parse.
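In code, turning the model's output back into readable majors is a simple lookup. Here is a minimal sketch (the program list and variable names are my own, mirroring the categories in the prompt):

import json

PROGRAMS = [
    "Arts and Humanities",
    "Social Sciences",
    "Business and Management",
    "Life Sciences",
    "Engineering and Technology",
    "Medicine Health and Allied Sciences",
    "All Other Programs",
]

survey_responses = ["Drama", "IMB", "Health sciences", "MBA", "Mechanical Engineering"]
model_output = "[[1, 1], [2, 3], [3, 6], [4, 3], [5, 5]]"  # raw text returned by the model

# Each pair is (response number, program number), both 1-based
for response_num, program_num in json.loads(model_output):
    print(survey_responses[response_num - 1], "->", PROGRAMS[program_num - 1])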
We can now translate the original survey response in the file into meaningful majors, do aggregation, and gain valuable actionable insights.
But wait, I'm not going to sit in front of my computer, type each block of survey responses into the browser and calculate the mappings. Besides being mind-numbing, the error rate would simply not do.
What we need is some good old automation. Enter the API …

API to the Rescue
As you may be aware, an Application Programming Interface (API) allows our program to interact with third-party services efficiently. While many people are accomplishing impressive feats with ChatGPT, the real potential of language models lies in using the API to seamlessly integrate natural language capabilities into an application, making it imperceptible to the users, much like the incredible science and technology that goes into the phone or computer you are using to read this article.
If you don't already have it, you can request access to the API at https://openai.com/blog/openai-api [4]. Once you sign up and get your API key, the specification can be found in OpenAI's API reference, and some really helpful examples with code samples can be found at https://platform.openai.com/examples. The playground is a nice feature for testing the prompt with various settings before you put it in code [5].
We will be using the chat completion API using REST. A sample payload of the call is as follows:
{
  "model": "gpt-3.5-turbo",
  "temperature": 0,
  "n": 1,
  "messages": [
    {"role": "user", "content": "Hello, I'm a nail."}
  ]
}
Let's take a quick look at the parameters and their effects.
model
The only one open to the public at this point for chat completions is gpt-3.5-turbo. Esan had access to the GPT-4 model, which I was very jealous about. While gpt-4 is more accurate and hallucinates less [2], it is roughly 20 times more expensive, and for our needs, Mr. Turbo was quite adequate, thank you.
temperature
Next to the prompt, the temperature is one of the most important settings we can pass to the model. It can be set to a value between 0 and 2, as per the API docs. It has a significant impact [6] as it controls how much randomness is in the output, sort of like the amount of caffeine in your system before you start writing. A guide to values you can use for each application is given here [7].
For our use case, we simply want no variations. We want the engine to give us mappings as is and the same ones every single time. So, we used a value of 0.
n
This is the number of chat completion choices to generate. If we were doing this for creative writing and wanted more than one choice to select from, we could use 2 or even 3. For our case, n=1 (the default) will work well.
messages
The role can be system, user, or assistant. The system role provides instructions and sets the context. The user role represents the prompt from the end user. The assistant role carries the model's responses based on the conversation history. These roles help structure conversations and enable effective interaction between users and the AI assistant.
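For example, one way to split our task across these roles (not how the final prompt above was structured, which kept everything in a single user message) would be to put the category definitions and output format in a system message and send each batch of responses as a user message:

messages = [
    {
        "role": "system",
        "content": (
            "You categorize student survey responses into the 7 undergraduate "
            "program categories provided. Reply only with a JSON array of "
            "[response number, program number] pairs."
        ),
    },
    {
        "role": "user",
        "content": "Data\n###\n1. Drama\n2. IMB\n3. Health sciences\n###",
    },
]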
Model max tokens
This isn't a parameter we pass in the request; it is the model's limit on the combined length of the prompt and the response. A separate parameter called max_tokens can cap just the length of the response from the chat.
Firstly, a token can be thought of as a piece of a word. One token is approximately 4 characters in English. For example, the quote "The best way to predict the future is to create it", attributed to Abraham Lincoln and others, contains 11 tokens.

If you are thinking that a token is exactly a word, here is another example of 64 tokens, to show it isn't all that straightforward.

Brace yourself for a shocking revelation: every emoji you include in your message adds a hefty toll of up to 6 tokens. That's right, your beloved smileys and winks are sneaky little token thieves!
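If you would rather count tokens than guess, OpenAI's open-source tiktoken library does the counting locally. A minimal sketch:

import tiktoken  # pip install tiktoken

# Use the same encoding that gpt-3.5-turbo uses
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

for text in [
    "The best way to predict the future is to create it",
    "Mechanical Engineering",
    "Thanks! 😀",
]:
    print(len(enc.encode(text)), "tokens:", text)

This comes in handy for sizing batches so that a prompt and its expected response stay within the model's limit.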