Using GPT-3.5-Turbo and GPT-4 to Apply Text-defined Data Quality Checks on Humanitarian Datasets
Using GPT-3.5-Turbo and GPT-4 for Predicting Humanitarian Data Categories

TL;DR
In this article, I explore using GPT-3.5-Turbo and GPT-4 to categorize datasets without the need for labeled data or model training, by prompting the model with data excerpts and category definitions. Using a small sample of categorized 'Data Grid' datasets found on the amazing Humanitarian Data Exchange (HDX), zero-shot prompting of GPT-4 resulted in 96% accuracy when predicting category and 89% accuracy when predicting both category and sub-category. GPT-4 outperformed GPT-3.5-Turbo for the same prompts, with 96% accuracy versus 66% for category. Especially useful was that the model could provide reasoning for its predictions, which helped identify improvements to the process. This is just a quick analysis involving a small number of records due to cost limitations, but it shows some promise for using Large Language Models for data quality checks and summarization. Limitations exist due to the maximum number of tokens allowed in prompts, which restricts how much data can be included in excerpts, as well as performance and cost challenges – especially if you're a small non-profit! – at this early stage of commercial generative AI.
The Humanitarian Data Exchange (HDX) platform has a great feature called the HDX Data Grid, which provides an overview of high-quality data coverage in six key crisis categories by country; see here for an example for Chad. The datasets that make it into the grid undergo a series of rigorous tests by the HDX team to assess coverage and quality, the first of which is to check whether the dataset is in an approved category.
I wondered whether Large Language Models (LLMs) might be an efficient way to apply data quality and classification rules in situations where there is no labeled training data. It would also be convenient to provide the rules in human-readable text that non-technical teams could easily maintain, and to use those rules directly, eliminating the need for feature engineering and model management. A minimal sketch of what such a zero-shot prompt might look like is shown below.
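To make that idea concrete, here is a minimal sketch of the kind of zero-shot prompt described above, using the pre-1.0 openai Python package. The category definitions, the dataset excerpt, and the API key placeholder are all illustrative assumptions for this sketch, not the actual HDX categories, rules, or prompts used in the analysis.

```python
# Minimal sketch: zero-shot dataset categorization with text-defined rules.
# No labeled data or fine-tuning, just category definitions plus a data excerpt in the prompt.
# Assumes the pre-1.0 openai package; category text and excerpt below are illustrative only.
import openai

openai.api_key = "YOUR_OPENAI_API_KEY"  # hypothetical placeholder

# Illustrative, human-readable category definitions (not the actual HDX definitions)
CATEGORY_DEFINITIONS = """
Affected People: data about refugees, returnees, internally displaced persons, or people in need.
Food Security & Nutrition: data about food prices, food security assessments, or malnutrition rates.
Geography & Infrastructure: data about administrative boundaries, roads, or health facilities.
"""

def predict_category(data_excerpt: str, model: str = "gpt-4") -> str:
    """Ask the model for a category and sub-category for a dataset excerpt,
    plus a short justification so wrong predictions are easier to diagnose."""
    prompt = (
        "Here are dataset category definitions:\n"
        f"{CATEGORY_DEFINITIONS}\n"
        "Given the following excerpt from a dataset, reply with the most appropriate "
        "category and sub-category, followed by a short justification.\n\n"
        f"Dataset excerpt:\n{data_excerpt}"
    )
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep predictions deterministic for evaluation
    )
    return response["choices"][0]["message"]["content"]

# Illustrative usage with a made-up excerpt of tabular data
print(predict_category("admin1_name, admin2_name, population, idp_count\nLac, Bol, 451234, 12000"))
```

The same prompt structure can be run against "gpt-3.5-turbo" simply by changing the model parameter, which is how the two models were compared for this analysis.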
Oh, and I also recently got early access to GPT-4 and wanted to take it for a bit of a spin!