Parsing Irregular Spreadsheet Tables in Humanitarian Datasets (with Some Help from GPT-3)

Author: Murphy
Created by DALL-E2 with prompt "A painting of 10 wood tables". There are 9 tables in the image above.

TL;DR

As part of a previous study using data from the Humanitarian Data Exchange, I had to analyze thousands of Excel files whose tables were often difficult to parse into database tables. Irregular layouts with merged cells, hierarchical columns, and annotations are difficult to anticipate with rule-based parsing when files originate from hundreds of organizations across the world. In this article, I explore using GPT-3 with zero-, one- and few-shot completion to reformat irregular (small) tables, as well as fine-tuning the model to predict table attributes which can then be used for accurate parsing.

There have been quite a few times on my travels when I've needed to review a large number of Excel files to understand what data they contain, how well it is structured, and the work required to clean it into a form where we can get to the juicy stuff like training models. For the most part this is fairly straightforward, as long as the data is regular with nice neat column headings. However, life is never that easy, and it's often the case that the tables in these files are in a less-than-perfect format to parse into neat data frames that can be uploaded into relational databases. Excel supports a lot of features such as pivot tables and cell merging, which human beings use to create a wide variety of layouts, with blank rows, random text here and there, and more!

Here is an example to illustrate …

Example of an irregular table in Excel, with blank top rows, labels and merged cells. Perfectly readable for humans, but a challenge to parse for data science. This file was sourced from the Humanitarian Data Exchange

If we read the above file directly into Pandas …

import pandas as pd

# Read the first sheet as-is, then replace NaN with empty strings for display
df = pd.read_excel(filename)
df = df.fillna("")
display(df)

We get this …

Example of Pandas dataframe after parsing a table on an Excel sheet, where there are blank rows and merged cells to indicate hierarchical columns. Example data from the Humanitarian Data Exchange

Loading this into a database would result in near-unusable data because …

  1. There is a table title in the top-left cell
  2. Column ‘Unnamed: 1' has no real heading; its title is actually the text in the first column of row 5, "What is the average size of land you own that …"
  3. Columns ‘Unnamed: 2' and ‘Unnamed: 3' are aggregate totals split into ‘N' numeric and ‘%' percentage values
  4. Most columns are hierarchical, with merged cells above unmerged cells

It's not that bad, right?

It is of course possible to provide parameters to Pandas read_excel that will convert hierarchical columns to indexes, which can then be collapsed into a single row. Alternatively, we might manipulate the sheet with openpyxl, using information from Excel itself about merged cells. However, these methods require knowledge of the table – specifically where the headings finish and the data starts, and how hierarchical columns are structured – a luxury we might not always have if processing thousands of spreadsheets. Maintaining rule-based parsing for large volumes of files can be time-consuming and brittle, requiring continued maintenance as new layouts appear on the scene.
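To make the read_excel approach concrete, here is a minimal sketch of collapsing hierarchical columns into flat names. It simulates the MultiIndex that `pd.read_excel(..., header=[...])` would return for merged-cell headers (the labels below echo the sample table, but the DataFrame is constructed by hand for illustration):

```python
import pandas as pd

# Simulate what read_excel with a list of header rows returns for merged
# hierarchical headers: a MultiIndex with one tuple per leaf column.
columns = pd.MultiIndex.from_tuples([
    ("OVERALL", "", "N"),
    ("OVERALL", "", "%"),
    ("Sub county", "Chepalungu", "N"),
    ("Sub county", "Chepalungu", "%"),
])
df = pd.DataFrame([[22, "2.8%", 4, "2.2%"]], columns=columns)

# Collapse each column tuple into a single flat name, skipping blank levels
df.columns = [" - ".join(level for level in col if level) for col in df.columns]
print(df.columns.tolist())
```

The catch, as noted above, is that you must already know which rows are headers before you can pass them to `header=` – which is exactly the knowledge we often lack.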

As it happens, I am not alone! Parsing irregular tables is a challenge being actively researched. For example, Microsoft authors have shown some great results using Convolutional Neural Networks to develop an algorithm called ‘TableSense' [1]. This technique treats Excel sheets in a similar way to images but with richer featurization as each cell can have a range of attributes and data types, as well as formatting and merging characteristics. Very cool. I hope fantastic work like this will be included in Microsoft's products soon, but until then I wanted to explore some other approaches.

It's also worth noting that my use-case is not just to identify the range in a sheet where the table is (see training data for the Microsoft paper above), but elements in the table so irregular formats can be converted to something that can be easily imported into a database. The main challenge is hierarchical columns in Excel, flattening these into a single row that captures information from overlying merged cells. Sounds simple to fix, but the challenge is: where do the headings stop and the data start? This is obvious to us humans, but it's surprising how something so simple can be quite noisy in the real world when processing sheets using code.
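One simple (and admittedly naive) heuristic for the headings-versus-data question is to scan rows until most non-blank cells look numeric. This is a hypothetical sketch, not the method used later in the article, and it will misfire on text-heavy data tables:

```python
def looks_numeric(cell):
    """Treat a cell as numeric if it parses as a float once a trailing % is stripped."""
    try:
        float(str(cell).strip().rstrip("%"))
        return True
    except ValueError:
        return False

def first_data_row(rows, threshold=0.5):
    """Return the index of the first row where most non-blank cells are numeric."""
    for i, row in enumerate(rows):
        cells = [c for c in row if str(c).strip() != ""]
        if cells and sum(looks_numeric(c) for c in cells) / len(cells) >= threshold:
            return i
    return None

# Abbreviated version of the sample table
rows = [
    ["Table 3: Number of acreage under irrigation", "", ""],
    ["", "OVERALL", ""],
    ["", "N", "%"],
    ["0 - 2 acres", 22, "2.8%"],
]
print(first_data_row(rows))  # → 3
```

Real-world sheets defeat heuristics like this often enough – numeric codes in headers, text labels in data rows – that something more robust is worth exploring.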

Given all the recent attention for generative AI and Large Language Models (LLMs), I wondered if perhaps OpenAI's GPT-3 might be up to the challenge. These models are trained on huge amounts of data extracted from the internet, which includes tables and CSV files, so they might be useful in handling some of the nuances of tables put together by us crazy humans.

Prompting GPT-3 to Tidy Up (a Small) Table

We will first try to solve our problem as zero- and few-shot tasks for GPT-3, before moving on to using fine-tuning techniques.

Zero-shot, one-shot and few-shot tasks, contrasted with traditional fine-tuning. The panels above show four methods for performing a task with a language model. From Brown et al [2]

GPT-3 is trained on text scraped from the web, so we cannot prompt it with Excel (yet!), therefore we first have to convert our sheet into a form that occurs on the web: a CSV string …

df = pd.read_excel('sample.xlsx', sheet_name='Sheet1')
df = df.fillna("")
print(df.to_csv())
,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11
0,Table 3: Number of acreage under irrigation,,,,,,,,,,,
1,,,OVERALL,,Sub county,,,,,,,
2,,,,,Chepalungu,,,,Bomet Central,,,
3,,,,,Male,,Female,,Male,,Female,
4,,,N,%,N,%,N,%,N,%,N,%
5,What is the average size of land you own that is currently under irrigation?,0 - 2 acres,22,2.8%,4,2.2%,10,3.8%,3,1.7%,5,2.9%
6,,2 - 5 acres,6,.8%,2,1.1%,2,.8%,0,0.0%,2,1.2%
7,,5 - 10 acres,1,.1%,0,0.0%,0,0.0%,0,0.0%,1,.6%
8,,More than 10 acres,0,0.0%,0,0.0%,0,0.0%,0,0.0%,0,0.0%
9,,None,760,96.3%,176,96.7%,251,95.4%,170,98.3%,163,95.3%
10,,Total,789,100.0%,182,100.0%,263,100.0%,173,100.0%,171,100.0%

Side note: I also tried with Markdown and HTML tables, but got best results for my use-case with CSV.
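With the table serialized as CSV, a zero-shot prompt can be as simple as prepending an instruction. The wording below is illustrative only, not the exact prompt used in this study; the commented-out call shows the (legacy, GPT-3-era) completions endpoint it would be sent to:

```python
def build_tidy_prompt(csv_text):
    # Hypothetical zero-shot instruction: ask for a clean CSV with one flat
    # header row, no titles, and no blank rows.
    return (
        "Convert the following irregularly formatted CSV table into a clean "
        "CSV with a single header row, merging any hierarchical column "
        "headings into one name per column:\n\n" + csv_text + "\n\nClean CSV:"
    )

# Tiny stand-in table; in practice this would be df.to_csv() from above
csv_text = ",A,B\n0,Table 3,\n1,N,%\n2,22,2.8%\n"
prompt = build_tidy_prompt(csv_text)

# The prompt would then be sent to the model, e.g. (legacy OpenAI API):
# openai.Completion.create(model="text-davinci-003", prompt=prompt, temperature=0)
```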

It's worth noting that for this analysis the tables we are dealing with are narrow, i.e. having < 100 columns. This means the first 10 rows can be represented easily in a GPT-3 prompt. This is fine for most of the Excel tables I have been analyzing for the Humanitarian Data Exchange, but might not extend to other scenarios. Also, this analysis doesn't consider cases where there are multiple tables on the same Excel sheet … that is for a later blog post.
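A quick sanity check before prompting is whether the serialized table fits the model's context window at all. The sketch below uses the rough rule of thumb of ~4 characters per token for English text (for exact counts, OpenAI's tiktoken library can be used); the 4,097-token budget reflects the GPT-3 davinci models of the time:

```python
def rough_token_estimate(text):
    # Rule-of-thumb estimate: roughly 4 characters per token for English text.
    # For exact counts, use a proper tokenizer such as tiktoken.
    return len(text) // 4

def fits_in_prompt(csv_text, context_tokens=4097, reserved_for_reply=1024):
    # Leave room in the context window for the model's reformatted output
    return rough_token_estimate(csv_text) <= context_tokens - reserved_for_reply

print(fits_in_prompt("a,b,c\n1,2,3\n"))   # a tiny table easily fits
print(fits_in_prompt("x" * 100_000))      # a huge sheet does not
```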
