How to Effortlessly Extract Receipt Information with OCR and GPT-4o mini

In this article, I will show you how to extract information from receipts, given a simple image of the receipt. First, we will utilize OCR to extract the raw text from the receipt. This text will then be sent to the GPT-4o mini model, which extracts each item and its price. My goal for this project is to develop an application that can help split a bill among friends simply by taking an image of the receipt and selecting which items belong to which person. This article will focus on the information extraction part of this goal.

Extract information from receipts using OCR and GPT-4o mini. Image by ChatGPT. OpenAI. (2024). ChatGPT (4o) [Large language model]. https://chatgpt.com/c/c567fd8c-1955-4af9-8566-0a9393e970e5

The application developed in this article can be accessed on Google Play.

Motivation

It's a hassle to go through receipts and calculate everyone's share, for example, after visiting a restaurant. I have encountered this problem numerous times and wanted a solution to make the process more efficient, which is how I came up with the idea for the BillSplitter application. The idea is that a user takes an image of a receipt, the application utilizes OCR and language models to extract each item and its corresponding price, and the user simply selects which person should pay for which item. Ultimately, the user receives an overview of how much each person owes. This article shows how you can develop the receipt processing part of this application, while the frontend part will be left for another article. This means that this article assumes you have an image of a receipt, and the goal is to extract each item from the receipt with its corresponding price into a list. Later, a front-end application can be developed on top of what we build in this article.

Table of contents

· Motivation · Table of contents · Extracting text from receipts · Information extraction using GPT-4o mini · Reviewing results (First receipt · Second receipt · Third receipt) · Understanding how to improve the pipeline · Testing out an improved solution with Textract · Conclusion

The pipeline used in this article. You start off with an image of a receipt and extract the text from it. You will then use GPT-4o mini to use the OCR output to extract each item and its corresponding price present on the receipt. Image by the author.

Extracting text from receipts

First, the text from receipts can be extracted using an OCR engine. There are countless open-source OCR engines, but this project will utilize EasyOCR because of its effectiveness and ease of use.

First, you must import the required packages:

import easyocr
import cv2
import matplotlib.pyplot as plt
import requests
import torch
import pytesseract
from PIL import Image
import json
import numpy as np

You can install these packages with pip. Note that to run EasyOCR on GPU (which is highly recommended, since it saves a lot of time), I had to explicitly install the GPU build of PyTorch with:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

I used this on Windows and CUDA 11.8, but if you use another OS or CUDA version, the correct command is on the PyTorch website.
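
Before running EasyOCR, it can be worth a quick sanity check that PyTorch actually sees your GPU:

import torch

# Quick sanity check that PyTorch can see the GPU before running EasyOCR
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of your CUDA device
else:
    print("CUDA not available - EasyOCR will fall back to CPU")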

I then took some images of receipts. You are welcome to use your own receipts, but if you don't want to, you can use my receipts on Google Drive, which are from Norwegian supermarkets. I call the folder containing images of receipts ReceiptData and load the image paths into variables with:

img_path1 = "ReceiptData/20231016_180324.jpg"
img_path2 = "ReceiptData/20231010_210904.jpg"
img_path3 = "ReceiptData/20231014_182753.jpg"
img_path4 = "ReceiptData/20230917_131726.jpg"
img_path5 = "ReceiptData/20231002_190427.jpg"

I also like to have a variable for which receipt I am using, for readability and to make it easy to switch between receipts.

PATH_TO_USE = img_path2

To load EasyOCR, you can then use the line below. Note that if you are reading text in a different language, you can change no (Norwegian) to your desired language code (for example, en for English). You can find a list of all available languages and their code names on this website. Additionally, if you are not using a GPU, you can set gpu=False.

reader = easyocr.Reader(['no'], gpu=True) # this needs to run only once to load the model into memory

I then read the image, converted it to greyscale to improve OCR performance, and ran the OCR. I also combined the OCR output into one string:

img = cv2.imread(PATH_TO_USE)
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # greyscale tends to improve OCR
result = reader.readtext(img, detail=0)  # detail=0 returns just the text

# combine all detected text fragments into a single string
result_string = ""
for ele in result:
    result_string += ele + " "

If you want to view the receipt, you can do so with:

cv2.namedWindow('img', cv2.WINDOW_NORMAL)
cv2.imshow("img", img)
cv2.waitKey(0)
cv2.destroyAllWindows()

Congrats, you have now successfully extracted text from a receipt. Note that some of the OCR output might look like nonsense; for example, the beginning of the OCR output for one receipt looks like: Al rema 10@@ Salaskvittering REMA 1ooo GausdAL, which is naturally not optimal. This happens for several reasons, but the main one is that the text on the receipt is quite small, making it difficult for the OCR to make out all the characters. Still, as you will see later in this article, the OCR is effective enough to extract items from the receipts successfully, though improving OCR quality should be a priority for future work.
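
If you want to experiment with improving OCR quality yourself, one common preprocessing direction is to upscale the greyscale image and binarize it before running the OCR. Below is a minimal sketch of this idea (I have not validated it on these receipts, so treat it as a starting point for experimentation):

# A possible preprocessing sketch (untested on these receipts): upscale the
# greyscale image and binarize it with Otsu thresholding before running OCR
img = cv2.imread(PATH_TO_USE)
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
_, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
result = reader.readtext(img, detail=0)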

Information extraction using GPT-4o mini

I have spent a lot of time trying to extract information from receipts using logic in Python, Regex, and similar approaches. The problem with these approaches, however, is that they struggle to handle the diverse output the OCR can produce. Large language models like GPT-4o mini have revolutionized this process, since they can both handle diverse input and produce a structured response. This makes large language models perfect for the task we want to achieve here, namely extracting items and prices from the OCR output of a receipt. For this article, I will be using GPT-4o mini, though many other LLMs are also viable in this situation.

First, you should log in or create an account at OpenAI. This will give you the API keys you need to access the OpenAI API. You also have to input a payment method to pay for the API requests. I recommend setting spending limits on your account as soon as possible, as one can easily start spending a lot of money. However, using GPT-4o mini on receipt data is quite cheap, and you can get a full overview of the pricing on OpenAI's website.

You should then store the API key securely. There are several ways to do this, but I simply create a constants.py file that looks like this:

OPEN_AI_API_KEY = "123123123"

You can then import the key into a separate file with the following code:

from constants import OPEN_AI_API_KEY

OPEN_AI_API_KEY = str(OPEN_AI_API_KEY)
# sanity check that the correct key was loaded
assert OPEN_AI_API_KEY.startswith("sk-") and OPEN_AI_API_KEY.endswith("123")

Remember to replace the two strings in the assert so they match your own API key; the assert ensures you loaded the correct key.
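
If you prefer not to keep a constants.py file, an alternative is to store the key in an environment variable instead (the official OpenAI client will also pick up OPENAI_API_KEY from the environment automatically if no key is passed):

import os
from openai import OpenAI

# Alternative to constants.py: read the key from an environment variable.
# The OpenAI client also falls back to OPENAI_API_KEY automatically if
# api_key is not passed explicitly.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])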

You can then create an OpenAI client to make API requests with:

from openai import OpenAI

client = OpenAI(api_key=OPEN_AI_API_KEY)

You can send API requests with the following:

MODEL = "gpt-4o-mini"

def prompt_gpt(prompt):
    return client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]
    ).choices[0].message.content

I then create a prompt with the code below. It is difficult to find an effective prompt, and while working on this, I especially struggled with GPT-4o mini providing excessively long answers. To fix this, I added two sentences: Only respond with the list, and nothing else and Sure, here is the list of items and prices. The last part is interesting, as I include the start of the response within the prompt itself, which I have read can help avoid excessively long explanations from the LLM, which are not desired in this case.

prompt = f"Given a string of text from an OCR of a receipt. Find each item and price in the receipt, and return with a list of tuples like this: [(item1, price1), (item2, price2), ...]. Only respond with the list, and nothing else. The string is: {result_string}"
prompt += " . Sure, here is the list of items and prices: " 

You can then prompt GPT-4o mini with:

response = prompt_gpt(prompt)

These are some example responses I received:

# for image 'ReceiptData/20231010_210904.jpg'
[("KylLING HotwiNGS", "57,00"), ("Iskaffe Hocca", "18,90"), ("TORTILLACHIP ZINGY", "16,90"), ("SøTPOTeT FRIES", "37,00"), ("Creamy PEANøTTSHeR", "46,00"), ("GluTEn FReE TORT", "43,90"), ("DIP TEX MEX STyLE", "40,90")]

# for image 'ReceiptData/20231016_180324.jpg'
[('RISTO HOZZA _ 2PK', 89.90), ('SUPERHEL T , GROYBRøP', 35.00), ('B#REPOSE', 25.00), ('Dr Oetker', 26.97)]

# for image 'ReceiptData/20231002_190427.jpg'
[('TøRKEDE APRIKOSER', 29.90), ('MANDLER', 10.90), ('Couscous', 22.40), ('FISKEBURGER HYS8TO', 53.90), ('AVOCADO 2PK', 40.00), ('GRøNNKÅL', 0.00), ('BROKKOLI', 0.00), ('GULROT BEGER 75OGR', 3.00)]
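
Note that the response is a plain string, not a Python list. If you want to work with the items programmatically, one way is to parse the string with ast.literal_eval. This is a sketch that assumes the model returned a well-formed Python literal, which the prompt encourages but does not guarantee:

import ast

# Parse the model's string response into a Python list of (item, price)
# tuples. This assumes the response is a well-formed Python literal,
# which is not guaranteed.
try:
    items = ast.literal_eval(response.strip())
except (ValueError, SyntaxError):
    items = []  # fall back to handling the malformed response manually

for item, price in items:
    print(f"{item}: {price}")

Also note that, as in the first example above, the prices sometimes come back as strings with comma decimal separators, so you may need to normalize them before doing any arithmetic.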

Reviewing results

After setting up all the code to extract information from the receipts, it's time to review how well the method works. I will use a qualitative approach, looking at some receipts individually to judge how well the information extraction was performed. I will review the three images I showed the output for above.

First receipt

First receipt to review. Image by the author.

For this receipt, you can see that the information extraction pipeline is able to extract all items with the correct prices. I think this is quite impressive since the image is not super clear, and there is a lot of background in the image that is not part of the receipt. Unfortunately, there are a few typos in the names of the items, though I think that is acceptable in this context since you can still easily understand which item it is.

Second receipt

Second receipt for review. Image by the author.

On this receipt, the information extraction pipeline struggles. The first two items and prices are correct, but the model gives an incorrect price for bærepose (this happened because the OCR could not pick up the correct price for the item, which confused the LLM). The last line in the items part of the receipt is a discount on the first item, which the LLM is unable to understand, and it incorrectly outputs this as a separate item. I think this is acceptable, however, since it is difficult for the model to understand that the last row is a discount on a different item rather than an item itself.

Third receipt

The third receipt is for review. Image by the author.

The pipeline performs super well on the first four items of this receipt and then, unfortunately, fails for the remaining four items. This receipt should be easier than the two preceding ones, considering the image is clearer with less background noise, but unfortunately, this is not the case in practice. I looked into the OCR output for this receipt to understand where the error was coming from. The output is clear and correct for the first four items, while it suddenly fails for the last four, showing that incorrect OCR output can cause significant problems for GPT-4o mini.

Understanding how to improve the pipeline

To improve the pipeline, we must understand its weaknesses. As discussed earlier, the problem seems to be with the OCR. To investigate this further, we will look more deeply into the OCR output. The OCR in EasyOCR consists of two steps. Step one is called text detection, which detects the areas where text is present in the image. This is done by marking areas with text using bounding boxes. Step two is called text recognition, and given a bounding box with text inside it, it will output the text present in the bounding box. Thus, the problem with the information extraction pipeline implemented in this article likely lies in one of the two steps.

First, we inspect the OCR output visually, which we do by drawing the bounding boxes the OCR finds, along with the recognized text, on the image. You can do this with the following code. First, run the OCR and have the function return bounding boxes as well, which is done by setting detail=1 instead of detail=0 as we did earlier:

result2 = reader.readtext(img, detail=1)

Then, you can print out the bounding boxes for your image with the following:

# Loop through the results and draw bounding boxes on the original image
for (bbox, text, prob) in result2:
    # cast to int, since EasyOCR can return float coordinates
    top_left = tuple(map(int, bbox[0]))
    bottom_right = tuple(map(int, bbox[2]))

    # Draw the bounding box on the original image
    cv2.rectangle(img, top_left, bottom_right, (0, 255, 0), 2)  # Green box with thickness 2

    # Optionally, put the recognized text on the original image
    cv2.putText(img, text, (top_left[0], top_left[1] - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)

# Now resize the image to a smaller size
scale_percent = 20  # percent of original size
width = int(img.shape[1] * scale_percent / 100)
height = int(img.shape[0] * scale_percent / 100)
dim = (width, height)

# Resize the image
resized_img = cv2.resize(img, dim, interpolation=cv2.INTER_AREA)

# Save or display the resized image with bounding boxes
cv2.imwrite('output_image_with_boxes.jpg', resized_img)
cv2.imshow('Resized Image with Bounding Boxes', resized_img)
cv2.waitKey(0)
cv2.destroyAllWindows()

For receipt three, the image below is returned. You can see that the model struggles to read the prices for the last four items. This causes the text recognition step to perform poorly, which in turn leads to GPT-4o mini struggling to give accurate responses for the items and prices on the receipt.

Printing out receipt three after the EasyOCR detection and recognition steps. You can see that the model struggles with the prices for the last four items, resulting in the errors we saw above for this receipt. Image by the author.

The same issues can be seen if you repeat the steps for receipt two.

There are two main approaches to solving this problem. The first is to take clearer pictures of the receipt, making it easier for the OCR to read the text. However, I think the image of receipt three is already quite clear, and the OCR should be able to read it. The other main approach is to improve the OCR, either by using a different OCR engine (for example, PaddleOCR, Tesseract, or a paid OCR service like AWS Textract) or by fine-tuning the OCR, as I have shown how to do in my article on fine-tuning the text recognition part of EasyOCR. Note that while working on this project, I tried both PaddleOCR and Tesseract, which both performed worse than EasyOCR.
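
For reference, since pytesseract is already among our imports, this is roughly how you could get Tesseract's output for the same image to compare the two engines. This assumes the Tesseract binary and the Norwegian traineddata are installed; nor is Tesseract's language code for Norwegian:

import pytesseract
from PIL import Image

# Rough comparison with Tesseract (requires the Tesseract binary and the
# Norwegian traineddata to be installed; "nor" is Tesseract's language code)
tesseract_string = pytesseract.image_to_string(Image.open(PATH_TO_USE), lang="nor")
print(tesseract_string)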

Finally, I also tested the Amazon Textract option to see its effectiveness. Below is the result for receipt two, where you can see that Amazon Textract essentially perfectly locates all the text, making it a very effective option for extracting text from receipts. In the next section, I will implement AWS Textract into the pipeline to see how well the application can work.

This image shows how AWS Textract can read the receipt. As you can see, the bounding boxes almost perfectly encapsulate the words, showing the effectiveness of AWS Textract. Image by the author.

Testing out an improved solution with Textract

To see how good the information extraction from receipts can get, I will use AWS Textract for my OCR instead of EasyOCR. Usually, I would prefer to run the OCR locally, since this gives me more control over the process and because part of the fun with AI is working on the models yourself and not just calling APIs. However, I want to see how well a paid OCR API service works in this case. Using AWS Textract requires setting up an AWS account to obtain access keys for calling the Textract API. Note that the initial setup of AWS can be a bit tedious, mostly for security reasons, but I assure you that setting it up right is well worth your time. You learn how to set up AWS, a popular cloud provider, and you ensure the security of your keys, which is an important practice to maintain, especially if you develop an application with a lot of users. I will not give a tutorial on setting up your account here, but AWS has written some high-quality documentation for this process, and there are also a plethora of other articles on the topic out there.

The Textract service is also quite cheap, with the prices at the time of this writing being 1.5 USD/1000 pages for the first million pages per month and 0.6 USD/1000 pages for over one million pages per month. You also get 1000 free pages per month with the AWS free tier if you use the detect document text API.

After you have set up access credentials, you can use the Textract API with the following code. First, some imports. Note that I store my credentials in a separate file called constants.py.

import boto3
from io import BytesIO
from constants import AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION
import os

Then, you have to create a client to call the API with.

def get_aws_textract_client():
    return boto3.client('textract',
                        aws_access_key_id=AWS_ACCESS_KEY_ID,
                        aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
                        region_name=AWS_REGION)

I then use the following functions to call the API.

def get_textract_text_from_image(client, image_path):
    assert os.path.exists(image_path), f"Image file not found: {image_path}"
    with open(image_path, 'rb') as document:
        img = bytearray(document.read())

    # Call Amazon Textract
    response = client.detect_document_text(
        Document={'Bytes': img}
    )
    return response

def extract_text_from_response(response):
    result_string = ""
    for block in response["Blocks"]:
        # only use LINE blocks; adding WORD blocks as well would duplicate
        # every word, since the LINE blocks already contain them
        if block["BlockType"] == "LINE":
            result_string += block["Text"] + " "
    return result_string

I then create the Textract client and call the two functions with the following lines:

client = get_aws_textract_client()
response = get_textract_text_from_image(client, PATH_TO_USE)
result_string = extract_text_from_response(response)
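
If you want to reproduce a bounding-box visualization like the one shown earlier, note that Textract returns coordinates as ratios of the image dimensions rather than pixels. Below is a sketch along the same lines as the EasyOCR visualization above:

import cv2

# Draw Textract's LINE bounding boxes on the receipt. Textract returns
# coordinates as ratios of the image dimensions, so scale them to pixels.
img = cv2.imread(PATH_TO_USE)
h, w = img.shape[:2]
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        box = block["Geometry"]["BoundingBox"]
        x, y = int(box["Left"] * w), int(box["Top"] * h)
        bw, bh = int(box["Width"] * w), int(box["Height"] * h)
        cv2.rectangle(img, (x, y), (x + bw, y + bh), (0, 255, 0), 2)
cv2.imwrite("textract_boxes.jpg", img)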

I then run this for receipt three and prompt GPT-4o mini with the extracted text, which gives the following result:

[('TORKEDE APRIKOSER', 29.90), ('MANDLER', 10.90), ('COUSCOUS', 22.40), ('FISKEBURGER HYS&TO', 53.90), ('AVOCADO 2PK 320G', 34.90), ('GRONNKAL 150G', 24.90), ('BROKKOLI', 24.90), ('GULROT BEGER 750GR', 24.90)]

As you can see, AWS Textract and GPT-4o mini extract all the items from the receipt with the correct prices, except for an incorrect price for the last item. I also tried this for receipt two, which gave the response:

[('RISTO. MOZZA. 2PK 15%', 89.90), ('SUPERHELT GROVBROD 15%', 35.00), ('BAREPOSE 80% RESIR 25%', 4.25), ('30% Dr. Oetker', -26.97)]

In this case, AWS Textract and GPT-4o mini extract all the items and prices perfectly. Note that GPT-4o mini returns the last item (the discount) with a negative price, which I think is acceptable and should be dealt with in the frontend of an application.
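
As a sketch of how a frontend might deal with this, one simple option is to separate ordinary items from discount rows before presenting them to the user. The helper below is hypothetical and not part of the pipeline above:

# Hypothetical frontend helper: separate ordinary items from discount
# rows (negative prices) so they can be displayed and assigned separately.
def split_items_and_discounts(items):
    purchases = [(name, price) for name, price in items if price >= 0]
    discounts = [(name, price) for name, price in items if price < 0]
    return purchases, discounts

purchases, discounts = split_items_and_discounts(
    [('RISTO. MOZZA. 2PK 15%', 89.90), ('SUPERHELT GROVBROD 15%', 35.00),
     ('BAREPOSE 80% RESIR 25%', 4.25), ('30% Dr. Oetker', -26.97)])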

Conclusion

In this article, I have shown you how to develop an information extraction pipeline to retrieve items and prices from receipts. We started off by implementing EasyOCR to extract the text from the receipts and then used GPT-4o mini to extract the items and prices, given the OCR output. We then reviewed the results on three separate receipts. The review showed that the pipeline performs well for some items, extracting the correct item and price, though with some typos in the item names. However, the pipeline fails completely for other items, which can mostly be attributed to OCR errors. In addition to EasyOCR, I also tested Tesseract OCR and PaddleOCR, neither of which provided better results for the three receipts in this article. AWS Textract was brought in to deal with the OCR errors and provided much better results than EasyOCR, Tesseract, and PaddleOCR. Using a combination of AWS Textract and GPT-4o mini, we were able to extract the items and prices from the receipts quite accurately.

Tags: GPT, Hands-On Tutorials, Information Extraction, LLM, OCR
