Categorize Free-Text Bank Transaction Descriptions Using BERT

The Situation
I purchased a property with a mortgage towards the end of calendar year 2022. Given the increase in financial commitments, I wanted to keep tabs on my expenses. It had never occurred to me until that point that I actually had no idea where I was spending the most. Figuring this out seemed like a good starting point for my own expense management.
Naturally I turned to my bank transaction data, which I downloaded from the online banking portal in .csv format. A snippet covering the last few days of 2022 is provided below.

Based on the snippet above, it seems I spent proportionally more on food (highlighted in green). More importantly, the transaction descriptions are free text. Is there a way to automatically classify them into a number of pre-defined expense categories (e.g. food, grocery shopping, utilities)?
There is at least one way, using a pre-trained language model like BERT, and this article offers a tutorial on how!
A 2023 Introduction to BERT
Whilst ChatGPT, a state-of-the-art text generation model, is attracting a lot of attention at this time, it is generally not considered a general-purpose model like BERT, which can be used across multiple Natural Language Understanding tasks. Some examples of these tasks are grammar detection, sentiment classification, text similarity and question answering.
BERT was developed and released by Google in 2018. It was pre-trained on text passages from Wikipedia and BookCorpus (to ensure the training data are grammatically sound).
The BERT model I'll be using for this tutorial is available on Hugging Face through the sentence_transformers library, a Python framework for creating sentence, text and image embeddings.
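To make the idea concrete before diving into the expense classifier, here is a minimal sketch of what the library does, using the same paraphrase-mpnet-base-v2 model adopted later in this tutorial (the example sentences below are made up):
from sentence_transformers import SentenceTransformer, util

# Minimal illustration: semantically similar texts land close together
# in embedding space, which we can measure with cosine similarity
model = SentenceTransformer('paraphrase-mpnet-base-v2')
sentences = [
    "payment to a pizza restaurant",   # made-up example texts
    "dinner at an italian cafe",
    "monthly electricity bill",
]
embeddings = model.encode(sentences)
print(util.cos_sim(embeddings, embeddings))
# The first two sentences should score noticeably more similar to each
# other than either does to the third.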
Steps for Building the Expense Classifier
How do I ultimately convert the free-text transaction descriptions into an expense category? There are a couple of strategies I can think of. In this tutorial, I'll be providing a step-by-step guide for building the Expense Classifier based on (cosine) similarity of word embeddings. The steps are outlined below:
- Manually label a reasonable number of transaction descriptions with an expense category (e.g. food, entertainment). This creates a set of labelled training data.
- Encode the individual transaction descriptions in the training data as word embeddings using BERT (i.e. convert each text into a numerical vector). Steps 1 and 2 collectively ensure that each training transaction has both an expense category and a word embedding vector.
- Repeat Step 2 for new transaction descriptions (i.e. convert unseen texts into numerical vectors).
- Pair each word embedding from Step 3 with the most similar word embedding from the training data, and assign the same expense category.
Python Implementation
This section sets out the Python code for loading the required packages and for implementing the steps outlined above (apart from Step 1, which is a manual labelling step).
Step 0: Import the required libraries
#for dataframe manipulation
import numpy as np
import pandas as pd
#regular expression toolkit
import re
#NLP toolkits
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
#for plotting expense categories later
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns
import matplotlib
import matplotlib.ticker as ticker # for formatting major units on x-y axis
#for downloading BERT
!pip install sentence_transformers
from sentence_transformers import SentenceTransformer
#for finding most similar text vectors
from sklearn.metrics.pairwise import cosine_similarity
Step 1: Label training data
I manually labelled 200 transaction descriptions with an expense category. For instance, the transaction descriptions in Image 1 were assigned an expense category as shown in the image below. I have also assigned categories such as utilities (i.e. for electricity and gas), car and gift to other transactions in the training data.

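For reference, the rest of this tutorial assumes the labelled training data sits in a dataframe called df_transaction_description, with a 'Description' column holding the raw transaction text and a 'Class' column holding the manually assigned category (these column names are used in the later steps). A minimal sketch of loading it, assuming a hypothetical labelled_transactions.csv file, might look like:
# Load the manually labelled training data (file name is hypothetical)
# 'Description' = raw transaction text, 'Class' = manually assigned category
df_transaction_description = pd.read_csv('labelled_transactions.csv')
print(df_transaction_description[['Description', 'Class']].head())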
Step 2: Create word embeddings for training data using BERT
We start by defining a function for cleaning the text data. This includes lower-casing words and removing special characters and dates (which are not useful in determining the expense category).
Stemming, lemmatization and stop-word removal, which are common practices in an NLP data cleaning pipeline, are generally not required when using a BERT model thanks to its subword (WordPiece) tokenisation and attention mechanisms.
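If you're curious about the subword behaviour, you can peek at the underlying tokenizer via the transformers library (installed as a dependency of sentence_transformers); this is purely illustrative and the exact subword pieces depend on the model's vocabulary:
# Inspect how an out-of-vocabulary merchant name is split into subword pieces
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-mpnet-base-v2')
print(tokenizer.tokenize('woolworths marketplace pty ltd'))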
###############################################
### Define a function for NLP data cleaning ###
###############################################
def clean_text_BERT(text):
    # Convert words to lower case
    text = text.lower()
    # Remove special characters and numbers. This also removes the dates,
    # which are not important in classifying expenses
    text = re.sub(r'[^\w\s]|https?://\S+|www\.\S+|https?:/\S+|[^\x00-\x7F]+|\d+', '', str(text).strip())
    # Tokenise and re-join into a cleaned string
    text_list = word_tokenize(text)
    result = ' '.join(text_list)
    return result
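As a quick sanity check, here's how the function behaves on a made-up transaction description (the raw text below is hypothetical, not taken from my statement):
# Hypothetical raw description containing punctuation, a date and card digits
example = 'WOOLWORTHS/1234 SYDNEY NSW 28/12/2022 CARD 4321'
print(clean_text_BERT(example))
# Expected output along the lines of: 'woolworths sydney nsw card'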
We then apply the function to the transaction descriptions, loaded as text_raw from the dataframe shown in Image 1 (df_transaction_description).
text_raw = df_transaction_description['Description']
text_BERT = text_raw.apply(lambda x: clean_text_BERT(x))
The snippet below shows an example of a particular transaction before and after data cleaning was applied.

We then run the cleaned texts through BERT. I've selected the 'paraphrase-mpnet-base-v2' model, which is well suited to modelling sentence similarity. Per its documentation on Hugging Face, it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.
######################################
### Download pre-trained BERT model ###
######################################
# This may take some time to download and run
# depending on the size of the input
bert_input = text_BERT.tolist()
model = SentenceTransformer('paraphrase-mpnet-base-v2')
embeddings = model.encode(bert_input, show_progress_bar = True)
embedding_BERT = np.array(embeddings)
A snippet of the word embeddings for the first few transactions is provided below:

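If you'd like to inspect the embeddings yourself, each transaction should now be a 768-dimensional vector; the dataframe below (df_embedding_bert is just a name introduced here for inspection) reproduces that kind of view:
# Each transaction description is now a 768-dimensional vector
print(embedding_BERT.shape)
# Optional: view the first few embeddings as a dataframe
df_embedding_bert = pd.DataFrame(embedding_BERT)
print(df_embedding_bert.head())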
Step 3: Create word embeddings for unseen data
I've selected 20 transactions which were not part of the training data (randomly selected for the purpose of this tutorial). These are shown in the image below.

The transaction descriptions above are loaded as text_test_raw. Similar to Step 2, these are run through BERT for embedding.
# Load texts
text_test_raw = df_transaction_description_test['Test']
# Apply data cleaning function as for training data
text_test_BERT = text_test_raw.apply(lambda x: clean_text_BERT(x))
# Apply BERT embedding
bert_input_test = text_test_BERT.tolist()
# The model was already loaded in Step 2, so it does not need to be re-instantiated
#model = SentenceTransformer('paraphrase-mpnet-base-v2')
embeddings_test = model.encode(bert_input_test, show_progress_bar = True)
embedding_BERT_test = np.array(embeddings_test)
df_embedding_bert_test = pd.DataFrame(embeddings_test)
Step 4: Pair unseen data with most similar training data
# Compute cosine similarity between each unseen embedding and
# every word embedding in the training data
similarity_new_data = cosine_similarity(embedding_BERT_test, embedding_BERT)
similarity_df = pd.DataFrame(similarity_new_data)
# Returns index for most similar embedding
# See first column of the output dataframe below
index_similarity = similarity_df.idxmax(axis = 1)
# Return dataframe for most similar embedding/transactions in training dataframe
data_inspect = df_transaction_description.iloc[index_similarity, :].reset_index(drop = True)
unseen_verbatim = text_test_raw
matched_verbatim = data_inspect['Description']
annotation = data_inspect['Class']
d_output = {
    'unseen_transaction': unseen_verbatim,
    'matched_transaction': matched_verbatim,
    'matched_class': annotation
}
# Collate the results into a dataframe for inspection
df_output = pd.DataFrame(d_output)
The df_output dataframe shows that the unseen data have been assigned fairly reasonable expense categories.

Now whenever new expenses come through, simply feed them to the model!
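To make that convenient, the whole pipeline can be wrapped into a small helper function. The sketch below assumes that model, embedding_BERT and df_transaction_description from the earlier steps are still in memory, and the example descriptions at the end are made up:
def classify_expenses(new_descriptions):
    # Clean and embed the new transaction descriptions
    cleaned = [clean_text_BERT(text) for text in new_descriptions]
    new_embeddings = np.array(model.encode(cleaned))
    # Find the most similar training transaction for each new description
    similarity = cosine_similarity(new_embeddings, embedding_BERT)
    best_match = similarity.argmax(axis=1)
    matched = df_transaction_description.iloc[best_match].reset_index(drop=True)
    # Return the assigned categories alongside the matched training transactions
    return pd.DataFrame({
        'unseen_transaction': new_descriptions,
        'matched_transaction': matched['Description'].values,
        'matched_class': matched['Class'].values
    })

# Example usage on made-up descriptions
print(classify_expenses(['UBER EATS SYDNEY', 'AGL ELECTRICITY DIRECT DEBIT']))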
Bonus Step: Plotting expenses by category
I have actually applied the steps above to all my expenses in calendar year 2022. The plot below shows the resultant expense dollar amounts by the assigned category.

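For completeness, a plot along these lines can be produced with seaborn. The sketch below assumes a hypothetical dataframe df_expenses_2022 holding one row per 2022 transaction, with an 'Amount' column and the 'matched_class' assigned by the classifier:
# Aggregate total spend per assigned expense category (df_expenses_2022 is hypothetical)
totals = (
    df_expenses_2022
    .groupby('matched_class')['Amount']
    .sum()
    .sort_values(ascending=False)
    .reset_index()
)
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(data=totals, x='Amount', y='matched_class', ax=ax)
ax.xaxis.set_major_formatter(ticker.StrMethodFormatter('${x:,.0f}'))
ax.set_xlabel('Total spend in 2022')
ax.set_ylabel('Expense category')
plt.tight_layout()
plt.show()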
Key observations were:
- I spent the most on Food in 2022, followed by Mortgage repayments and Utility bills.
- Although Credit Card repayment was actually the highest dollar amount, I assume that the underlying credit card spending is distributed across the other expense categories in the same proportions. The same assumption applies to the PayPal category.
- Based on the data, I probably want to cut back on Food in favour of Groceries (i.e. start cooking at home as opposed to dining out) in 2023.
- My spending on Beauty products was probably driven by instances where I went shopping with the wife…
In addition, it's super easy to return the transactions with the highest spending within a particular category. For instance, my highest-spending Food transactions in 2022 are shown in the screen print below. I'm happy with the results, as some of these restaurants weren't present in the training data. Despite this, BERT was still able to allocate them to the Food category.

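The query behind a view like this is a one-liner in pandas, again using the hypothetical df_expenses_2022 dataframe from the plotting step (and assuming it retains the original 'Description' column):
# Top 10 Food transactions by dollar amount
top_food = (
    df_expenses_2022[df_expenses_2022['matched_class'] == 'Food']
    .nlargest(10, 'Amount')[['Description', 'Amount']]
)
print(top_food)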
Concluding Thoughts
This article provides a comprehensive tutorial on building an expense tracking tool. All I've really done is translate the free-text transaction descriptions into a language the machine understands using BERT, and let the machine do the hard yards!
An alternative approach is to replace Step 4 of this tutorial by passing the same word embeddings through a classification model, something for readers to experiment with further; a rough sketch is provided below.
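As a sketch of that alternative, the training embeddings and labels from Step 2 could be fed to any off-the-shelf classifier; logistic regression from scikit-learn is used here purely as an illustration:
from sklearn.linear_model import LogisticRegression

# Train a classifier on the BERT embeddings and manual labels from Step 2
clf = LogisticRegression(max_iter=1000)
clf.fit(embedding_BERT, df_transaction_description['Class'])

# Predict categories for the unseen transactions embedded in Step 3
predicted_classes = clf.predict(embedding_BERT_test)
print(predicted_classes)
Whether this outperforms the nearest-neighbour matching in Step 4 will depend largely on how much labelled training data you have.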