How to Leverage Pre-Trained Transformer Models for Custom Text Categorisation?

Ok, let's get straight to the point! You have some custom data and now you want to categorise it into your own custom classes. In this article, I will show you 2 approaches to achieve this objective. Both of them utilise pre-trained, state-of-the-art transformer-based models.
Please note that the goal of this article is to share these approaches and show you how to use them. It is not a complete Data Science tutorial with best practices; unfortunately, that is outside the scope of this article.
All the code from this article can be found in this GitHub repo.
1: Zero-Shot Classification
Overview
Zero-shot classification is a technique that allows you to classify text into categories without training a specific model for that task. Instead, it uses pre-trained models that have been trained on a large amount of data to perform this classification. The models are typically trained on a variety of tasks, including language modelling, text completion, and text entailment, among others.

To perform zero-shot classification, you simply need to provide the pre-trained model with some text and a list of possible categories.
The model will then use its understanding of language and its pre-existing knowledge to classify the text into one of the provided categories. This approach is particularly useful when you have limited data available for a specific classification task, as it allows you to leverage the pre-existing knowledge of the model.
Since it does the classification without any training on that particular task, it's known as zero-shot.
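To make the entailment part concrete, here is a minimal sketch of how an NLI model can be turned into a zero-shot classifier. It mirrors what the Hugging Face pipeline used in the next section does internally; the hypothesis template ("This example is {label}.") matches the library's default, but treat the scoring below as illustrative rather than the exact implementation.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-large-mnli")

text = "Tesco Semi Skimmed Milk 1.13L/2 Pints ...... £1.30"
labels = ["groceries", "utility", "electronics", "subscriptions"]

entail_logits = []
for label in labels:
    # Premise = the input text, hypothesis = a sentence built from the label
    inputs = tokenizer(text, f"This example is {label}.", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]  # [contradiction, neutral, entailment]
    entail_logits.append(logits[2])  # keep only the entailment logit

# Softmax across labels turns the entailment logits into per-label probabilities
probs = torch.stack(entail_logits).softmax(dim=0)
for label, p in sorted(zip(labels, probs.tolist()), key=lambda x: -x[1]):
    print(f"{label}: {p:.4f}")
In other words, one classification becomes one entailment check per candidate label, which is why no task-specific training is needed.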
Implementation
All we need to implement this is the Hugging Face transformers library, which can be installed with pip install transformers. We will use the pre-trained Facebook BART (Bidirectional and Auto-Regressive Transformers) model for this task.
Side Note: on first use, it will take some time to download the model.
The output is a dictionary with 3 keys:
- sequence: the input text that was classified by the pipeline.
- labels: the list of candidate (category) labels provided to the pipeline, ordered by their probability scores.
- scores: the probability scores assigned to each candidate label, i.e. how likely the model thinks the input text belongs to that label.
from transformers import pipeline

# Load a zero-shot classification pipeline backed by the BART MNLI model
pipe = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

pipe(
    "Tesco Semi Skimmed Milk 1.13L/2 Pints ...... £1.30",
    candidate_labels=["groceries", "utility", "electronics", "subscriptions"],
)
# output
>> {'sequence': 'Tesco Semi Skimmed Milk 1.13L/2 Pints ...... £1.30',
    'labels': ['groceries', 'utility', 'subscriptions', 'electronics'],
    'scores': [0.9199661612510681,
               0.05123506113886833,
               0.022794339805841446,
               0.0060044946148991585]}
As you can see, without any training on our specific labels, the model has correctly classified the given text into the "groceries" category. Because the model was pre-trained on a large corpus of text, it understands the language well enough to draw inferences: it read the text and identified the most suitable category from the list of candidate labels.
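It scales to batches too: the pipeline also accepts a list of texts, so a whole statement's worth of transactions can be categorised in one call. Here is a quick sketch, with made-up transaction strings:
texts = [
    "Netflix.com subscription renewal £10.99",
    "British Gas monthly bill £85.00",
    "Currys - USB-C charging cable £12.50",
]
results = pipe(texts, candidate_labels=["groceries", "utility", "electronics", "subscriptions"])
for result in results:
    # labels come back sorted by score, so the first one is the model's prediction
    print(f"{result['sequence']!r} -> {result['labels'][0]} ({result['scores'][0]:.2f})")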
Simply put, it's brilliant!!