The Ultimate Preprocessing Pipeline for Your NLP Models

Author: Murphy  |  2025-03-23
Photo by Cyrus Crossan on Unsplash

If you have worked on a text summarization project before, you will have noticed how hard it is to get the results you expect. You have a notion of how the algorithm should work and which sentences it should highlight in the summaries, but more often than not it returns results that are not quite accurate. Keyword extraction is even more interesting: algorithms ranging from topic modeling to vector embeddings all perform well in general, yet given a paragraph as input their results are again not quite accurate, because the most frequently occurring word is not always the most important word in the paragraph.
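To make the "frequent is not important" point concrete, here is a small, self-contained sketch. The mini-corpus and the toy TF-IDF scorer are illustrative assumptions, not part of the article's code: raw counts crown a stopword, while even a simple TF-IDF downweights words that appear in every document.

```python
from collections import Counter
import math

# Hypothetical mini-corpus of three short "documents" (illustrative only)
docs = [
    "the crowd loved the final goal in the intense match",
    "the referee paused the match after the crowd threw flares",
    "a stunning goal decided the final minutes of the match",
]

# Raw term frequency in the first document: "the" dominates
tf = Counter(docs[0].split())
most_frequent = tf.most_common(1)[0][0]

# A toy TF-IDF score: words present in every document get weight 0
def tfidf(term, doc, corpus):
    term_freq = doc.split().count(term)
    doc_freq = sum(term in d.split() for d in corpus)
    return term_freq * math.log(len(corpus) / doc_freq)

scores = {w: tfidf(w, docs[0], docs) for w in tf}
top_tfidf = max(scores, key=scores.get)

print(most_frequent)  # "the": frequent, but uninformative
print(top_tfidf)      # a word distinctive to this document instead
```

The exact winner depends on the corpus, but the pattern holds: "the" scores zero because it occurs in all three documents, which is precisely the behavior a good keyword extractor needs on top of raw frequency.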

Preprocessing and data-cleaning requirements vary widely depending on the use case you are trying to solve. I will attempt to build a generalized pipeline that should work well for most NLP models, but you will always need to tune the steps to get the best results for your use case. In this story, I will focus on NLP models for topic modeling, keyword extraction, and text summarization.

Preprocessing Pipeline | Image by Author

The image above outlines the process we will follow to build the preprocessing NLP pipeline. The four steps are explained with code below, and a Jupyter notebook that implements the complete pipeline is also attached. The idea behind this pipeline is to highlight the steps that enhance the performance of machine learning algorithms applied to text data. It sits between the input data and model training.
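As a structural sketch, a pipeline like this is just a chain of functions where each stage's output feeds the next. The four stage names below are assumptions for illustration (only the cleaning step is named explicitly in this article); substitute the stages from the figure for your own use case.

```python
# Minimal pipeline skeleton. Stage names and bodies are illustrative
# assumptions, not the article's implementation.
def clean(text: str) -> str:
    return text.lower().strip()

def tokenize(text: str) -> list:
    return text.split()

def remove_stopwords(tokens: list) -> list:
    stopwords = {"a", "an", "the", "is", "of"}  # toy stopword list
    return [t for t in tokens if t not in stopwords]

def normalize(tokens: list) -> list:
    # placeholder for real stemming/lemmatization; naive plural
    # stripping is for demonstration only
    return [t.rstrip("s") for t in tokens]

def preprocess(text: str) -> list:
    # chain the stages: each step consumes the previous step's output
    return normalize(remove_stopwords(tokenize(clean(text))))

print(preprocess("The Referees of the matches"))
```

Keeping each stage as its own small function makes it easy to reorder, drop, or swap steps as your use case demands, which is exactly the kind of tuning described above.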

1. Cleaning the text

The first step in the pipeline is cleaning the input text data, which can involve several operations depending on the model you are building and the results you want. Machine learning algorithms (indeed, virtually all computer instructions) operate on numbers, which is why building a model for text data is challenging: you are essentially asking the computer to learn from something it cannot read directly, and that requires extra work.

The function below is the first stage of our pipeline and performs cleaning on the text data. It consists of numerous operations, each explained in the comments of the code.

<script src="https://gist.github.com/rjrahul24/219eb624f9003c2c235509528255ddfd.js"></script>
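If the embedded gist does not render, here is a rough sketch of the kinds of operations such a cleaning function typically performs. This is an assumption about the gist's contents, built from the article's own example (HTML tags, emoji) using only the standard library; the author's actual implementation may differ.

```python
import re

def clean_text(text: str) -> str:
    """Sketch of common text-cleaning steps (assumed, stdlib-only)."""
    text = re.sub(r"<[^>]+>", " ", text)        # strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # strip URLs
    text = re.sub(r"[^\x00-\x7F]+", " ", text)  # drop emoji / non-ASCII
    text = text.lower()                         # normalize case
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # drop punctuation
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text

print(clean_text("Great <b>GOAL</b>!! 🎉 see https://example.com"))
# -> "great goal see"
```

The order matters: tags and URLs are removed before the punctuation pass, since that pass would otherwise mangle `<`, `>`, and `://` into spaces and leave fragments of the tag and URL text behind.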

To see the performance of this function, below is an input to the function and the output that it generates.

input_text = "This is an example from a key football match tweet text with \n
a HTML tag, an emoji

Tags: Clean Code Data Science Machine Learning NLP Workflow
