The Ultimate Preprocessing Pipeline for Your NLP Models

Author: Murphy  |  2025-03-23
Photo by Cyrus Crossan on Unsplash

If you have worked on a text summarization project before, you will have noticed how hard it is to get the results you expect. You have a notion of how the algorithm should work and which sentences it should highlight in the summaries, but more often than not it returns results that are not quite accurate. Keyword extraction is even more interesting: algorithms ranging from topic modeling to vector embeddings all perform well in general, yet given a paragraph as input their results are again not quite accurate, because the most frequently occurring word is not always the most important word in the paragraph.
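To make the "frequent is not important" point concrete, here is a small, self-contained sketch. The mini-corpus and the toy TF-IDF scorer are illustrative assumptions, not part of the article's code: raw counts crown a stopword, while even a simple TF-IDF downweights words that appear in every document.

```python
from collections import Counter
import math

# Hypothetical mini-corpus of three short "documents" (illustrative only)
docs = [
    "the crowd loved the final goal in the intense match",
    "the referee paused the match after the crowd threw flares",
    "a stunning goal decided the final minutes of the match",
]

# Raw term frequency in the first document: "the" dominates
tf = Counter(docs[0].split())
most_frequent = tf.most_common(1)[0][0]

# A toy TF-IDF score: words present in every document get weight 0
def tfidf(term, doc, corpus):
    term_freq = doc.split().count(term)
    doc_freq = sum(term in d.split() for d in corpus)
    return term_freq * math.log(len(corpus) / doc_freq)

scores = {w: tfidf(w, docs[0], docs) for w in tf}
top_tfidf = max(scores, key=scores.get)

print(most_frequent)  # "the": frequent, but uninformative
print(top_tfidf)      # a word distinctive to this document instead
```

The exact winner depends on the corpus, but the pattern holds: "the" scores zero because it occurs in all three documents, which is precisely the behavior a good keyword extractor needs on top of raw frequency.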

Preprocessing and data-cleaning requirements vary widely depending on the use case you are trying to solve. I will attempt to build a generalized pipeline that should work well for most NLP models, but you will always need to tune the steps to get the best results for your use case. In this story, I will focus on NLP models for topic modeling, keyword extraction, and text summarization.

Preprocessing Pipeline | Image by Author

The image above outlines the process we will follow to build the preprocessing NLP pipeline. The four steps are explained with code below, and a Jupyter notebook that implements the complete pipeline is also attached. The idea behind this pipeline is to highlight the steps that enhance the performance of machine learning algorithms applied to text data. It sits between the input data and model training.
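As a structural sketch, a pipeline like this is just a chain of functions where each stage's output feeds the next. The four stage names below are assumptions for illustration (only the cleaning step is named explicitly in this article); substitute the stages from the figure for your own use case.

```python
# Minimal pipeline skeleton. Stage names and bodies are illustrative
# assumptions, not the article's implementation.
def clean(text: str) -> str:
    return text.lower().strip()

def tokenize(text: str) -> list:
    return text.split()

def remove_stopwords(tokens: list) -> list:
    stopwords = {"a", "an", "the", "is", "of"}  # toy stopword list
    return [t for t in tokens if t not in stopwords]

def normalize(tokens: list) -> list:
    # placeholder for real stemming/lemmatization; naive plural
    # stripping is for demonstration only
    return [t.rstrip("s") for t in tokens]

def preprocess(text: str) -> list:
    # chain the stages: each step consumes the previous step's output
    return normalize(remove_stopwords(tokenize(clean(text))))

print(preprocess("The Referees of the matches"))
```

Keeping each stage as its own small function makes it easy to reorder, drop, or swap steps as your use case demands, which is exactly the kind of tuning described above.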

1. Cleaning the text

The first step in the pipeline is cleaning the input text data, which can involve several operations depending on the model you are building and the results you want. Machine learning algorithms (indeed, virtually all computer instructions) operate on numbers, which is why building a model for text data is challenging: you are essentially asking the computer to learn from something it cannot read directly, and that requires extra work.

The function below is the first stage of our pipeline and performs cleaning on the text data. It consists of numerous operations, each explained in the comments of the code.

<script src="https://gist.github.com/rjrahul24/219eb624f9003c2c235509528255ddfd.js"></script>
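If the embedded gist does not render, here is a rough sketch of the kinds of operations such a cleaning function typically performs. This is an assumption about the gist's contents, built from the article's own example (HTML tags, emoji) using only the standard library; the author's actual implementation may differ.

```python
import re

def clean_text(text: str) -> str:
    """Sketch of common text-cleaning steps (assumed, stdlib-only)."""
    text = re.sub(r"<[^>]+>", " ", text)        # strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # strip URLs
    text = re.sub(r"[^\x00-\x7F]+", " ", text)  # drop emoji / non-ASCII
    text = text.lower()                         # normalize case
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # drop punctuation
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text

print(clean_text("Great <b>GOAL</b>!! 🎉 see https://example.com"))
# -> "great goal see"
```

The order matters: tags and URLs are removed before the punctuation pass, since that pass would otherwise mangle `<`, `>`, and `://` into spaces and leave fragments of the tag and URL text behind.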

To see the performance of this function, below is an input to the function and the output that it generates.

input_text = "This is an example from a key football match tweet text with \n
a HTML tag, an emoji

Tags: Clean Code Data Science Machine Learning NLP Workflow
