Natural Language Processing For Absolute Beginners
It is mostly true that NLP (Natural Language Processing) is a complex area of computer science. Frameworks like SpaCy or NLTK are large and often require some learning. But with the help of open-source large language models (LLMs) and modern Python libraries, many tasks can be solved much more easily. Even better, results that only a few years ago were available only in scientific papers can now be achieved with just 10 lines of Python code.
Without further ado, let's get into it.
1. Language Translation
Have you ever wondered how Google Translate works? Google uses a deep learning model trained on a vast amount of text. Now, with the help of the Transformers library, this can be done not only in Google Labs but on an ordinary PC. In this example, I will be using a pre-trained T5-base (Text-to-Text Transfer Transformer) model. This model was first trained on raw text data, then fine-tuned on source-target pairs like ("translate English to German: the house is wonderful", "Das Haus ist wunderbar"). Here, "translate English to German" is a prefix that "tells" the model what to do, and the phrases are the actual content that the model should learn.
An important warning: large language models are, literally, pretty large. The T5ForConditionalGeneration class used in this example will automatically download the "t5-base" model, which is about 900 MB in size. Before running the code, make sure there is enough disk space and that your internet connection is not metered.
A pre-trained T5 model can be used in Python:
from transformers import T5Tokenizer, T5ForConditionalGeneration

# The task prefix tells the model what to do with the text
preprocessed_text = "translate English to German: the weather is good"

# Convert the source string into token IDs
tokenizer = T5Tokenizer.from_pretrained('t5-base',
                                        max_length=64,
                                        model_max_length=512,
                                        legacy=False)
tokens = tokenizer.encode(preprocessed_text,
                          return_tensors="pt",
                          max_length=512,
                          truncation=True)

# Generate the translation and decode it back to text
model = T5ForConditionalGeneration.from_pretrained('t5-base')
outputs = model.generate(tokens, min_length=4, max_length=32)
print("Result:", tokenizer.decode(outputs[0], skip_special_tokens=True))
#> Result: Das Wetter ist gut.
Here, the T5Tokenizer class converts the source string to a numerical form; this process is called tokenization. In our example, the text "translate English to German: the weather is good" is converted to the array [13959, 1566, 12, 2968, 10, 8, 1969, 19, 207, 1]. The "generate" method does the actual job, and finally, the tokenizer converts the output tokens back to text. As a result, we get "Das Wetter ist gut".
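If you are curious about what these numbers represent, the tokenizer can also map the IDs back to readable sub-word pieces (a quick sanity check; the exact pieces depend on the tokenizer's vocabulary):
# Each ID corresponds to a sub-word piece; ID 1 is the '</s>' end-of-sequence marker
print(tokenizer.convert_ids_to_tokens(tokens[0]))
#> ['▁translate', '▁English', '▁to', '▁German', ':', '▁the', '▁weather', '▁is', '▁good', '</s>']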
Can we make this code even shorter? Actually, we can. With the help of the Transformers pipeline class, we can create an abstract pipeline that does this task in only 2 lines of Python code:
from transformers import pipeline
translator = pipeline("translation_en_to_de", model="t5-base")
print(translator("the weather is good"))
#> [{'translation_text': 'Das Wetter ist gut.'}]
For self-education purposes, I generally prefer the first approach because it is easier to understand what is going on "under the hood". But for "production" purposes, the second way is much more flexible; it also allows swapping in different models without changing the code.
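To illustrate that flexibility: the model name can live in a single variable or config entry, so switching to another translation model is a one-line change. The alternative checkpoint named below, "Helsinki-NLP/opus-mt-en-de", is just one example of a publicly available English-to-German model:
from transformers import pipeline

# Only the model name changes; the surrounding code stays the same
model_name = "t5-base"  # or, for example, "Helsinki-NLP/opus-mt-en-de"
translator = pipeline("translation_en_to_de", model=model_name)
print(translator("the weather is good"))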
2. Summarization
The goal of text summarization is to transform a document into a shortened version, which would obviously take time if done manually. Surprisingly, a T5 model can do this as well; the only change needed is the prefix:
body = '''Obviously, the lunar surface is covered with craters, left
from previous collisions of meteorites with the Moon. Where does math go?
While a meteorite collision is a random event, its frequency
obeys probability theory laws. There is no atmosphere on the Moon's
surface, no erosion, and no wind. Therefore the lunar surface is an
ideal "book" in which the events of the last tens of thousands of
years are recorded. By studying the Moon, we can calculate how often
such objects fall on its surface.
A study of the lunar surface with high-resolution cameras is ongoing.
It has been estimated that at least 220 new craters have formed on the
Moon over the past 7 years. This check is also vital because
these calculations can help assess the danger to the Earth.'''
preprocessed_text = f"summarize: {body}"
tokenizer = T5Tokenizer.from_pretrained('t5-base',
max_length=256,
model_max_length=512,
legacy=False)
tokens = tokenizer.encode(preprocessed_text,
return_tensors="pt",
max_length=256,
truncation=True)
model = T5ForConditionalGeneration.from_pretrained('t5-base')
outputs = model.generate(tokens,
min_length=4,
max_length=64)
print("Result:", tokenizer.decode(outputs[0], skip_special_tokens=True))
#> the lunar surface is an ideal "book" in which the events of the last
#> tens of thousands of years are recorded. by studying the Moon, we can
#> calculate how often such objects are falling on its surface.
As we can see, the result is pretty accurate.
In the same way as with the first example, using a pipeline produces shorter code for the same task:
summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base")
summarizer(body, min_length=4, max_length=64)
#> [{'summary_text': 'the lunar surface is an ideal "book" in which the
#> events of the last tens of thousands of years are recorded . it has been
#> estimated that at least 220 new craters have formed on the Moon over the
#> past 7 years .'}]
Readers may be curious about which other tasks are possible with a "t5-base" model. We can easily print them all:
for prefix, params in model.config.task_specific_params.items():
    print(f"{prefix}: {params}")
#> summarization:
#> {'early_stopping': True, 'length_penalty': 2.0, 'max_length': 200, 'min_length': 30, 'no_repeat_ngram_size': 3, 'num_beams': 4, 'prefix': 'summarize: '}
#> translation_en_to_de:
#> {'early_stopping': True, 'max_length': 300, 'num_beams': 4, 'prefix': 'translate English to German: '}
#> translation_en_to_fr:
#> {'early_stopping': True, 'max_length': 300, 'num_beams': 4, 'prefix': 'translate English to French: '}
#> translation_en_to_ro:
#> {'early_stopping': True, 'max_length': 300, 'num_beams': 4, 'prefix': 'translate English to Romanian: '}
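Since English-to-French translation is in this list, the same model can handle it too; only the prefix changes (a minimal sketch, reusing the model and tokenizer objects already loaded above):
# Same t5-base model, different task prefix
tokens = tokenizer.encode("translate English to French: the weather is good",
                          return_tensors="pt",
                          max_length=512,
                          truncation=True)
outputs = model.generate(tokens, min_length=4, max_length=32)
print("Result:", tokenizer.decode(outputs[0], skip_special_tokens=True))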
3. Question Answering
Another interesting functionality that large language models can provide is answering questions in a given context. I will be using the same piece of text as in the previous example:
body = '''Obviously, the lunar surface is covered with craters, left
from previous collisions of meteorites with the Moon. Where does math go?
While a meteorite collision is a random event, its frequency
obeys probability theory laws. There is no atmosphere on the Moon's
surface, no erosion, and no wind. Therefore the lunar surface is an
ideal "book" in which the events of the last tens of thousands of
years are recorded. By studying the Moon, we can calculate how often
such objects fall on its surface.
A study of the lunar surface with high-resolution cameras is ongoing.
It has been estimated that at least 220 new craters have formed on the
Moon over the past 7 years. This check is also vital because
these calculations can help assess the danger to the Earth.'''
question_answerer = pipeline("question-answering",
                             model='distilbert-base-cased-distilled-squad')
result = question_answerer(question="Which surface has collision with meteorites",
                           context=body)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
#> Answer: 'the Moon', score: 0.4401, start: 93, end: 101
result = question_answerer(question="How many craters were formed",
                           context=body)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
#> Answer: 'at least 220', score: 0.5302, start: 600, end: 612
result = question_answerer(question="Is there atmosphere on the moon",
                           context=body)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
#> Answer: 'There is no', score: 0.2468, start: 220, end: 231
In this case, I used a distilbert-base-cased-distilled-squad model, which is 261 MB in size. As we can see, the model not only provides an answer but also returns its position in the original text, which is useful for verifying the results.
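The start and end values make that verification trivial: the answer is a literal span of the context, so slicing the original text at those positions should reproduce it exactly (a small sanity check, reusing the result of the last query):
# Slicing the context at the reported positions must match the answer string
span = body[result['start']:result['end']]
print(span, span == result['answer'])
#> There is no True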
4. Language Generation
Another fun task is language generation. For this example, I will be using a GPT-2 model. It is obviously not the latest GPT model available today, but GPT-2 is freely available, and it is small enough (the file size is 548 MB) to run on an ordinary PC.
Let's see how it works:
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

# Encode the prompt (this TensorFlow model expects "tf" tensors)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
input_ids = tokenizer.encode("I am going to say",
                             return_tensors='tf')

# Continue the prompt using top-k sampling
model = TFGPT2LMHeadModel.from_pretrained("gpt2",
                                          pad_token_id=tokenizer.eos_token_id)
output = model.generate(input_ids,
                        max_length=128,
                        early_stopping=True,
                        do_sample=True,
                        top_k=20)
print("Output:", tokenizer.decode(output[0], skip_special_tokens=True))
#> Output: I am going to say something. It is very hard for people to
#> believe me because I do not have to speak English. But the fact that
#> you are writing a book on it is one thing. You have to have this
#> knowledge which is not from what you read; it is from an educated...
In the same way, much shorter code can be used with a pipeline:
from transformers import pipeline
generator = pipeline('text-generation', model='gpt2')
generator("I am going to say", max_length=128, num_return_sequences=1)
#> I am going to say this: I have had several bad experiences with the
#> internet in the years since my first connection. While there ...
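Because do_sample=True makes generation stochastic, every run produces a different continuation. The usual knobs for steering the output are temperature, top_k, and top_p, which both generate and the pipeline accept; the values below are just reasonable starting points, not tuned settings:
# Lower temperature makes the text more conservative; top_p restricts
# sampling to the smallest token set whose probabilities sum to 0.9
generator("I am going to say",
          max_length=128,
          do_sample=True,
          temperature=0.7,
          top_p=0.9,
          num_return_sequences=1)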
Practically speaking, these texts do not make a lot of sense, but from a grammar perspective, they are good enough, and for some sort of automation or unit testing, they can be useful. Interestingly, the GPT-2 model was released in 2019. Just for fun, I asked the same question, "Please continue the phrase 'I am going to say'", of GPT-3.5 (released in 2022) and got this answer:
Am going to say that communication is a vital skill in today's interconnected
world. Whether it's expressing your thoughts and ideas, building relationships,
or resolving conflicts, effective communication plays a central role in almost
every aspect of our lives. It's not just about the words we use but also our
tone, body language, and the context in which we communicate.
This result is much better; great progress was made in those three years. But obviously, it would not be possible to run a GPT-3.5 model on a regular PC today, even if it were released into the public domain.
5. Sentiment Analysis
The previous example was made mostly for fun, but sentiment analysis is much more important for business. Sentiment analysis is the process of analyzing text to identify the subjective opinion it expresses. It is particularly important for web shops, streaming platforms, and any other services where many users can publish reviews.
For this test, I will be using a distilbert-base-uncased-finetuned-sst-2-english model:
from transformers import pipeline

sentiment_pipeline = pipeline("sentiment-analysis",
                              model='distilbert-base-uncased-finetuned-sst-2-english')
data = ["It was not bad",
        "I expected to love it but I was wrong"]
sentiment_pipeline(data)
#> [{'label': 'POSITIVE', 'score': 0.9995607733726501},
#> {'label': 'NEGATIVE', 'score': 0.997614860534668}]
I deliberately tried to use phrases that are not so easy to analyze, and the model still gave correct answers. Obviously, natural language is very flexible, and it is still possible to construct a text that produces a false result. For example, this model gave an incorrect answer for the phrase "I expected it to be terrible, but I was mistaken", while the (many times larger) GPT-3.5 model was able to parse it correctly. On the other hand, considering that the DistilBERT model is only 268 MB in size and can be used for free (the model has an Apache 2.0 license), the result is pretty good. Readers can also try other open-source models and choose the best one for their needs.
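By the way, reproducing that failure case takes only one line with the pipeline we have already created (the exact score will vary between model versions, so it is omitted here):
# The correct reading is positive (the "terrible" expectation was mistaken),
# but the model keys on the negative words and labels it NEGATIVE
sentiment_pipeline(["I expected it to be terrible, but I was mistaken"])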
6. Named Entity Recognition (NER)
Another interesting part of natural language processing is "named entity recognition" (NER). It is the process of extracting entities, like names, locations, dates, etc., from unstructured text. For this test, I will be using a bert-base-NER model (the file size is 433 MB).
Let's consider an example:
from transformers import pipeline

# Parentheses are required here so Python concatenates the two string literals
body = ("Hi, my name is Dmitrii. I am in London, I work in Super Company, "
        "I have a question about the hotel reservation.")
ner = pipeline("ner",
               model="dslim/bert-base-NER",
               aggregation_strategy='average')
ner(body)
#> [{'entity_group': 'PER',
#> 'score': 0.7563127,
#> 'word': 'Dmitrii',
#> 'start': 15,
#> 'end': 22},
#> {'entity_group': 'LOC',
#> 'score': 0.99956125,
#> 'word': 'London',
#> 'start': 32,
#> 'end': 38},
#> {'entity_group': 'ORG',
#> 'score': 0.99759734,
#> 'word': 'Super Company',
#> 'start': 50,
#> 'end': 63}]
As we can see, the model was able to correctly determine the main entities mentioned in the text, such as name, location, and company.
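Since each returned dictionary carries an entity_group and a confidence score, post-processing the results is plain list filtering (a minimal sketch; the 0.9 threshold is an arbitrary example value):
# Keep only confidently detected locations
locations = [e['word'] for e in ner(body)
             if e['entity_group'] == 'LOC' and e['score'] > 0.9]
print(locations)
#> ['London']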
7. Keyword Extraction
In the last example, we tested NER, but not all the useful information was extracted from the text. A separate keyword extraction algorithm can also help with the same task. For this example, I will be using KeyBERT:
from keybert import KeyBERT

# Parentheses are required here so Python concatenates the two string literals
body = ("Hi, my name is Dmitrii. I am in London, I work in Super Company, "
        "I have a question about the hotel reservation.")
kw_model = KeyBERT()
# Note: in KeyBERT, the diversity parameter only takes effect with use_mmr=True
keywords = kw_model.extract_keywords(body,
                                     keyphrase_ngram_range=(1, 1),
                                     diversity=0.8,
                                     stop_words=None)
print(keywords)
#> [('reservation', 0.5935), ('hotel', 0.5729), ('london', 0.2705),
#> ('dmitrii', 0.2), ('company', 0.1817)]
As we can see, keyword extraction can be a useful addition to NER, finding some extra data in the same phrase.
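KeyBERT can also extract multi-word keyphrases by widening the n-gram range (a minimal sketch; the exact phrases and scores will depend on the underlying embedding model):
# Allow keyphrases of one or two words instead of single keywords
keywords = kw_model.extract_keywords(body,
                                     keyphrase_ngram_range=(1, 2),
                                     stop_words=None)
print(keywords)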
Conclusion
In this article, I tested different algorithms for Natural Language Processing (NLP). As promised at the beginning of the article, with the help of modern libraries, pretty complex tasks can be solved in about 10 lines of Python code. It is also important to mention that all this code can run locally, without any API subscriptions or keys. And last but not least, I hope readers can see that NLP can also be fun.
Thanks for reading. If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors.