Topic Modeling with BERTopic in Python

Author: Murphy

Topic modeling (i.e., topic identification in a corpus of text data) has developed quickly since the Latent Dirichlet Allocation (LDA) model was published. This classic topic model, however, does not capture the relationships between words well because it is based on the statistical concept of a bag of words. Recent embedding-based models such as Top2Vec and BERTopic address its drawbacks by exploiting pre-trained language models to generate topics.

In this article, we'll use Maarten Grootendorst's (2022) BERTopic to identify the terms representing topics in political speech transcripts. It outperforms most traditional and modern topic models in topic modeling metrics on various corpora and has been used in companies, academia (Chagnon, 2024), and the public sector. We'll explore in Python code:

  • how to effectively preprocess data
  • how to create a bigram topic model
  • how to explore the most frequent terms over time.

1. Example data

As an example dataset, we'll use the Empoliticon: Political Speeches-Context & Emotion dataset, released under the Attribution 4.0 International license as part of the Efat et al. (2023) paper. It contains 2010 transcripts of political speeches by the presidents/prime ministers of the USA, UK, China, and Russia. To keep the topic model focused, we'll work with a subset of the 556 speeches by Russian leaders:

Source: Empoliticon: Political Speeches-Context & Emotion dataset

2. Data pre-processing

Working with text datasets is complex. Cleaning alone involves several steps that should systematically remove all unnecessary information from the dataset. Check all requirements for this project here.

2.1. Fixing mojibake errors

Mojibake is a Japanese word for the confusing text that results from character-encoding errors. Here is an example:

Mojibake example

It is useful to include this step right at the beginning of the cleaning. Correcting encoding-related errors is simple:

<script src="https://gist.github.com/PetrKorab/5e42fc26362e392688263eb42eec4d2d.js"></script>
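The gist above handles this step; libraries such as ftfy automate the repair. As a minimal stdlib sketch of the same idea, the most common mojibake (UTF-8 bytes mistakenly decoded as cp1252) can be reversed with a codec round-trip:

```python
def fix_mojibake(text: str) -> str:
    """Repair cp1252-mangled UTF-8 text; return the input unchanged
    if it isn't that kind of mojibake (the round-trip fails)."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

print(fix_mojibake("RenÃ©"))  # → René
```

In real pipelines, prefer a dedicated library over this sketch: it covers only the cp1252/UTF-8 case, while tools like ftfy detect and repair many more encoding mix-ups.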

2.2. Cleaning special characters, punctuation, and numbers

This step should come right after fixing the encoding errors. The simplest way is to use the cleantext library. Also, consider lower-casing. Does "labor" mean the same as "Labor" in the dataset? In case it does, add a lowercase parameter and apply the cleaning function:

<script src="https://gist.github.com/PetrKorab/385d1fc6362188bd342c5ae0ff0df57a.js"></script>
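The gist above relies on the cleantext library; the regex-based stdlib sketch below shows what that cleaning step does (the function name and patterns are illustrative, not the library's API):

```python
import re

def basic_clean(text: str, lower: bool = True) -> str:
    """Remove numbers, punctuation, and special characters;
    optionally lowercase. A stdlib sketch of a cleantext-style step."""
    text = re.sub(r"\d+", " ", text)          # drop numbers
    text = re.sub(r"[^\w\s]", " ", text)      # drop punctuation / special chars
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text.lower() if lower else text

print(basic_clean("In 2012, the Labor market grew by 3.5%!"))
# → in the labor market grew by
```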

2.3. Define the stopwords removal strategy

Removing the standard list of stopwords is generally a necessary step. Depending on the project focus, it might also be useful to clean the data with an additional list of stopwords that don't add any value. As the BERTopic documentation states:

Removing stop words as a preprocessing step is not advised as the transformer-based embedding models that we use need the full context to create accurate embeddings.

Instead, we use CountVectorizer to preprocess our documents after having generated embeddings during the topic model generation.

3. Topic generation

With a cleaner dataset, it is now possible to remove English stopwords along with a list of additional stopwords, generate a bigram topic model, and apply it to the data.

<script src="https://gist.github.com/PetrKorab/7e6689c1fb2a16690f8cd70a95136ce0.js"></script>

Note that the nr_topics parameter is set to 7 to generate 6 topics; the remaining topic keeps the outliers.

4. Topic visualization

In the next step, let's visualize the data in a heatmap to present the results better. Here is the outcome:

Figure 1: Heatmaps with bigrams and their probabilities, Image by Author

To do so, we'll extract bigrams and their probabilities from the topic model and create a data frame for each of the 6 topics:

<script src="https://gist.github.com/PetrKorab/59a9ee9fb6e9782a890617408cbd4a2f.js"></script>
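BERTopic exposes each topic's terms and their weights through get_topic(topic_id), which returns a list of (term, weight) pairs. The fitted model isn't available here, so the sketch below mocks that output with hypothetical values to show the reshaping into rows a heatmap can consume:

```python
# mock_topics stands in for {i: topic_model.get_topic(i) for i in range(6)};
# the bigrams and probabilities are hypothetical.
mock_topics = {
    0: [("armed forces", 0.031), ("russian federation", 0.027)],
    1: [("economic growth", 0.024), ("living standards", 0.019)],
}

# Flatten into one record per (topic, bigram) pair -- the long format
# that plotting libraries expect for a heatmap.
heatmap_rows = []
for topic_id, terms in mock_topics.items():
    for bigram, prob in terms:
        heatmap_rows.append({"topic": topic_id, "bigram": bigram, "probability": prob})

print(heatmap_rows[0])
# → {'topic': 0, 'bigram': 'armed forces', 'probability': 0.031}
```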

Next, this code creates the heatmap in Figure 1:

<script src="https://gist.github.com/PetrKorab/90643177780797b1649a08ef3bc3566d.js"></script>

5. Token frequencies over time

Now, we'll add a perspective on the development of bigrams over time. The goal is to see in which years the bigrams in the Russian leaders' speeches were spoken most frequently. The heatmap in Figure 2 displays the frequencies of the 5 most frequent bigrams for each year.

Figure 2: Heatmaps with bigrams and their frequencies by year, Image by Author

The arabica library, published in the Journal of Open Source Software (Koráb & Poměnková, 2024), was developed for this purpose.

EDIT Jul 2024: Arabica has been updated. Check the documentation for the full list of parameters.

Here is the code generating the heatmap in Figure 2:

<script src="https://gist.github.com/PetrKorab/ba3e3d26532bfde675ce9ef7118c44de.js"></script>
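Under the hood, what Arabica computes here amounts to counting n-gram frequencies per period. A stdlib sketch of that idea on a hypothetical mini-corpus of (year, cleaned transcript) pairs:

```python
from collections import Counter

# Hypothetical mini-corpus: (year, cleaned transcript) pairs.
speeches = [
    (2005, "armed forces protect russian federation"),
    (2005, "armed forces law enforcement"),
    (2013, "education health care education reform"),
]

# Absolute bigram frequencies per year -- the core of what Arabica's
# time-aware frequency functions compute before plotting.
freq_by_year = {}
for year, text in speeches:
    tokens = text.split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    freq_by_year.setdefault(year, Counter()).update(bigrams)

print(freq_by_year[2005].most_common(1))
# → [('armed forces', 2)]
```

From such per-year counters, taking most_common(5) for each year yields exactly the kind of matrix Figure 2 visualizes.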

Conclusions

This article briefly introduced topic modeling with BERTopic. The model's framework offers many extensions, fine-tuning, and visualization methods (see the documentation). Let's summarize the key findings:

  • the topic model shows 6 distinct topics: defense policy (topic 1), economic development (topic 2), WW2 (topic 3), internal policies (topic 4), healthcare and demographics (topic 5), and education (topic 6).
  • combining BERTopic with Arabica, we can see that the foreign and defense policy topics ("armed forces", "russian federation", "law enforcement") were discussed more frequently before 2012, while education- and healthcare-related topics appear only at low frequencies, especially after 2010.
  • since Arabica returns absolute frequencies, the counts also reflect that the dataset contains more WW2, foreign, and defense policy terms. However, it's difficult to interpret the results well without knowing the regional context.

My previous article briefly explains a simpler approach to topic modeling with LDA. The complete code for this tutorial is on my GitHub.

If you enjoy my work, you can invite me for coffee and support my writing. You can also subscribe to my email list to get notified about my new articles. Thanks!

References

[1] Blei, Ng, Jordan (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research 3, pp. 993–1022.

[2] Chagnon, Pandolfi, Donatelli, Ushizima (2024). Benchmarking topic models on scientific articles using BERTeley. Natural Language Processing Journal 6.

[3] Efat, Atiq, Abeed, Momin, Alam (2023). Empoliticon: NLP and ML Based Approach for Context and Emotion Classification of Political Speeches from Transcripts. IEEE Access, vol. 11.

[4] Grootendorst (2022). BERTopic: Neural Topic Modeling with a Class-Based TF-IDF Procedure. arXiv preprint arXiv:2203.05794.

[5] Koráb, Poměnková (2024). Arabica: A Python package for exploratory analysis of text data. Journal of Open Source Software. https://doi.org/10.5281/zenodo.10866697.

Tags: BERTopic Data Science Python Text Mining Topic Modeling
