Techniques for Chat Data Analytics with Python


Photo by Mikechie Esparagoza, obtained from Pexels.com

In the first part of this series, I introduced you to my artificially created friend John, who was nice enough to provide us with his chats with five of the closest people in his life. We used just the metadata, such as who sent messages at what time, to visualize when John met his girlfriend, when he had fights with one of his best friends and which family members he should write to more often. If you didn't read the first part of the series, you can find it here.

What we haven't covered yet, and will dive deeper into now, is an analysis of the actual messages. To do this, we will use the chat between John and Maria to identify the topics they discuss. And of course, we will not go through the messages one by one and classify them by hand – no, we will use the Python library BERTopic to extract the topics that the chats revolve around.

What is BERTopic?

BERTopic is a topic modeling technique introduced by Maarten Grootendorst that uses transformer-based embeddings, specifically BERT embeddings, to generate coherent and interpretable topics from large collections of documents. It was designed to overcome the limitations of traditional topic modeling approaches like LDA (Latent Dirichlet Allocation), which often struggle to handle short texts or produce consistent topics across different document collections.

In this blog, I will not dive into the theoretical background of BERTopic – if you are interested in that, I highly recommend the articles by the BERTopic legend himself, Maarten Grootendorst, listed in the references at the end of this post.

If you want to follow along, you should install BERTopic using pip, along with the sentence-transformers package, which we will use for the model.

pip install sentence-transformers
pip install bertopic

The Data

We will use chat data artificially created by ChatGPT. If you'd like to extract your own chats from WhatsApp and follow the topic extraction process, you can read this blog to see how I did it. I won't go into detail about the transformation steps (though a rough sketch follows the column list below), but you can find the Python code on my GitHub and my structured example data [here](https://github.com/Robinvm96/Chat-Analytics/blob/main/02_Chatdata.xlsx). After applying the transformations, we will arrive at the following data structure:

Image by Author
  • Date: When the message was sent
  • Chat: Which chat – in our case always maria
  • Author: Who sent the message
  • Message: The content of the message
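
If you want to build this structure yourself rather than using my code, the parsing boils down to a few lines. Below is a rough sketch, not the exact transformation from my repository: it assumes a standard WhatsApp .txt export with lines like "01.02.23, 19:39 - Maria: Hello", and the file name is hypothetical. The exact date format varies by locale, so adjust the regex accordingly.

import re
import pandas as pd

# Assumed export line format: "DD.MM.YY, HH:MM - Author: Message"
pattern = re.compile(r"^(\d{2}\.\d{2}\.\d{2}), (\d{2}:\d{2}) - ([^:]+): (.+)$")

rows = []
with open("WhatsApp Chat with Maria.txt", encoding="utf-8") as f:  # hypothetical file name
    for line in f:
        match = pattern.match(line.strip())
        if match:
            date, time, author, message = match.groups()
            rows.append({"Date": f"{date} {time}", "Chat": "maria",
                         "Author": author, "Message": message})
        elif rows:
            # Lines without a timestamp continue the previous message
            rows[-1]["Message"] += " " + line.strip()

data = pd.DataFrame(rows)
data["Date"] = pd.to_datetime(data["Date"], format="%d.%m.%y %H:%M")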

Topic Extraction

The amazing thing about BERTopic is that not much data preprocessing is necessary. The idea is to keep things as simple as possible, allowing users to focus on extracting meaningful insights without getting bogged down in cleaning steps.

import pandas as pd
from bertopic import BERTopic

data = pd.read_excel(r"02_Chatdata.xlsx")  # load your data, e.g. the example file from my GitHub

In the next step we load our model and apply it to our data.

# Create the topic model with a sentence-transformers embedding model and fit it
# on the messages; fit_transform returns a topic id and probability per message
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")
topics, probs = topic_model.fit_transform(data['Message'])

To get a first impression, I will start with an overview of the topics generated. This includes how many topics have been created, which words represent these topics and which sentences are included in each. Here, we also touch on the important core idea behind topic creation: it is not the case that each topic returns exactly one word. Instead, topics usually consist of a collection of words because a single word cannot capture all the nuances of a topic. This approach allows users more opportunities to interpret each cluster.

Input:

topic_model.get_topic_info().head(5)

Output:

Image by Author

Each topic is labeled with a number, with the label "-1" indicating outliers that can't be assigned to any specific topic. Currently, I am displaying only the first five topics. My analysis identified a total of 23 topics based on 1,090 messages, with around 30% of all messages classified as outliers. We could dive deeper into these outliers to determine whether they truly don't fit any topic or if they contain content irrelevant to the identified topics. However, since 70% of the messages are clearly assigned to topics, I will focus on those.
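
If you did want to shrink the outlier share instead, newer BERTopic releases offer a dedicated method for it. A minimal sketch, assuming BERTopic 0.13 or later, where reduce_outliers is available:

docs = data["Message"].tolist()

# Reassign outlier messages (-1) to their closest topic
new_topics = topic_model.reduce_outliers(docs, topics)

# Recompute the topic representations with the new assignments
topic_model.update_topics(docs, topics=new_topics)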

From topics 0 through 4, we can already glean some initial insights into the clusters. For instance, Topic 1 appears to focus on cocktails, while Topic 3 seems to involve discussions about teachers and students. This provides a preliminary impression of what the messages in each topic might entail, though it's too early to draw any firm conclusions. On the other hand, Topics 0 and 2 appear to contain more generic terms that might be considered stopwords rather than topic-specific words. While Topic 0 could perhaps be categorized as "Plans," Topic 2 lacks any clear keywords that suggest a specific topic. So simply looking at the first five topics already gives us some relevant insights:

  • 70% of the messages are assigned to a topic.
  • Topics 1 and 3 already give a good impression of their focus (cocktails & teachers).
  • Topic 0 might relate to planning, but no clear topic stands out otherwise.
  • Topic 2 requires further inspection, as no clear topic can be assigned to it.

We can keep these initial insights in mind as we continue with our analysis. While I won't be doing it right now, it could be interesting to create a sorted bar chart based on the count of messages assigned to each topic, along with the topics themselves. This would give you an impression of whether the topics are equally important in the conversation, or if the distribution is skewed, with just a few topics dominating the discussion with your friend. I'll skip this analysis for now and move directly to examining the topics themselves.
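
For completeness, here is a minimal sketch of that bar chart in case you want it, assuming matplotlib is installed:

import matplotlib.pyplot as plt

# Message counts per topic, leaving out the outlier bucket (-1)
info = topic_model.get_topic_info()
info = info[info["Topic"] != -1]

# Horizontal bars, sorted so the largest topic ends up on top
info.sort_values("Count").plot.barh(x="Name", y="Count", legend=False, figsize=(8, 6))
plt.xlabel("Number of messages")
plt.tight_layout()
plt.show()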

As you may have noticed, the "name" column contains the topic number followed by underscores and several words. The order of these words generally reflects their significance to the topic. While the first word may carry substantial weight, in some cases, the significance is more evenly distributed across the first few words. To analyze this, we'll use some visualization functions integrated into BERTopic.

Let's start with simple bar charts:

Input:

topic_model.visualize_barchart(topics=list(range(23)))

Output:

Image by Author

This visualization helps to identify how the importance of various words differs within each topic cluster. In some topics, we can clearly see that certain words have a higher importance than others. This indicates that these words should be central to labeling the topics. The following topics appear to have one or two significant keywords:

  • Topic 3: Teachers
  • Topic 8: Family
  • Topic 10: Church
  • Topic 12: Our Baby
  • Topic 13: Travel & Trips
  • Topic 15: Exactly
  • Topic 16: Proposal
  • Topic 17: Parents
  • Topic 19: Marriage
  • Topic 20: We

These are already quite strong topics that you could connect to specific groups of sentences, although Topics 15 ("Exactly") and 20 ("We") are still unclear.

Another observation from the bar chart is that some topics help form a clearer picture of what John and Maria are writing about. What's particularly interesting is not only whether a single word dominates a cluster, but also when multiple words belong to the same logical family. For example, the following topics could likely be grouped together:

  • Topic 1: Cocktails
  • Topic 5: Car Accident
  • Topic 7: Extreme Sport
  • Topic 11: Communication
  • Topic 22: Going out with friends

As you review the clusters, you may notice a common challenge with topic extraction: it will never be perfect or entirely automated. While many topics make sense – such as Church, Marriage, Family, and Travel – there are also topics that require further investigation, like Topics 15 and 20. These may represent stopwords that were frequently used.
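
One common remedy for stopword-dominated topics is to strip stopwords from the topic representations. A minimal sketch using scikit-learn's CountVectorizer, which BERTopic accepts as a vectorizer_model; note this changes the keywords shown per topic, not the underlying clustering:

from sklearn.feature_extraction.text import CountVectorizer

# Remove English stopwords from the topic keyword representations
# without refitting the whole model
vectorizer_model = CountVectorizer(stop_words="english")
topic_model.update_topics(data["Message"].tolist(), vectorizer_model=vectorizer_model)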

Now, let's recap the insights we generated from the bar charts and the Topic Word Scores analysis:

  • A group of topics is dominated by one or two words within their clusters, and most of them make logical sense.
  • A group of topics provides a clear indication of what the conversation is about because many of the words in these clusters belong to a logical context.
  • Some topics still don't make sense because they do not form a clear cluster.

With this in mind, let's proceed by visualizing the entire landscape of messages assigned to their respective categories.

Input:

topic_model.visualize_documents(data['Message'], topics=list(range(23)), custom_labels=True, height=600)

Output:

Image by Author

Each bubble in this visualization represents a message sent by either John or Maria. The colors correspond to their respective topics. The axes are labeled with values obtained from dimensionality reduction, so they don't have a direct interpretation. When topics are positioned close to each other, it indicates semantic similarity between them: the topics share related themes, vocabulary, or contextual meanings within the messages.

As you can see, this static view doesn't provide much insight on its own. However, when created in Python, the visualization allows for interactive exploration of the message universe, enabling you to view the individual messages. From there, you can decide what to do with certain clusters – such as merging them with others or removing them entirely.
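
Merging is built into BERTopic. A minimal sketch; the topic numbers here are purely illustrative:

# Merge clusters that turn out to cover the same theme;
# BERTopic recalculates the topic representation afterwards
topic_model.merge_topics(data["Message"].tolist(), topics_to_merge=[15, 20])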

For simplicity, I will select a few clusters to make the visualization clearer and further group the topics.

Input:

# Specify the topics you want to visualize
selected_topics = [1,3,5,7,8,10,12,13,16,17,19]

# Visualize only the selected topics
topic_model.visualize_documents(data['Message'], topics=selected_topics, custom_labels=True, height=600)

Output:

Image by Author

To make the topics even more accessible, I will now label them.

Input:

# Label Topics
topic_model.set_topic_labels({
    1: "Cocktails", 3: "Teacher", 5: "Car Accident", 7: "Extreme Sport",
    8: "Family", 10: "Church", 12: "Our Baby", 13: "Travel",
    16: "Proposal", 17: "Parents", 19: "Marriage",
})

# Specify the topics you want to visualize
selected_topics = [1,3,5,7,8,10,12,13,16,17,19]

# Visualize only the selected topics
topic_model.visualize_documents(data['Message'], topics=selected_topics, custom_labels=True, height=600)

Output:

Image by Author

Now, we've clearly identified a portion of what John and Maria are discussing. Remember, the closer the topics are to each other, the higher their semantic similarity.

Let's try grouping the topics. One group seems to revolve around Family, Marriage, Proposal, and Our Baby. This strongly suggests that John and Maria are a married couple who either have children or are planning to have them. This appears to be a significant theme in their lives.

The second major theme seems to center around their leisure activities. They're discussing topics such as Church, Traveling, Extreme Sports, and the Car Accident. If we wanted to dive deeper into this, we could perform further analyses, such as sentiment analysis on the messages within these topics. For example, the Extreme Sports topic might have a more negative tone for Maria than for John; she could be trying to convince him to stop. Understanding how each person feels about certain topics could offer valuable insights into the nature of their discussions. However, for now this would be just speculation.
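
As a taste of what such an analysis could look like, here is a minimal sketch using NLTK's VADER sentiment analyzer – a swap-in choice on my part, not something used elsewhere in this series. The topic number is the one we labeled "Extreme Sport" above:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()

# Attach the topic assignments from fit_transform to the DataFrame
data["Topic"] = topics

# Topic 7 was labeled "Extreme Sport" above
extreme_sport = data[data["Topic"] == 7].copy()
extreme_sport["Sentiment"] = extreme_sport["Message"].map(
    lambda m: sia.polarity_scores(m)["compound"]
)

# Average compound score per author: below 0 leans negative, above 0 positive
print(extreme_sport.groupby("Author")["Sentiment"].mean())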

Finally, I would categorize the Teacher and Cocktails topics as separate clusters, as they don't seem to fit well with the others. It's interesting that the Teacher cluster stands out so clearly, because after reading the messages, we can see that John and Maria were actually discussing the shortage of teachers in schools.


Conclusion

In this blog post, we used the Python library BERTopic to analyze John's chat with Maria. By applying the model to their conversation, we identified clear and personal topics they discussed. While we cannot draw definitive conclusions without deeper exploration of their communication, we can already infer several things. For example, it seems that one of them is religious or at least has a connection to the church. We also observed that their relationship appears to be a serious one, likely indicating they are married, have children, or are perhaps planning to start a family. Additionally, we uncovered that their hobbies include extreme sports, and even a car accident was part of their conversation.

Through this analysis, we've shown that by applying topic modeling to their chat, it's not necessary to read through all 1,090 messages to get a clear sense of the key topics they are discussing. This approach provides a quick and effective way to understand the central themes in a conversation.

However, we have only scratched the surface of topic extraction by identifying just a portion of what John and Maria were talking about. There are many more avenues to explore (a code sketch follows the list):

  • Correlation Heatmap: Which topics are worth merging?
  • Topics Over Time: Did the topics evolve as the conversation progressed?
  • Topic Hierarchy: Is there a hierarchy among the topics?
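
Each of these maps to a built-in BERTopic function. A minimal sketch, noting that the topics_over_time signature may differ slightly between BERTopic versions:

# Correlation heatmap: similarity between topics, i.e. candidates for merging
topic_model.visualize_heatmap()

# Topics over time, using the Date column as timestamps
topics_over_time = topic_model.topics_over_time(
    data["Message"].tolist(), data["Date"].tolist()
)
topic_model.visualize_topics_over_time(topics_over_time)

# Topic hierarchy: a dendrogram of how topics nest into broader themes
topic_model.visualize_hierarchy()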

Thanks for joining me on this journey through chat analysis! If you enjoyed exploring the intricacies of John and Maria's conversations, I'd appreciate a clap or a follow – your support fuels my creativity!

If you haven't read it yet, check out the first part of the series to see what happens when you apply the model to your own WhatsApp chats about your relationships with family and friends! The code and analysis for this blog post can be found on my GitHub profile.



Egger, R. and Yu, J., 2022. A topic modeling comparison between LDA, NMF, Top2Vec, and BERTopic to demystify Twitter posts. Frontiers in Sociology, 7, p.886498. doi: 10.3389/fsoc.2022.886498. Available at: https://pmc.ncbi.nlm.nih.gov/articles/PMC9120935/ [Accessed 29 October 2024].

Grootendorst, M., 2020. Topic Modeling with BERT. Towards Data Science. Available at: https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6 [Accessed 31 October 2024].

Grootendorst, M., 2021. Interactive Topic Modeling with BERTopic. Towards Data Science. Available at: https://towardsdatascience.com/interactive-topic-modeling-with-bertopic-1ea55e7d73d8 [Accessed 31 October 2024].

Mitra Mirshafiee, 2020. The Big Bang Theory Series Transcript. Kaggle. Available at: https://www.kaggle.com/datasets/mitramir5/the-big-bang-theory-series-transcript [Accessed 2 November 2024].

Python Tutorials for Digital Humanities, 2024. How to use BERTopic – Machine Learning Assisted Topic Modeling in Python. YouTube. Available at: https://www.youtube.com/watch?v=v3SePt3fr9g [Accessed 29 October 2024].
