Sentiment Analysis and Structural Breaks in Time-Series Text Data

Introduction
Text data contains a wealth of qualitative information that can be quantified in various ways, including sentiment analysis. Sentiment models identify, extract, and quantify emotions from text data and are widely used in business and academic research. Because text is often recorded on a time-series basis, text datasets may display structural breaks as the quantitative information changes due to many possible factors.
As a business analyst, you might need to measure changes in customer perceptions of a particular brand. As a researcher, you might be interested in shifts in Vladimir Putin's public statements over time. Arabica is a Python library specifically designed to deal with such questions. It contains these methods for exploratory analysis of time-series text datasets:
- arabica_freq for descriptive n-gram-based exploratory analysis (EDA)
- cappuccino is a visualization module including heatmap, word cloud, and line plot for unigram, bigram, and trigram frequencies
- coffee_break enables sentiment and structural break analysis.
This article will introduce you to coffee_break, the sentiment and structural break analysis module. Read the documentation and these tutorials for the first two methods: arabica_freq, cappuccino.
EDIT Jul 2023: Arabica has been updated. Check the documentation for the full list of parameters.
2. Coffee_break: algorithm and structure
The coffee_break module has a simple backend architecture. Here is, schematically, how it works:

Raw text is cleaned with cleantext, which removes punctuation and numbers. Stop words (the most common words in a language, carrying no significant meaning) are not removed in the pre-processing step because they don't negatively affect sentiment analysis. However, the skip parameter lets us remove a list of additional stop words or unwanted strings (words or word sequences) that should not enter the sentiment analysis.
Sentiment analysis implements VADER (Valence Aware Dictionary and Sentiment Reasoner), a general-purpose pre-trained sentiment classifier [1]. It was trained on social media data from Twitter, but it also works very well on other types of datasets. My previous article offers a more detailed introduction to the model and coding in Python.
Coffee_break uses VADER's compound indicator for sentiment evaluation. The aggregate sentiment is calculated as the mean of the compound scores over each period:

sentiment_t = (1 / n_t) * Σ compound_i,  i = 1, …, n_t

where t is the aggregation period and n_t is the number of texts in period t. The aggregate indicator ranges over [-1, 1], with values closer to 1 indicating positive sentiment and values approaching -1 indicating negative sentiment.
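Assuming the aggregate is the period mean of per-text compound scores, the aggregation step can be sketched with pandas resample (the column names here are illustrative):

```python
import pandas as pd

# hypothetical per-tweet compound scores with their timestamps
scores = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-05', '2020-01-20', '2020-02-10']),
    'compound': [0.6, -0.2, 0.4],
})

# aggregate sentiment per period t: the mean compound score in each month
monthly = scores.set_index('date')['compound'].resample('M').mean()
print(monthly.round(2).tolist())  # → [0.2, 0.4]
```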
The aggregated sentiment creates a time series displaying some degree of variability over time. Structural breaks in the time series are identified with the Fisher-Jenks algorithm, or Jenks Optimisation Method, originally proposed by George F. Jenks [2].
It is a clustering-based method designed to find the best arrangement of values into different classes (clusters). The jenks_breaks function, implemented with the jenkspy library, returns a list of values that correspond to the class limits. These structural breaks are marked in the plot as vertical lines and visually indicate the breakpoints in the time series of text data.
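To build intuition for what jenks_breaks returns, here is a brute-force sketch of the same natural-breaks idea: choose the class boundaries that minimise the within-class sum of squared deviations (jenkspy does this far more efficiently; the function name below is made up for illustration):

```python
from itertools import combinations
import numpy as np

def natural_breaks(values, n_classes):
    """Brute-force 1-D natural breaks: try every way to cut the sorted
    values into n_classes groups and keep the cut with the smallest
    total within-class sum of squared deviations."""
    x = np.sort(np.asarray(values, dtype=float))
    best_sse, best_cuts = float('inf'), None
    for cuts in combinations(range(1, len(x)), n_classes - 1):
        groups = np.split(x, cuts)
        sse = sum(((g - g.mean()) ** 2).sum() for g in groups)
        if sse < best_sse:
            best_sse, best_cuts = sse, cuts
    # jenkspy-style output: [minimum, upper limit of each class ..., maximum]
    return [float(x[0])] + [float(x[c - 1]) for c in best_cuts] + [float(x[-1])]

print(natural_breaks([1, 2, 3, 10, 11, 12], 2))  # → [1.0, 3.0, 12.0]
```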
The implemented libraries are Matplotlib (visualization), vaderSentiment (sentiment analysis), and jenkspy (structural breaks). Pandas and Numpy handle the data processing.
3. Use case: Twitter sentiment analysis
Let's illustrate the coding on the Pfizer Vaccine Tweets dataset collected using the Twitter API. The data contains 11 000 tweets about the Pfizer & BioNTech vaccine posted between 2006 and 2021. The dataset is released under the CC0: Public Domain license in accordance with the Twitter developer policy.
The data contains a lot of punctuation and numbers and needs cleaning before any further steps:
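If you prefer to clean the text yourself rather than rely on preprocess = True, a simple regex-based sketch of the same kind of cleaning (punctuation and number removal; the helper name is hypothetical) looks like this:

```python
import re

def basic_clean(text):
    """Rough sketch: drop digits and punctuation, collapse whitespace."""
    text = re.sub(r'\d+', ' ', text)      # remove numbers
    text = re.sub(r'[^\w\s]', ' ', text)  # remove punctuation
    return re.sub(r'\s+', ' ', text).strip().lower()

print(basic_clean("Got my 2nd dose!! #vaccinated"))  # → 'got my nd dose vaccinated'
```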

The coffee_break method's parameters are:
def coffee_break(text: str,              # Text column
                 time: str,              # Time column
                 date_format: str,       # Date format: 'eur' - European, 'us' - American
                 time_freq: str = '',    # Aggregation period: 'Y'/'M'
                 preprocess: bool = False, # Clean data from numbers and punctuation
                 skip: list = None,      # Remove additional stop words
                 n_breaks: int = None    # Number of breaks: min. 2
)
3.1. Sentiment analysis over time
Our data has a 15-year time span covering the Covid-19 crisis. Changes in the public mood about vaccination, fake news about vaccines, and many other factors are expected to lead to significant variations in sentiment over time.
Coding
First, import coffee_break:
from arabica import coffee_break
Arabica reads dates in US-style (MM/DD/YYYY) and European-style (DD/MM/YYYY) date and datetime formats. The data is pretty raw and covers 15 years, so displaying sentiment by month is not very helpful.
Let's clean the data and aggregate sentiment by year with this code:
coffee_break(text = data['text'],
time = data['date'],
date_format = 'eur', # Read dates in European format
time_freq = 'Y', # Yearly aggregation
preprocess = True, # Clean data - punctuation + numbers
skip = None, # No other stop words removed
n_breaks = None) # No structural break analysis
Results
Arabica returns a figure that can be manually saved as a PNG or JPEG.

At the same time, Arabica returns a dataframe with the corresponding data. The table can be saved simply by assigning the function's output to an object:
# generate a dataframe
df = coffee_break(text = data['text'],
time = data['date'],
date_format = 'eur',
time_freq = 'Y',
preprocess = True,
skip = None,
n_breaks = None)
# save it as a csv
df.to_csv('sentiment_data.csv')
Results interpretation: we can see that sentiment dropped significantly after Pfizer vaccines started being used to tackle Covid in 2021 (Figure 2). The likely reason is the global pandemic and the generally negative public mood in those years.
3.2. Structural break analysis
Next, let's formalize the structural breaks in sentiment statistically. Coffee_break enables the identification of a minimum of two breakpoints. The following code returns a figure with three breakpoints marked by vertical lines, along with the table containing the corresponding time series:
coffee_break(text = data['text'],
time = data['date'],
date_format = 'eur', # Read dates in European format
time_freq = 'Y', # Yearly aggregation
preprocess = True, # Clean data
skip = None, # No other stop words removed
n_breaks = 3) # 3 breakpoints
The figure:

Subsetting the data to the two Covid years (2020–2021), we might observe monthly changes in public sentiment, keeping n_breaks = 3 and setting time_freq = 'M':
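The subsetting step itself is plain pandas; here is a sketch assuming the same 'date' column in European format (the toy rows are illustrative):

```python
import pandas as pd

data = pd.DataFrame({
    'text': ['early tweet', 'covid-era tweet', 'another covid tweet'],
    'date': ['10/05/2015', '15/03/2020', '02/11/2021'],  # DD/MM/YYYY
})

# parse European-style dates and keep only the 2020-2021 rows
dates = pd.to_datetime(data['date'], dayfirst=True)
covid = data[(dates >= '2020-01-01') & (dates <= '2021-12-31')]
print(len(covid))  # → 2
```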

The graph is not very informative. There are only 1 577 rows for 24 time observations in this subset, and after cleaning the raw data, the time series is very volatile. Drawing conclusions from a clustering-based algorithm on such a limited volume of data is not a good idea.
Results interpretation: the structural break analysis at yearly frequency statistically confirmed what we could see from the time series of sentiment in Figure 3. The Fisher-Jenks algorithm identified three structural breaks: in 2009, 2017, and 2021. We can only guess what caused the declines in 2009 and between 2016 and 2018. The 2021 drop is likely explained by the Covid-19 crisis.
4. Best practices for structural break analysis
Let's summarize the recommendations for the most effective use of coffee_break:
- don't use structural break analysis if there are NaN values in the corresponding time series.
- identifying more than 3 breakpoints makes sense only in longer time series (at least 12 observations).
- breakpoint identification might not work well on highly volatile datasets. Dramatic changes may reflect data quality rather than genuine shifts in sentiment.
- the analysis is only as correct as the underlying sentiment data. Before actual use, briefly explore the raw text dataset to check that (1) the number of rows per period is not too imbalanced and (2) it contains enough information for sentiment evaluation (texts are not too short and don't consist mostly of digits and special characters).
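The balance check from point (1) is a one-liner; here is a sketch with a deliberately imbalanced toy dataset:

```python
import pandas as pd

data = pd.DataFrame({
    'date': pd.to_datetime(['2020-06-01'] * 3 + ['2021-06-01'] * 50),
})

# rows per aggregation period - a heavy imbalance like this one makes
# the yearly sentiment estimates hard to compare across periods
rows_per_year = data['date'].dt.year.value_counts().sort_index()
print(rows_per_year.loc[2020], rows_per_year.loc[2021])  # → 3 50
```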

Conclusion
A drawback of coffee_break is that it currently only works with English texts. Because Arabica is mainly a Pandas-based package (with Numpy vectorization in some parts), coffee_break is rather slow in evaluating large datasets. It is time-efficient for datasets of up to approximately 40 000 rows.
Read these tutorials to find out more about n-gram and sentiment analysis and visualization of time-series text data:
- Visualization Module in Arabica Speeds Up Text Data Exploration
- Customer Satisfaction Measurement with N-gram and Sentiment Analysis
- Research Article Meta-data Description Made Quick and Easy
- The Most Favorable Pre-trained Sentiment Classifiers in Python
Coffee_break has been developed in cooperation with Prof. Jitka Poměnková (Brno University of Technology). The complete code in this tutorial is on my GitHub.
Did you like the article? You can invite me for coffee and support my writing. You can also subscribe to my email list to get notified about my new articles. Thanks!

References
[1] Hutto, C., Gilbert, E. (2014). VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text. Proceedings of the International AAAI Conference on Web and Social Media, 8(1), 216–225.
[2] Jenks, G.F. (1977). Optimal data classification for choropleth maps. University of Kansas, Department of Geography-Meteorology, Occasional Paper No. 2.