Lexicon-Based Sentiment Analysis Using R

During the Covid-19 pandemic, I decided to learn a new statistical technique to keep my mind occupied rather than constantly immersing myself in pandemic-related news. After evaluating several options, I found the concepts related to natural language processing (NLP) particularly captivating. So, I opted to delve deeper into this field and explore one specific technique: sentiment analysis, also known as "opinion mining" in academic literature. This analytical method empowers researchers to extract and interpret the emotions conveyed toward a specific subject within the written text. Through sentiment analysis, one can discern the polarity (positive or negative), nature, and intensity of sentiments expressed across various textual formats such as documents, customer reviews, and social media posts.
Amidst the pandemic, I observed a significant trend among researchers who turned to sentiment analysis as a tool to measure public responses to news and developments surrounding the virus. This involved analyzing user-generated content on popular social media platforms such as Twitter, YouTube, and Instagram. Intrigued by this methodology, my colleagues and I endeavored to contribute to the existing body of research by scrutinizing the daily briefings provided by public health authorities. In Alberta, Dr. Deena Hinshaw, who used to be the province's chief medical officer of health, regularly delivered updates on the region's response to the ongoing pandemic. Through our analysis of these public health announcements, we aimed to assess Alberta's effectiveness in implementing communication strategies during this intricate public health crisis. Our investigation, conducted through the lenses of sentiment analysis, sought to shed light on the efficacy of communication strategies employed during this challenging period in public health [1, 2].
In this post, I aim to walk you through the process of performing sentiment analysis using R. Specifically, I'll focus on "lexicon-based sentiment analysis," which I'll discuss in more detail in the next section. I'll provide examples of lexicon-based sentiment analysis that we've integrated into the publications referenced earlier. Additionally, in future posts, I'll delve into more advanced forms of sentiment analysis, making use of state-of-the-art pre-trained models accessible on Hugging Face.
Lexicon-Based Sentiment Analysis
As I learned more about sentiment analysis, I discovered that the predominant method for extracting sentiments is lexicon-based sentiment analysis. This approach entails utilizing a specific lexicon, essentially the vocabulary of a language or subject, to discern the direction and intensity of sentiments conveyed within a given text. Some lexicons, like the Bing lexicon [3], classify words as either positive or negative. Conversely, other lexicons provide more detailed sentiment labels, such as the NRC Emotion Lexicon [4], which categorizes words based on both positive and negative sentiments, as well as Plutchik's [5] psych evolutionary theory of basic emotions (e.g., anger, fear, anticipation, trust, surprise, sadness, joy, and disgust).
Lexicon-based sentiment analysis operates by aligning words within a given text with those found in widely used lexicons such as NRC and Bing. Each word receives an assigned sentiment, typically categorized as positive or negative. The text's collective sentiment score is subsequently derived by summing the individual sentiment scores of its constituent words. For instance, in a scenario where a text incorporates 50 positive and 30 negative words according to the Bing lexicon, the resulting sentiment score would be 20. This value indicates a predominance of positive sentiments within the text. Conversely, a negative total would imply a prevalence of negative sentiments.
Performing lexicon-based sentiment analysis using R can be both fun and tricky at the same time. While analyzing public health announcements in terms of sentiments, I found Julia Silge and David Robinson's book, Text Mining with R, to be very helpful. The book has a chapter dedicated to sentiment analysis, where the authors demonstrate how to conduct sentiment analysis using general-purpose lexicons like Bing and NRC. However, Julia and David also highlight a major limitation of lexicon-based sentiment analysis. The analysis considers only single words (i.e., unigrams) and does not consider qualifiers before a word. For instance, negation words like "not" in "not true" are ignored, and sentiment analysis processes them as two separate words, "not" and "true". Furthermore, if a particular word (either positive or negative) is repeatedly used throughout the text, this may skew the results depending on the polarity (positive or negative) of this word. Therefore, the results of lexicon-based sentiment analysis should be interpreted carefully.
Now, let's move to our example, where we will conduct lexicon-based sentiment analysis using Dr. Deena Hinshaw's media briefings during the COVID-19 pandemic. My goal is to showcase two R packages capable of running sentiment analysis