Finding Temporal Patterns in Twitter Posts: Exploratory Data Analysis with Python

In the article "What People Write about Climate" I analyzed Twitter posts using natural language processing, vectorization, and clustering. Using this technique, it is possible to find distinct groups in unstructured text data, for example, to extract messages about ice melting or about electric transport from thousands of tweets about climate. During the processing of that data, another question arose: what if we applied the same algorithm not to the messages themselves but to the times when those messages were published? This would allow us to analyze when and how often different people make posts on social media. It can be important not only from a sociological or psychological perspective but, as we will see later, also for detecting bots or users sending spam. Last but not least, almost everybody uses social platforms nowadays, and it is just interesting to learn something new about ourselves. Obviously, the same algorithm can be applied not only to Twitter posts but to any media platform.
Methodology
I will use mostly the same approach as described in the first part about Twitter data analysis. Our data processing pipeline will consist of several steps:
- Collecting tweets containing a specific hashtag and saving them in a CSV file. This was already done in the previous article, so I will skip the details here.
- Finding the general properties of the collected data.
- Calculating embedding vectors for each user based on the time of their posts.
- Clustering the data using the K-Means algorithm.
- Analyzing the results.
Let's get started.
1. Loading the data
I will be using the Tweepy library to collect Twitter posts. More details can be found in the first part; here I will only publish the source code:
import tweepy

api_key = "YjKdgxk..."
api_key_secret = "Qa6ZnPs0vdp4X...."
auth = tweepy.OAuth2AppHandler(api_key, api_key_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)

hashtag = "#climate"
language = "en"

def text_filter(s_data: str) -> str:
    """ Remove extra characters from the text """
    return s_data.replace("&", "and").replace(";", " ").replace(",", " ") \
                 .replace('"', " ").replace("\n", " ").replace("  ", " ")
def get_hashtags(tweet) -> str:
    """ Parse hashtags from tweet entities """
    hash_tags = ""
    if 'hashtags' in tweet.entities:
        hash_tags = ','.join(map(lambda x: x["text"], tweet.entities['hashtags']))
    return hash_tags
def get_csv_header() -> str:
    """ CSV header """
    return "id;created_at;user_name;user_location;user_followers_count;user_friends_count;retweets_count;favorites_count;retweet_orig_id;retweet_orig_user;hash_tags;full_text"
def tweet_to_csv(tweet):
    """ Convert tweet data to a CSV string """
    if not hasattr(tweet, 'retweeted_status'):
        full_text = text_filter(tweet.full_text)
        hash_tags = get_hashtags(tweet)
        retweet_orig_id = ""
        retweet_orig_user = ""
        favs, retweets = tweet.favorite_count, tweet.retweet_count
    else:
        retweet = tweet.retweeted_status
        retweet_orig_id = retweet.id
        retweet_orig_user = retweet.user.screen_name
        full_text = text_filter(retweet.full_text)
        hash_tags = get_hashtags(retweet)
        favs, retweets = retweet.favorite_count, retweet.retweet_count
    location = text_filter(tweet.user.location or "")  # the location field can be None
    s_out = f"{tweet.id};{tweet.created_at};{tweet.user.screen_name};{location};{tweet.user.followers_count};{tweet.user.friends_count};{retweets};{favs};{retweet_orig_id};{retweet_orig_user};{hash_tags};{full_text}"
    return s_out
if __name__ == "__main__":
    limit = 100  # maximum number of result pages to fetch
    pages = tweepy.Cursor(api.search_tweets, q=hashtag, tweet_mode='extended',
                          result_type="recent",
                          count=100,
                          lang=language).pages(limit)
    with open("tweets.csv", "a", encoding="utf-8") as f_log:
        f_log.write(get_csv_header() + "\n")
        for ind, page in enumerate(pages):
            for tweet in page:
                # Convert the tweet to a single CSV line
                str_line = tweet_to_csv(tweet)
                # Save to CSV
                f_log.write(str_line + "\n")
Using this code, we can get all Twitter posts with a specific hashtag made within the last 7 days. The hashtag is effectively our search query; we can find posts about climate, politics, or any other topic. Optionally, a language code allows us to search for posts in different languages. Readers are welcome to do extra research on their own; for example, it can be interesting to compare the results between English and Spanish tweets.
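For example, collecting Spanish-language tweets only requires changing two variables before running the script (the "#clima" hashtag here is just my guess at a suitable Spanish counterpart):

# Hypothetical variation: collect Spanish-language tweets instead
hashtag = "#clima"  # assumed Spanish counterpart of "#climate"
language = "es"     # ISO 639-1 code for Spanish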
After the CSV file is saved, let's load it into the dataframe, drop the unwanted columns, and see what kind of data we have:
import pandas as pd

df = pd.read_csv("climate.csv", sep=';',
                 dtype={'id': object, 'retweet_orig_id': object, 'full_text': str, 'hash_tags': str},
                 parse_dates=["created_at"],
                 lineterminator='\n')
df.drop(["retweet_orig_id", "user_friends_count", "retweets_count", "favorites_count",
         "user_location", "hash_tags", "retweet_orig_user", "user_followers_count"],
        inplace=True, axis=1)
df = df.drop_duplicates('id')
with pd.option_context('display.max_colwidth', 80):
    display(df)
In the same way as in the first part, I collected Twitter posts with the hashtag "#climate". The result looks like this:

We don't actually need the text or the user id, but they can be useful for "debugging", to see what the original tweet looks like. For further processing, we will need to know the day, time, and hour of each tweet. Let's add these columns to the dataframe:
import datetime

def get_time(dt: datetime.datetime):
    """ Get time from datetime """
    return dt.time()

def get_date(dt: datetime.datetime):
    """ Get date from datetime """
    return dt.date()

def get_hour(dt: datetime.datetime):
    """ Get hour from datetime """
    return dt.hour

df["date"] = df['created_at'].map(get_date)
df["time"] = df['created_at'].map(get_time)
df["hour"] = df['created_at'].map(get_hour)
We can easily verify the results:
display(df[["user_name", "date", "time", "hour"]])

Now we have all the needed information, and we are ready to go.
2. General Insights
As we can see from the last screenshot, 199,278 messages were loaded; those are messages with the "#Climate" hashtag, which I collected within several weeks. As a warm-up, let's answer a simple question: how many messages per day about climate were people posting on average?
First, let's calculate the total number of days and the total number of users:
days_total = df['date'].unique().shape[0]
print(days_total)
# > 46
users_total = df['user_name'].unique().shape[0]
print(users_total)
# > 79985
As we can see, the data was collected over 46 days, and in total, 79,985 Twitter users posted (or reposted) at least one message with the hashtag "#Climate" during that time. Obviously, we can only count users who made at least one post; alas, we cannot get the number of readers this way.
Let's find the number of messages per day for each user. First, let's group the dataframe by user name:
gr_messages_per_user = df.groupby(['user_name'], as_index=False).size().sort_values(by=['size'], ascending=False)
gr_messages_per_user["size_per_day"] = gr_messages_per_user['size'].div(days_total)
The "size" column gives us the number of messages every user sent. I also added the "size_per_day" column, which is easy to calculate by dividing the total number of messages by the total number of days. The result looks like this:

We can see that the most active users post up to 18 messages per day, while the least active users posted only 1 message within this 46-day period (1/46 ≈ 0.0217). Let's draw a histogram using NumPy and Bokeh:
import numpy as np
from bokeh.io import show, output_notebook, export_png
from bokeh.plotting import figure, output_file
from bokeh.models import ColumnDataSource, LabelSet, Whisker
from bokeh.transform import factor_cmap, factor_mark, cumsum
from bokeh.palettes import *
output_notebook()
users = gr_messages_per_user['user_name']
amount = gr_messages_per_user['size_per_day']
hist_e, edges_e = np.histogram(amount, density=False, bins=100)
# Draw
p = figure(width=1600, height=500, title="Messages per day distribution")
p.quad(top=hist_e, bottom=0, left=edges_e[:-1], right=edges_e[1:], line_color="darkblue")
p.x_range.start = 0
# p.x_range.end = 150000
p.y_range.start = 0
p.xaxis[0].ticker.desired_num_ticks = 20
p.left[0].formatter.use_scientific = False
p.below[0].formatter.use_scientific = False
p.xaxis.axis_label = "Messages per day, avg"
p.yaxis.axis_label = "Amount of users"
show(p)
The output looks like this:

Interestingly, we can see only one bar. Of all 79,985 users who posted messages with the "#Climate" hashtag, almost all of them (77,275 users) sent, on average, less than one message per day. It looks surprising at first glance, but actually, how often do we post tweets about the climate? Honestly, I have never done it in my life. We need to zoom in a lot to see the other bars on the histogram:

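For reference, the zoomed view can be produced by re-drawing the same histogram with restricted axis ranges; the exact limits below are my own picks and may need adjusting for other datasets:

# Re-draw the same histogram, zoomed in (axis limits are arbitrary choices)
p = figure(width=1600, height=500, title="Messages per day distribution (zoomed)")
p.quad(top=hist_e, bottom=0, left=edges_e[:-1], right=edges_e[1:], line_color="darkblue")
p.x_range.start, p.x_range.end = 1, 20
p.y_range.start, p.y_range.end = 0, 100
show(p)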
Only at this zoom level can we see that among all 79,985 Twitter users who posted something about "#Climate", there are fewer than 100 "activists" posting messages every day! Ok, maybe "climate" is not something people make posts about daily, but is it the same with other topics? I created a helper function that returns the percentage of "active" users who posted at least N messages per day:
def get_active_users_percent(df_in: pd.DataFrame, messages_per_day_threshold: int):
    """ Get percentage of active users with a messages-per-day threshold """
    days_total = df_in['date'].unique().shape[0]
    users_total = df_in['user_name'].unique().shape[0]
    gr_messages_per_user = df_in.groupby(['user_name'], as_index=False).size()
    gr_messages_per_user["size_per_day"] = gr_messages_per_user['size'].div(days_total)
    users_active = gr_messages_per_user[gr_messages_per_user['size_per_day'] >= messages_per_day_threshold].shape[0]
    return 100*users_active/users_total
Then, using the same Tweepy code, I downloaded dataframes for 6 topics from different domains. We can draw the results with Bokeh:
labels = ['#Climate', '#Politics', '#Cats', '#Humour', '#Space', '#War']
counts = [get_active_users_percent(df_climate, messages_per_day_threshold=1),
          get_active_users_percent(df_politics, messages_per_day_threshold=1),
          get_active_users_percent(df_cats, messages_per_day_threshold=1),
          get_active_users_percent(df_humour, messages_per_day_threshold=1),
          get_active_users_percent(df_space, messages_per_day_threshold=1),
          get_active_users_percent(df_war, messages_per_day_threshold=1)]
palette = Spectral6
source = ColumnDataSource(data=dict(labels=labels, counts=counts, color=palette))
p = figure(width=1200, height=400, x_range=labels, y_range=(0, 9),
           title="Percentage of Twitter users posting 1 or more messages per day",
           toolbar_location=None, tools="")
p.vbar(x='labels', top='counts', width=0.9, color='color', source=source)
p.xgrid.grid_line_color = None
p.y_range.start = 0
show(p)
The results are interesting:

The most popular hashtag here is "#Cats". In this group, about 6.6% of users make posts daily. Are their cats just so adorable that they cannot resist the temptation? On the contrary, "#Humour" is a popular topic with a large number of messages, but the number of people who post more than one message per day is minimal. On more serious topics like "#War" or "#Politics", about 1.5% of users make posts daily. And, surprisingly, many more people make daily posts about "#Space" than about "#Humour".
To clarify these figures in more detail, let's find the distribution of the number of messages per user; it is not directly related to message time, but it is still interesting to find the answer:
def get_cumulative_percents_distribution(df_in: pd.DataFrame):
    """ Get a distribution of the total percentage of messages sent by a percentage of users """
    # Group dataframe by user name and sort by the number of messages
    df_messages_per_user = df_in.groupby(['user_name'], as_index=False).size().sort_values(by=['size'], ascending=False)
    users_total = df_messages_per_user.shape[0]
    messages_total = df_messages_per_user["size"].sum()
    # Get the cumulative messages/users ratio
    messages = []
    percentage = np.arange(0, 100, 0.05)
    for perc in percentage:
        msg_count = df_messages_per_user[:int(perc*users_total/100)]["size"].sum()
        messages.append(100*msg_count/messages_total)
    return percentage, messages
This method calculates the total number of messages posted by the most active users. The absolute numbers can vary strongly between topics, so the function returns both values as percentages. With this function, we can compare results for different hashtags:
# Calculate
percentage, messages1 = get_cumulative_percents_distribution(df_climate)
_, messages2 = get_cumulative_percents_distribution(df_politics)
_, messages3 = get_cumulative_percents_distribution(df_cats)
_, messages4 = get_cumulative_percents_distribution(df_humour)
_, messages5 = get_cumulative_percents_distribution(df_space)
_, messages6 = get_cumulative_percents_distribution(df_war)
labels = ['#Climate', '#Politics', '#Cats', '#Humour', '#Space', '#War']
messages = [messages1, messages2, messages3, messages4, messages5, messages6]
# Draw
palette = Spectral6
p = figure(width=1200, height=400,
           title="Twitter messages per user percentage ratio",
           x_axis_label='Percentage of users',
           y_axis_label='Percentage of messages')
for ind in range(6):
    p.line(percentage, messages[ind], line_width=2, color=palette[ind], legend_label=labels[ind])
p.x_range.end = 100
p.y_range.start = 0
p.y_range.end = 100
p.xaxis.ticker.desired_num_ticks = 10
p.legend.location = 'bottom_right'
p.toolbar_location = None
show(p)
Because both axes are "normalized" to 0..100%, it is easy to compare results for different topics:

Again, the result looks interesting. We can see that the distribution is strongly skewed: 10% of the most active users post 50–60% of all messages (spoiler alert: as we will see soon, not all of them are humans ;).
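As a quick check of the 50–60% figure, we can read the value straight from the arrays the function above returns; with a 0.05% step, the 10% mark sits at index int(10 / 0.05) = 200 (a sketch using the df_climate frame from earlier):

# Share of all messages posted by the top 10% most active users
percentage, messages = get_cumulative_percents_distribution(df_climate)
idx_10 = int(10 / 0.05)  # position of the 10% point with a 0.05% step
print(f"Top 10% of users posted {messages[idx_10]:.1f}% of all #Climate messages")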
This graph was made by a function that is only about 20 lines of code. This "analysis" is pretty simple, but many additional questions can arise. There is a distinct difference between topics, and finding out why is obviously not straightforward. Which topics have the largest number of active users? Are there cultural or regional differences; is the curve the same in different countries, like the US, Russia, or Japan? I encourage readers to do some tests on their own.
Now that we've got some basic insights, it's time to do something more challenging. Let's cluster all users and try to find some common patterns. To do this, first, we will need to convert the user's data into embedding vectors.
3. Making User Embeddings
An embedding vector is a list of numbers that represents the data for each user. In the previous article, I obtained embedding vectors from tweet words and sentences. Now, because I want to find patterns in the "temporal" domain, I will calculate embeddings based on the message time. But first, let's find out what the data looks like.
As a reminder, we have a dataframe with all tweets, collected for a specific hashtag. Each tweet has a user name, creation date, time, and hour:

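Before building the real embeddings, here is a minimal sketch of the idea: a per-user vector could be, for example, a normalized 24-bin histogram of posting hours. This simplified form is only my illustration of the concept, not necessarily the exact representation we will end up with:

def user_hour_histogram(df_in: pd.DataFrame, user_name: str) -> np.ndarray:
    """ Sketch: a 24-dimensional vector of a user's posting-hour frequencies """
    counts = df_in[df_in["user_name"] == user_name].groupby('hour').size()
    vector = counts.reindex(range(24), fill_value=0).to_numpy(dtype=float)
    return vector / vector.sum()  # normalize the bins (assumes at least one message)

With such vectors, users with similar daily rhythms would already end up close to each other in the 24-dimensional space.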
Let's create a helper function to show all tweet times for a specific user:
def draw_user_timeline(df_in: pd.DataFrame, user_name: str):
    """ Draw cumulative messages time for a specific user """
    df_u = df_in[df_in["user_name"] == user_name]
    days_total = df_u['date'].unique().shape[0]
    # Group messages by time of the day
    messages_per_day = df_u.groupby(['time'], as_index=False).size()
    msg_time = messages_per_day['time']
    msg_count = messages_per_day['size']
    # Draw
    p = figure(x_axis_type='datetime', width=1600, height=150,
               title=f"Cumulative tweets timeline during {days_total} days: {user_name}")
    p.vbar(x=msg_time, top=msg_count, width=datetime.timedelta(seconds=30), line_color='black')
    p.xaxis[0].ticker.desired_num_ticks = 30
    p.xgrid.grid_line_color = None
    p.toolbar_location = None
    p.x_range.start = datetime.time(0, 0, 0)
    p.x_range.end = datetime.time(23, 59, 0)
    p.y_range.start = 0
    p.y_range.end = 1
    show(p)
draw_user_timeline(df, user_name="UserNameHere")
...
The result looks like this:

Here we can see messages made by some users within several weeks, displayed on a 00–24h timeline. We may already see some patterns here, but as it turned out, there is one problem: the Twitter API does not return a time zone. There is a "timezone" field in the message body, but it is always empty. Maybe when we see tweets in the browser, they are shown in our local time, and the original timezone is just not important; or maybe it is a limitation of the free account. Anyway, we cannot cluster the data properly if one user from the US starts sending messages at 2 AM UTC and another user from India starts sending messages at 1 PM UTC; the two timelines simply will not match.
As a workaround, I tried to "estimate" the timezone myself by using a simple empirical rule: most people sleep at night, and they are highly unlikely to post tweets at that time.
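A minimal sketch of how such an estimate could look, assuming that the quietest 6-hour window in a user's activity is the local night and anchoring it at midnight; the helper name, the window length, and the anchor are all my assumptions:

def estimate_utc_shift(df_in: pd.DataFrame, user_name: str) -> int:
    """ Sketch: guess a user's UTC offset from their quietest hours """
    counts = (df_in[df_in["user_name"] == user_name]
              .groupby('hour').size()
              .reindex(range(24), fill_value=0).to_numpy())
    # Total activity in every 6-hour window, wrapping around midnight
    window_sums = [counts[[(h + i) % 24 for i in range(6)]].sum() for h in range(24)]
    quietest_start = int(np.argmin(window_sums))
    # Assume the quiet window is roughly 00:00-06:00 local time
    return (-quietest_start) % 24  # hours to add to UTC to get the "local" hour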