EDA with Polars: Step-by-Step Guide to Aggregate and Analytic Functions (Part 2)

Photo by Spencer Davis on Unsplash

Introduction

In the first part of this series we covered the basics of Polars and compared its functionality and syntax with Pandas. This part takes the complexity of our queries a notch further, so we'll see how to perform some fairly complex aggregations, rolling statistics, and more. If you're not familiar with Polars or feel like you need a refresher, make sure to check out the previous part. Otherwise, let's continue exploring Polars!

Setup

As in the last part, make sure to clone/pull this GitHub repo since it has all the code you need for this post. In particular, we'll be covering this notebook, so make sure to get it if you want to follow along.

Data used in this project can be downloaded from Kaggle (CC0: Public Domain). I assume that you already have Polars installed, so just make sure to update it to the latest version using pip install -U polars.
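The code in this post also assumes a few standard imports. A minimal set (my assumption of what the notebook pulls in) looks like this:

import json
from typing import Dict

import polars as pl
import seaborn as sns
import matplotlib.pyplot as plt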

Data Processing

Reading Data

Similar to the last post, let's read in the UK trending dataset and the mappings for category_id columns.

csv_path  = './youtube/GBvideos.csv'
json_path = './youtube/US_category_id.json'

df = pl.read_csv(csv_path)

with open(json_path, 'r') as f:
    categories = json.load(f)

id_to_category = {}
for c in categories['items']:
    id_to_category[int(c['id'])] = c['snippet']['title']

Cleaning Data

Next, let's parse the dates and map the category IDs to category names. To make it more production-ready, I'm going to put the date-parsing code into a generalisable function.

def parse_dates(df: pl.DataFrame, date_cols: Dict[str, str]) -> pl.DataFrame:
    expressions = []
    for date_col, format in date_cols.items():
        expressions.append(pl.col(date_col).str.to_date(format=format))

    df = df.with_columns(expressions)
    return df

# Column name with expected date format
date_column_format = {
    "trending_date": '%y.%d.%m',
    "publish_time": '%Y-%m-%dT%H:%M:%S%.fZ'
}

df = parse_dates(df, date_column_format).with_columns(
    pl.col("category_id").map_dict(id_to_category)
)

Notice that because the function returns a DataFrame, we can chain the category mapping onto it to create an elegant (almost) one-liner for the cleaning code.
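Since the function both takes and returns a DataFrame, you could also keep the whole cleaning step in a single chain using .pipe() . This is just a sketch of an alternative style, not the code used in the notebook:

# Illustration only: an equivalent, fully chained version using .pipe()
df_clean = (
    pl.read_csv(csv_path)
    .pipe(parse_dates, date_column_format)
    .with_columns(pl.col("category_id").map_dict(id_to_category))
)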

Feature Engineering

Frequently, you'll need to create new features derived from the ones you already have. For example, in this YouTube dataset we can calculate the number of days it took a video to get into the Trending tab, since we have publish_time and trending_date . Additionally, we can calculate different ratios between likes, dislikes, comments, and views.

df = df.with_columns(
    time_to_trending=pl.col("trending_date") - pl.col("publish_time").dt.date(),
    likes_to_dislike_ratio=pl.col("likes") / pl.col("dislikes"),
    likes_to_views_ratio=pl.col("likes") / pl.col("views"),
    comments_to_views_ratio=pl.col("comment_count") / pl.col("views"),
)

# Sense check 2 features
print(df[["trending_date", "publish_time", "time_to_trending"]].sample(2))
print(df[["likes", "dislikes", "likes_to_dislike_ratio"]].sample(2))
Sense checks of derived features. Screenshots by author.

Furthermore, we can extract year, month, and weekday from the trending_date column which can be used in further analysis.

df = df.with_columns(
    trending_weekday = pl.col('trending_date').dt.weekday(),
    trending_month = pl.col('trending_date').dt.month(),
    trending_year = pl.col("trending_date").dt.year()
)

Finally, the dataset is pre-processed and ready for further analysis. Let's start by exploring the time it takes to get into trending in more depth.

Time to Trending Analysis

Since a video can appear in the trending dataset multiple times (i.e. on different dates), we first need to create a data frame with one video per row and the correct time it took to get into trending. For example, consider this video that appears in the dataset 3+ times.

Screenshot by author.

The actual time it took to get into trending is 1 day, after which it stayed in trending for a few days. We can use the .groupby() method to group by video_id and compute a few useful values:

  • Time it took to get into trending (minimum value of time_to_trending)
  • Time it was in the Trending tab (maximum value minus minimum value of trending_date)

In addition, we'd like to bring in other information about the video, e.g. channel_title or title , for further analysis. To do this, we can simply pass multiple columns to the .groupby() method.

time_to_trending_df = df.groupby(
    ["video_id", "title", "category_id", "channel_title"]
).agg(
    pl.col("time_to_trending").min().dt.days(),
    pl.col("trending_date").min().dt.date().alias("first_day_in_trending"),
    pl.col("trending_date").max().dt.date().alias("last_day_in_trending"),
    (pl.col("trending_date").max() - pl.col("trending_date").min()).dt.days()
    .alias("days_in_trending"),
)

print(f"Average time to trending is {time_to_trending_df['time_to_trending'].mean()} days")
print(f"Median time to trending is {time_to_trending_df['time_to_trending'].median()} days")

>>> Average time to trending is 36.25735294117647 days
>>> Median time to trending is 2.0 days

Interestingly, the average time to trending is much larger than the median, indicating that there are large outliers in the data. To avoid biasing the further analysis, let's filter out all the videos that took longer than 60 days to get into trending.

time_to_trending_df = time_to_trending_df.filter(pl.col("time_to_trending") <= 60)
print(f"Average time to trending is {time_to_trending_df['time_to_trending'].mean()} days")
print(f"Median time to trending is {time_to_trending_df['time_to_trending'].median()} days")

>>> Average time to trending is 3.7225 days
>>> Median time to trending is 2.0 days

Notice how easy it is to compose and stack quite complicated queries!

Fastest Categories & Channels to Get Into Trending

First things first, let's do this analysis only for the categories that have appeared in trending at least 100 times, since statistics for infrequent categories won't be as reliable. There are many ways of filtering out these categories (e.g. using .value_counts() ), but this time let's use pl.count().over() to introduce a new expression. Using this method, we can count the number of rows over category_id , which adds a new column times_in_trending to time_to_trending_df . You can see that the syntax is quite similar to PySpark and SQL, so it should be familiar to most data professionals.
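To make the window semantics concrete, here's a tiny made-up example (the data is purely illustrative):

# Toy data, purely to illustrate pl.count().over()
toy = pl.DataFrame({
    "category_id": ["Music", "Music", "Comedy"],
    "video_id": ["a", "b", "c"],
})

# Adds a per-row count of rows sharing the same category_id without collapsing
# the frame, like COUNT(*) OVER (PARTITION BY category_id) in SQL
print(toy.with_columns(times_in_trending=pl.count().over("category_id")))
# Music rows get 2, the Comedy row gets 1

Back to the actual analysis: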

fastest_category_to_trending = (
    time_to_trending_df.with_columns(
        # Count over category ID
        times_in_trending=pl.count().over("category_id")
    # Filter infrequent categories
    ).filter(pl.col("times_in_trending") >= 100)
    # Calculate mean time to trending
    .groupby("category_id")
    .agg(pl.col("time_to_trending").mean())
    .sort("time_to_trending")
)

After counting the number of occurrences per category, it's quite trivial to filter out the infrequent categories and calculate the average time to trending per group. While this gets the job done, we'd need to copy-paste this query for the channel_title -level analysis as well, which is not very elegant. Instead, let's create a function with a few parameters to make it more generic.

def avg_frequent(
    df: pl.DataFrame,
    by: str,
    frequency_threshold: int,
    metric: str = "time_to_trending",
) -> pl.DataFrame:
    results = (
        df.with_columns(times_in_trending=pl.count().over(by))
        .filter(pl.col("times_in_trending") >= frequency_threshold)
        .groupby(by)
        .agg(pl.col(metric).mean())
        .sort(metric)
    )

    return results

fastest_category_to_trending = avg_frequent(
    time_to_trending_df, by="category_id", frequency_threshold=100
).head(3)
fastest_channel_to_trending = avg_frequent(
    time_to_trending_df, by="channel_title", frequency_threshold=10
).head(3)

print(fastest_category_to_trending)
print(fastest_channel_to_trending)
Fastest categories and channels to get into trending. Screenshot by author.

Interesting but kind of expected results – the late shows absolutely dominate the Trending tab of YouTube (at least in 2018). Jimmy Fallon gets an express ticket to trending, closely followed by SNL and The Ellen Show.

Categories & Channels That Stay in Trending

Besides getting into the Trending tab quickly, it also matters who stays there the longest. This analysis is similar to the previous one – we want to average some metric (days in trending) by group (e.g. category), but only for frequent group values (e.g. those that appear more than 10 times). Hence, let's re-use the amazing function we've previously created.

longest_trending_categories = avg_frequent(
    time_to_trending_df,
    by="category_id",
    frequency_threshold=100,
    metric="days_in_trending",
).tail(3)  # tail because the result is sorted ascending, so the longest durations come last

longest_trending_channels = avg_frequent(
    time_to_trending_df,
    by="channel_title",
    frequency_threshold=10,
    metric="days_in_trending",
).tail(3)

print(longest_trending_categories)
print(longest_trending_channels)

Interestingly, there's no overlap with the fastest-to-trend categories. So even though it might take some time for a music video to get into trending, it's more likely to stay there for longer. The same goes for movie trailers and other entertainment content.

Trending Categories Over Time

So we know that the late-night shows get into trending the fastest, and that music and entertainment videos stay there the longest. But has it always been the case? To answer this question, we need to create some time-based aggregates. Let's answer three main questions in this section:

  • What is the total number of trending videos per category per month?
  • What is the number of new videos per category per month?
  • How do the categories compare when it comes to views over time?

Total Number of Monthly Trending Videos per Category

First, let's look at the total number of videos per category per month. To get this statistic, we can use the .groupby_dynamic() method, which allows us to group by a date column (specified as index_column ) and any other columns of choice (specified in the by parameter). The grouping frequency is controlled by the every parameter.
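Before running the real query, here's a minimal made-up sketch of how .groupby_dynamic() buckets rows into monthly windows (note that the index column should be sorted):

from datetime import date

# Toy data, purely for illustration
toy = pl.DataFrame({
    "date": [date(2018, 1, 5), date(2018, 1, 20), date(2018, 2, 3)],
    "video_id": ["a", "b", "c"],
}).sort("date")

print(
    toy.groupby_dynamic(index_column="date", every="1mo").agg(
        pl.col("video_id").n_unique().alias("videos_number")
    )
)
# Expected: one row for January (2 unique videos) and one for February (1)

Now for the real query: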

trending_monthly_stats = df.groupby_dynamic(
    index_column="trending_date",  # date column
    every="1mo",  # can also be 1w, 1d, 1h, etc.
    closed="both",  # include both the start and end date
    by="category_id",  # other grouping columns
    include_boundaries=True,  # show the window boundaries
).agg(
    pl.col("video_id").n_unique().alias("videos_number"),
)

print(trending_monthly_stats.sample(3))
Resulting resampled data frame. Screenshot by author.

You can see the resulting DataFrame above. A very nice property of Polars is that we can also output the window boundaries to sense-check the results, as shown in the snippet below. After that, let's do some plotting to visualise the patterns.
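With include_boundaries=True , the window edges are exposed as extra columns (to my knowledge they are named _lower_boundary and _upper_boundary ), so we can inspect them directly:

# Inspect the window boundaries added by include_boundaries=True
print(
    trending_monthly_stats.select(
        ["category_id", "_lower_boundary", "_upper_boundary", "videos_number"]
    ).sample(3)
)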

# top_categories: a list of the most frequent category names (defined earlier in the notebook)
plotting_df = trending_monthly_stats.filter(pl.col("category_id").is_in(top_categories))

sns.lineplot(
    x=plotting_df["trending_date"],
    y=plotting_df["videos_number"],
    hue=plotting_df["category_id"],
    style=plotting_df["category_id"],
    markers=True,
    dashes=False,
    palette='Set2'
)

plt.title("Total Number of Videos in Trending per Category per Month")
Number of videos plot. Generated by author.

From this plot we can see that Music has had the largest share of Trending starting from 2018. This might indicate some strategic shift within YouTube to become the go-to platform for music videos. Entertainment seems to be in gradual decline, together with the People & Blogs and Howto & Style categories.

Number of New Monthly Trending Videos per Category

The query is exactly the same, except that now we need to pass the first date a video appeared in Trending as the index_column . It would be nice to create a function here, but I'll leave this as an exercise for the curious reader.

trending_monthly_stats_unique = (
    time_to_trending_df.sort("first_day_in_trending")
    .groupby_dynamic(
        index_column="first_day_in_trending",
        every="1mo",
        by="category_id",
        include_boundaries=True,
    )
    .agg(pl.col("video_id").n_unique().alias("videos_number"))
)

plotting_df = trending_monthly_stats_unique.filter(pl.col("category_id").is_in(top_categories))
sns.lineplot(
    x=plotting_df["first_day_in_trending"],
    y=plotting_df["videos_number"],
    hue=plotting_df["category_id"],
    style=plotting_df["category_id"],
    markers=True,
    dashes=False,
    palette='Set2'
)

plt.title(" Number of New Trending Videos per Category per Month")
Number of new videos plot. Generated by author.

Here we get an interesting insight – the number of new videos from Entertainment and Music is roughly equal over time. Since Music videos stay in Trending much longer, they are overrepresented in the raw Trending counts, but once the videos are deduplicated this pattern disappears.

Running Average of Views per Category

As the last step of this analysis, let's compare the two most popular categories (Music and Entertainment) by their views over time. We're going to use a 7-day running average to visualise the trends. To calculate this rolling statistic, Polars has a handy method called .groupby_rolling() . Before applying it, though, let's sum up all the views by category_id and trending_date and then sort the DataFrame accordingly. This format is required to correctly calculate the rolling statistics.

views_per_category_date = (
    df.groupby(["category_id", "trending_date"])
    .agg(pl.col("views").sum())
    .sort(["category_id", "trending_date"])
)

Once the DataFrame is ready, we can use the .groupby_rolling() method to create the rolling average by passing 1w as the period argument and an average expression to the .agg() method.
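To see what the rolling window actually does, here's a tiny made-up sketch (each row's window covers the period ending at that row's date, so with the default window closure the means come out as shown):

from datetime import date

# Toy data, purely for illustration
toy = pl.DataFrame({
    "date": [date(2018, 1, 1), date(2018, 1, 2), date(2018, 1, 3)],
    "views": [10, 20, 30],
}).sort("date")

print(
    toy.groupby_rolling(index_column="date", period="2d").agg(
        pl.col("views").mean().alias("rolling_mean")
    )
)
# Each row averages the views in the 2-day window ending on its date: 10.0, 15.0, 25.0

Back to the real data: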

# Calculate rolling average
views_per_category_date_rolling = views_per_category_date.groupby_rolling(
    index_column="trending_date",  # Date column
    by="category_id",  # Grouping column
    period="1w"  # Rolling length
).agg(
    pl.col("views").mean().alias("rolling_weekly_average")
)

# Plotting
plotting_df = views_per_category_date_rolling.filter(pl.col("category_id").is_in(['Music', 'Entertainment']))
sns.lineplot(
    x=plotting_df["trending_date"],
    y=plotting_df["rolling_weekly_average"],
    hue=plotting_df["category_id"],
    style=plotting_df["category_id"],
    markers=True,
    dashes=False,
    palette='Set2'
)

plt.title("7-day Views Average")
Plot generated by author.

According to the 7-day rolling average views, Music completely dominates the Trending tab and starting from February 2018 the gap between these two categories has increased massively.

Conclusion

After finishing this post and following along with the code, you should have a much better understanding of advanced aggregate and analytic functions in Polars. In particular, we've covered:

  • Basics of working with pl.datetime
  • .groupby() aggregations with multiple arguments
  • The use of .over() to create aggregates over a specific group
  • The use of .groupby_dynamic() to generate aggregates over time windows
  • The use of .groupby_rolling() to generate rolling aggregates over a period

Armed with this knowledge, you should be able to perform almost any analytical task at lightning speed.

You might have felt that some of this analysis was quite ad hoc, and you would be right. The next part is going to address exactly this topic – how to structure and create data processing pipelines. So stay tuned!

Not a Medium Member yet?

Join Medium with my referral link – Antons Tocilins-Ruberts
