Exploratory Data Analysis: What Do We Know About YouTube Channels (Part 1)

Photo by Glenn Carstens-Peters, Unsplash

Nowadays, there are more than 2.7 billion active YouTube users, and for many people, YouTube is not only entertainment but an important source of income. But how does it work? How many views or subscribers can different YouTube channels get? With the help of Python, Pandas, and the YouTube Data API, we can get some interesting insights.

Methodology

This article will be divided into several parts:

  • Using the YouTube Data API. With this API, we will be able to get a list of YouTube channels for different search requests. For each channel, we will get information about the number of videos, views, and subscribers.
  • Getting the list of channels we are interested in. This can be done only once.
  • Collecting the channel data. To get statistical insights, we need to collect the data for some period of time.
  • Data analysis.

Without further ado, let's get into it.

1. YouTube Data API

First, a piece of good news for everyone who is interested in collecting data from large networks like YouTube: the YouTube API is free, and we don't need to pay for it. To start using this API, we need to complete two steps:

  • Open https://console.cloud.google.com and create a new project. I already had an old project there, but after some period of inactivity, all its API limits were set to zero, and I did not find a way to reset them. So, it's just easier to make a new one.
Google Cloud Console, Image by author
  • Go to "APIs and Services" and enable "YouTube Data API". Open the API, go to "Credentials," and create an API key. If everything is done correctly, the Quotas page will look like this:
YouTube API Quotas, Image by author

That's it; after that, we can start making API requests to get YouTube data. As for limits, the free quota is 10,000 units per day. Calculating this quota is a bit tricky because it is based on the "internal" cost of YouTube queries and not just on the number of API calls. Search requests are "heavy"; for example, getting a list of 500 channels for the phrase "smartphone review" will cost us about 7,000 units. So, we can do only one search like this per day with one API key. But the free tier allows us to have up to 12 projects, each with its own quota. So the task is manageable, but we still need to keep the number of requests reasonably limited.

The data collection pipeline will consist of two types of API calls:

  • First, we will create a list of YouTube channels for different topics. This needs to be done only once.
  • Second, we can get the number of views and subscribers for each channel. I will be using Apache Airflow to run this task for at least a week, twice per day.

2. Getting YouTube Channels

In the first step, we enabled the YouTube API. Now, let's create a list of channels we are interested in. To do the search, I will be using the search_by_keywords method of the python-youtube library. As an example, the output for the query "cats" looks like this:

{
  "kind": "youtube#searchListResponse",
  "etag": "h_RGyvb98m0yrxBgG0Q21J0ch94",
  "nextPageToken": "CAIQAA",
  "regionCode": "UK",
  "pageInfo": {
    "totalResults": 19544,
    "resultsPerPage": 10
  },
  "items": [
    {
      "kind": "youtube#searchResult",
      "etag": "N6_OLAdw4hCq2.....",
      "id": {
        "kind": "youtube#channel",
        "channelId": "UCoV0b7wU....."
      },
      "snippet": {
        "publishedAt": "2016-11-07T04:54:33Z",
        "channelId": "UCoV0b7....",
        "title": "1 stoner 3 cats",
        "description": "MUST BE 18 OR OLDER FOR THIS CHANNEL...",
        "thumbnails": {
          "default": {
            "url": "https://yt3.ggpht.com/ytc/APkrFKZKfv..."
          },
          "medium": {
            "url": "https://yt3.ggpht.com/ytc/APkrFKZKfv..."
          },
          "high": {
            "url": "https://yt3.ggpht.com/ytc/APkrFKZKfvuGIwwg..."
          }
        },
        "channelTitle": "1 stoner 3 cats",
        "liveBroadcastContent": "upcoming",
        "publishTime": "2016-11-07T04:54:33Z"
      }
    },
    ...
  ],
  "prevPageToken": null
}

Here, we are interested in the title, channelId, and publishedAt parameters. We can also see the totalResults value, which is equal to 19544. Alas, the YouTube API was made for end users and not for analytics. We cannot get all YouTube channels for the search query "cats"; this API returns only a list of 400–500 channels, selected somehow by the YouTube recommender system.

We can use a simple program that makes the YouTube query for a specific phrase and saves the result into a CSV file:

import datetime
import logging
from pyyoutube import Api  # pip3 install python-youtube

def save_log(log_filename: str, s_data: str):
    """ Save string to the log file """
    with open(log_filename, "a", encoding="utf-8") as log_out:
        log_out.write(s_data + "\n")

def search_by_keywords(api: Api, search_str: str, page_token: str):
    """ Get YouTube channels list for a search phrase """
    count = 10
    limit = 25000
    parts = ["snippet"]
    res = api.search_by_keywords(q=search_str, limit=limit, count=count,
                                region_code="UK",
                                relevance_language="en",
                                search_type="channel", 
                                order="title",
                                page_token=page_token, parts=parts,
                                return_json=True)
    return res

def get_channels(api: Api, search_str: str):
    """ Get YouTube channels list and save results in CSV file """
    time_str = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
    log_file = f"{search_str.replace(' ', '-')}-{time_str}.csv"
    logging.debug(f"Log file name: {log_file}")
    save_log(log_file, "channelId;publishedAt;title")

    res = search_by_keywords(api, search_str, page_token=None)
    num_items = 0
    while True:
        for item in res['items']:
            title = item['snippet']['title'].replace(";", " ").replace("  ", " ")
            description = item['snippet']['description'].replace(";", " ").replace("  ", " ")
            log_str = f"{item['id']['channelId']};{item['snippet']['publishedAt']};{title} {description}"
            logging.debug(log_str)
            save_log(log_file, log_str)

            num_items += 1

        logging.debug(f"{num_items} items saved to {log_file}")

        # Stop when the last page is reached (no "nextPageToken" in the response)
        next_page_token = res.get("nextPageToken")
        if next_page_token is None:
            break
        res = search_by_keywords(api, search_str, page_token=next_page_token)

if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG,
                        format='[%(asctime)-15s] %(message)s')

    key1 = "XXXXX"   
    youtube_api = Api(api_key=key1)
    get_channels(youtube_api, search_str="cats")

As an output, we will get a CSV like this:

channelId;publishedAt;title
UCoV0b7wUJ2...;2016-11-07T04:54:33Z;1 stoner 3 cats MUST BE ...
UCbm5zxzNPh...;2013-08-07T12:34:48Z;10 Cats ...
UCWflB-GzVa...;2013-09-25T10:39:41Z;13 Cats - Topic ...
UCiNQyjPsO9-c2C7eOGZhYXg;2023-10-09T22:51:37Z;2 CATS NO RULES ...

Now, we can do the search with different queries. This needs to be done only once, since channel IDs do not change. For the purpose of this article, I used these queries:

  • "Cats"
  • "Dogs"
  • "Makeup tutorial"
  • "Photography"
  • "Smartphone review"
  • "Street photography"

As a result, I saved a list of channels (about 500 records for each query) in a CSV file, and I had about 3000 YouTube channels in total.
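Running the search for all queries is then just a loop over get_channels; a minimal sketch, assuming one API key (and thus one Google Cloud project) per query to stay within the daily quota (the key names below are placeholders):

# One hypothetical API key per query, so each search fits into its own daily quota
api_keys = ["KEY_PROJECT_1", "KEY_PROJECT_2", "KEY_PROJECT_3",
            "KEY_PROJECT_4", "KEY_PROJECT_5", "KEY_PROJECT_6"]
queries = ["cats", "dogs", "makeup tutorial", "photography",
           "smartphone review", "street photography"]

for key, query in zip(api_keys, queries):
    youtube_api = Api(api_key=key)
    get_channels(youtube_api, search_str=query)   # saves one CSV per query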

3. Getting Channel Details

As a next step, we need to get statistics for each channel. To do this, I will use the method get_channel_info from the same python-youtube library:

def get_channel_info(api: Api,
                     file_out: str,
                     channel_id: str,
                     channel_title: str) -> int:
    """ Get YouTube channel statistics """
    res = api.get_channel_info(channel_id=channel_id, parts=["statistics"], return_json=True)
    n_count = 0
    if "items" in res:
        time_str = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
        for item in res["items"]:
            ch_id = item["id"]
            statistics = item["statistics"]
            views = statistics["viewCount"]
            subscribers = statistics["subscriberCount"]
            videos = statistics["videoCount"]
            s_out = f"{time_str};{ch_id};{channel_title};{views};{subscribers};{videos}"
            logging.debug(f"Saving: {s_out}")
            save_log(file_out, s_out)
            n_count += 1
    return n_count

The method can be used this way:

api = Api(api_key="...")
get_channel_info(api, "cats_09_24.csv",
                 channel_id="UCbm5zxzNPh...",
                 channel_title="CATS NO RULES Its a Cats Life")

As an output, we will have a CSV file with the needed values:

timestamp;channelId;title;views;subscribers;videos
2023-10-09-19-42-19;UCoV0b7wUJ2...;1 stoner 3 cats MUST BE ...;14;2;6
2023-10-09-19-42-19;UCbm5zxzNPh...;CATS NO RULES Its a Cats Life;24;5;3
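To collect the statistics for all channels at once, we can simply loop over the channel-list CSV created in the previous step. Below is a minimal sketch of such a loop; the collect_stats helper and the file names are mine, not from the original code:

import pandas as pd

def collect_stats(api: Api, channels_csv: str, file_out: str) -> None:
    """ Get statistics for every channel listed in a channel-list CSV """
    save_log(file_out, "timestamp;channelId;title;views;subscribers;videos")
    df = pd.read_csv(channels_csv, delimiter=";")
    for _, row in df.iterrows():
        get_channel_info(api, file_out,
                         channel_id=row["channelId"],
                         channel_title=row["title"])

api = Api(api_key="...")
collect_stats(api, "cats-channels.csv", "cats_09_24.csv")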

Collecting The Data

Now, we know how to get a list of YouTube channels and how to get channel details, like the number of views and subscribers. But it is interesting to see the dynamics and how these values are changing over time. YouTube has a separate Analytics API, which can be used for reports. However, as written in the API documentation, "the user authorizing the request must be the owner of the channel", so for our task, it is useless. The only way for us is to collect data for some time; 1–2 weeks looks like a good period of time.

Collecting the data can be done in different ways, and I decided to use Apache Airflow for that, which I installed on my Raspberry Pi. It turned out that the Raspberry Pi is an excellent Data Science tool for collecting data, which I have already used in several hobby projects. This $50 single-board computer has only 2W power consumption, is silent, has no fans, and runs a full-fledged Ubuntu on a 4-core CPU. The Raspbian OS configuration details are out of the scope of this article; those who are interested are welcome to read my previous TDS post:

Collecting Data with Apache Airflow on a Raspberry Pi
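For completeness, here is a rough sketch of what such an Airflow DAG might look like; the schedule, the file paths, and the collect_stats helper sketched above are my assumptions (assuming Airflow 2.x), and the actual DAG from the post above may differ:

import datetime
from airflow.decorators import dag, task
from pyyoutube import Api

@dag(schedule_interval="0 */12 * * *",      # run twice per day
     start_date=datetime.datetime(2023, 10, 1),
     catchup=False)
def youtube_channels_stats():
    @task
    def collect():
        api = Api(api_key="...")
        time_str = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
        # collect_stats is the hypothetical helper sketched earlier; paths are illustrative
        collect_stats(api,
                      "/home/pi/airflow/data/channels.csv",
                      f"/home/pi/airflow/data/channel-{time_str}.csv")
    collect()

youtube_channels_stats()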

4. Exploratory Data Analysis

Preprocessing

Finally, we are approaching the fun part of this article: let's see what kind of insights we can get from the collected data. I will use Pandas for data processing and Matplotlib and Seaborn for drawing the graphs.

First, let's load the data we collected before. Files can be copied from the Raspberry Pi using the scp command (here, 10.14.24.168 is the device address, and "pi" is the standard Raspbian user name):

scp pi@10.14.24.168:/home/pi/airflow/data/*.csv data

Apache Airflow was executing the code twice per day, saving a separate CSV file with timestamps after each run. After a week, I got a bunch of CSV files with about 80K total records. Let's load all files and combine them together into the Pandas dataframe:

import glob
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.ticker import FuncFormatter  # used below for thousands separators

channel_files = glob.glob("data/channel*.csv")
channels_data = []
for file_in in channel_files:
    channels_data.append(pd.read_csv(file_in, delimiter=';',
                                     parse_dates=['timestamp'],
                                     date_format="%Y-%m-%d-%H-%M-%S"))

df_channels = pd.concat(channels_data)

The result looks like this:

Dataframe with time series data, Image by author

As a reminder, at the beginning of the article, I also collected a list of channels for different search requests ("Smartphones," "Cats," "Dogs," etc.). Let's load this list into a second dataframe:

from typing import List

def load_channels(files: List, subject: str) -> pd.DataFrame:
    """ Load and combine dataframe from several files """
    dataframes = []
    for csv in files:
        df = pd.read_csv(csv, delimiter=";", parse_dates=["publishedAt"])
        df["subject"] = subject
        dataframes.append(df)
    return pd.concat(dataframes).drop_duplicates(subset=["channelId"])

smartphones = load_channels(["smartphone-channels.csv"], subject="Smartphones")
dogs = load_channels(["dogs-channels.csv"], subject="Dogs")
cats = load_channels(["cats-channels.csv"], subject="Cats")
...

channels_all = pd.concat([smartphones, makeup, photography,
                          streetphotography, cats,
                          dogs]).drop_duplicates(subset=["channelId"])

Loading the channel list could be automated, but I have only 6 categories, so it was straightforward to simply hardcode them all. I also added a "subject" column to keep the category name (it is important to mention that the "subject" is not the "official" channel category given by its owner, but the name I used during the search request).

At this moment, we have two Pandas data frames: one contains the basic channel data (id, title, and creation date), and the second has time-series data with the number of views, videos, and subscribers. Let's merge these data frames together, using the channelId as a key:

df_channels = df_channels.merge(
                  channels_all[["channelId", "publishedAt", "subject"]],
                  on=['channelId'],
                  how='left') 

Now, we are ready to have fun! Let's visualize different types of data and draw them with Seaborn and Matplotlib.

4.1 Number of Views and Subscribers

As a warm-up, let's sort YouTube channels by the number of views:

df_channels_ = df_channels.drop_duplicates(subset=["channelId"]).sort_values(by=['views'], ascending=False).copy()
# Format thousands separators on a display copy, so df_channels_ stays numeric for the plots and quantiles below
df_display = df_channels_.copy()
df_display["views"] = df_display["views"].apply(lambda val: f"{val:,.0f}")
df_display["subscribers"] = df_display["subscribers"].apply(lambda val: f"{val:,.0f}")
display(df_display)

The result looks like this:

YouTube channels, sorted by number of views, Image by author

We can see a very large difference between the values. The top channels on the list literally have billions of views and millions of subscribers. The numbers are so large that I had to add thousands separators to the columns!

As a side note, why did I not use a Pandas Styler object for that? Indeed, it is easy to write this code:

display(df_channels_.style.format(thousands="."))

It turned out that this works well on a small dataframe. But at least in Visual Studio Code, after applying the style, the dataframe is no longer displayed as head, tail, and "…"; instead, all 3,030 rows are always shown. If someone knows a solution, please write it in the comments below.

It's nice to see a dataframe, but the result will be much clearer in graphical form. Let's draw the number of views using a bar plot:

decimation = 10
df_channels__ = df_channels_.reset_index(drop=True).iloc[::decimation, :]

sns.set(rc={'figure.figsize': (18, 6)})
sns.set_style("whitegrid")
fig, ax = plt.subplots()
sns.barplot(df_channels__, x=df_channels__.index, y="views", width=0.9, ax=ax)
ax.set(title='YouTube channels views',
       xticks=range(0, df_channels__.shape[0], 50),
       ylim=(0, None),
       xlabel='Channel №',
       ylabel='Views Total')
ax.ticklabel_format(style='plain', axis="y")
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, p: format(int(x), ',')))
sns.despine(left=True)
plt.show()

The drawing is easy, but some small tweaks were required. Again, I used a FuncFormatter to add "," thousands separators; otherwise, the numbers are too large and inconvenient to read. I also added a decimation=10 parameter to reduce the number of records in the dataframe; otherwise, the vertical bars were too small. Still, we can see that the area is almost empty:

Obviously, it is easy to adjust the vertical scale by using the ylim parameter, but I specifically left it like this so readers can see the real difference between the "top" and "other" channels. The distribution is strongly skewed: several top channels have literally billions of views, and compared to them, the others are just not visible. From my list of about 3,000 channels, the top 5% of channels have 95% of the total views.
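This "5% of channels own 95% of views" figure is easy to verify directly from the dataframe; a quick check (a sketch, assuming the numeric views column):

# Share of the total views owned by the top 5% of channels
views = df_channels.drop_duplicates(subset=["channelId"])["views"].sort_values(ascending=False)
top_n = int(len(views) * 0.05)
share = views.iloc[:top_n].sum() / views.sum()
print(f"Top 5% of channels own {share:.1%} of all views")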

We can also draw the number of subscribers, and its shape looks the same as the previous one:
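The plotting code is the same as above, only with the "subscribers" column; a minimal sketch that reuses the decimated dataframe:

# Same bar plot as before, but with subscriber counts (reuses df_channels__)
fig, ax = plt.subplots()
sns.barplot(df_channels__, x=df_channels__.index, y="subscribers", width=0.9, ax=ax)
ax.set(title='YouTube channels subscribers',
       xticks=range(0, df_channels__.shape[0], 50),
       xlabel='Channel №',
       ylabel='Subscribers Total')
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, p: format(int(x), ',')))
sns.despine(left=True)
plt.show()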

Let's get more quantitatively accurate data using percentiles:

display(df_channels_[["views", "subscribers"]].quantile(
    [0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99]
).style.format("{:,.0f}"))

The output looks like this:

Quantiles data, Image by author

The 50th percentile (or 0.5 quantile) is the value below which 50% of all observations lie. For example, the 50th percentile of all subscriber values is only 16. It means that despite the googol-like figures at the top, 50% of the channels in my list have fewer than 16 subscribers! This may be surprising, but we can easily verify it by sorting the dataframe by the number of subscribers and looking at the middle:

df_channels_ = df_channels.drop_duplicates(subset=["channelId"]).sort_values(by=['subscribers'], ascending=False).reset_index(drop=True)
display(df_channels_[df_channels_.shape[0]//2:])

The result confirmed that the table above is correct:

A middle part of the dataframe, Image by author

All these values can give us an idea of the number of views and subscribers we can expect. But here, I analyzed only the 3,030 channels I collected. Can we get the total number of YouTube channels with, let's say, 1M or 100K subscribers? I did not find an answer, and it's probably one of YouTube's secrets, the same as the real ratio between male and female users on Tinder ;) Apparently, the YouTube recommender system has an algorithm for mixing "top" and "other" channels together in search results, giving newbies a chance to be seen by viewers.

4.2 Number of Subscribers per Registration Date

It is interesting to know that a particular YouTube channel has 1,000,000 views or subscribers, but how fast can channel owners reach this value? In the YouTube Data API, every channel has a "publishedAt" parameter, which represents a channel's creation date. We cannot get historical data for a particular channel, but we can compare channels with different creation dates using a scatter plot. I will also separate different categories with different colors and add average lines.

upper_limit = 1_000_000

df_channels_ = df_channels.drop_duplicates(subset=["channelId"]).copy()
df_channels_["subscribers_clipped"] = df_channels_["subscribers"].clip(upper=upper_limit)

sns.set(rc={'figure.figsize': (18, 8)})
sns.set_style("white")
palette = sns.color_palette("bright")

fig, ax = plt.subplots()
# Add scatter plot and average lines
for ind, subj_str in enumerate(df_channels_["subject"].unique()):
    df_subj = df_channels_[df_channels_["subject"] == subj_str]
    # Draw scatter plot
    markers = ["o" , "s" , "p" , "h"]
    sns.scatterplot(data=df_subj, x="publishedAt", y="subscribers_clipped",
                    color=palette[ind],
                    marker=markers[ind % len(markers)],
                    label=subj_str,
                    ax=ax)

    # Draw average
    col_avg = df_subj["subscribers"].mean()
    linestyles = ["--", ":", "-."]
    linestyle = linestyles[ind % len(linestyles)]
    ax.axhline(col_avg, color=palette[ind], label=subj_str + " Avg", linestyle=linestyle, linewidth=1.0, alpha=0.6)

ax.set(title='Channel Subscribers',
       xlabel='Registration Date',
       ylabel='Subscribers',
       ylim=(0, upper_limit)
       )
ax.ticklabel_format(style='plain', axis="y")
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=12))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, p: format(int(x), ',')))
ax.spines['top'].set_color('#EEEEEE')
ax.spines['right'].set_color('#EEEEEE')
ax.spines['bottom'].set_color('#888888')
ax.spines['left'].set_color('#888888')
plt.legend(loc='upper right')
plt.show()

The result is much more informative compared to a previous bar chart:

Number of subscribers distribution, Image by author

1 million subscribers is a sort of "landmark" for many YouTube channels, and I set this value as the clipping limit for the graph. We can see that the "youngest" channel in my list to reach this point was created at the beginning of 2022, so it took its owners almost two years to do it (this analysis was made at the end of 2023). At the same time, there are some "old" channels, created even before 2010, that still have not reached 100,000 subscribers today.

As for the average values, they are also interesting. As we can see, more people are subscribed to "Smartphone"-related channels, and the second most popular category is "Makeup". Let's "zoom in" on the graph a bit more:
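"Zooming in" here simply means re-drawing the same scatter plot with a smaller clipping limit; a compact sketch (the exact limit used for the figure below is my assumption):

# Zoomed-in version of the scatter plot above, clipped to 100K subscribers
zoom_limit = 100_000
fig, ax = plt.subplots()
for ind, subj_str in enumerate(df_channels_["subject"].unique()):
    df_subj = df_channels_[df_channels_["subject"] == subj_str]
    sns.scatterplot(data=df_subj, x="publishedAt",
                    y=df_subj["subscribers"].clip(upper=zoom_limit),
                    color=palette[ind], label=subj_str, ax=ax)
    ax.axhline(df_subj["subscribers"].mean(), color=palette[ind],
               linestyle="--", linewidth=1.0, alpha=0.6)
ax.set(title='Channel Subscribers', xlabel='Registration Date',
       ylabel='Subscribers', ylim=(0, zoom_limit))
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, p: format(int(x), ',')))
plt.legend(loc='upper right')
plt.show()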

Number of subscribers distribution, Image by author

Here, we can see that the "Cats" and "Dogs" categories are, on average, much less popular (almost 10 times less). The "Photography" and "Street photography" categories are even more niche, and even getting 100,000 subscribers can be a challenging goal for these channels.

4.3 Number of Subscribers per Video

This question can be interesting for those who want to start their own YouTube channel. How many videos should be published to get a certain number of views or subscribers? We know the number of videos and subscribers per channel and can find the answer by using a scatter plot. I will also use a linear regression model to draw average lines:

from sklearn.linear_model import LinearRegression
import numpy as np

df_channels_ = df_channels.drop_duplicates(subset=["channelId"]).copy()

upper_limit = 100_000
right_limit = 1000

sns.set(rc={'figure.figsize': (18, 8)})
sns.set_style("white")
num_subjects = df_channels_["subject"].nunique()
palette = sns.color_palette("bright")
fig, ax = plt.subplots()
for ind, subj_str in enumerate(df_channels_["subject"].unique()):
    # Filter by subject
    df_subj = df_channels_[df_channels_["subject"] == subj_str].sort_values(by=['subscribers'], ascending=False)
    # Draw scatter plot
    markers = ["o" , "s" , "p" , "h"]
    sns.scatterplot(data=df_subj, x="videos", y="subscribers",
                    color=palette[ind],
                    marker=markers[ind % len(markers)],
                    label=subj_str,
                    ax=ax)

    # Make linear interpolation
    df_subj = df_subj[10:]   # Optional: remove top channels to exclude "outliers"
    values_x = df_subj["videos"].to_numpy().reshape((-1, 1))
    values_y = df_subj["subscribers"].to_numpy()
    model = LinearRegression().fit(values_x, values_y)
    x_val = np.array([0, right_limit])
    y_val = model.predict(x_val.reshape((-1, 1)))    
    # Draw
    linestyles = ["--", ":", "-."]
    ax.axline((x_val[0], y_val[0]), (x_val[1], y_val[1]),
              linestyle=linestyles[ind % 3], linewidth=1,
              color=palette[ind], alpha=0.5,
              label=subj_str + " Avg")

ax.set(title='YouTube Subscribers',
       xlabel='Videos In Channel',
       ylabel='Subscribers',
       xlim=(0, right_limit),
       ylim=(0, upper_limit)
       )
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, p: format(int(x), ',')))
ax.spines['top'].set_color('#EEEEEE')
ax.spines['right'].set_color('#EEEEEE')
ax.spines['bottom'].set_color('#888888')
ax.spines['left'].set_color('#888888')
plt.legend(loc='upper right')
plt.show()

Here, I limited the values to 100,000 subscribers and 1,000 videos. I also excluded the top 10 channels from the linear interpolation to make the average results more realistic.

The output looks like this:

Number of subscribers within 0–100K range, Image by author

Again, we can see that the "Makeup" and "Smartphones" channels get the highest number of subscribers per video. The average lines for "Cats" and "Dogs" are almost horizontal. How can this be? First, as we saw in the previous picture, the average number of subscribers for these categories is generally lower. Second, I can guess that more people are publishing videos with cats and dogs, and the distribution is more skewed.
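One quick way to check this guess is to compare the skewness of the subscriber distribution in each category; this check is mine, not part of the original analysis:

# Skewness of subscriber counts per category: higher values mean a heavier "top" tail
df_unique = df_channels.drop_duplicates(subset=["channelId"])
print(df_unique.groupby("subject")["subscribers"].skew().sort_values(ascending=False))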

How about the top of the distribution? Well, there are quite a few channels with >1M subscribers and fewer than 1,000 videos:

Number of subscribers within 0–10M range, Image by author

I suppose these are professional studios with high-end cinematic equipment and pretty high budgets. And how about the lower part of the distribution? Let's see another graph:

Number of subscribers within the lowest range, Image by author

I was surprised to see YouTube channels with 1,000–5,000 videos and only 10–50 subscribers. It turned out that many of these channels are probably auto-generated by bots; they have only playlists and no videos, mostly no views, and no subscribers at all. What is the purpose of these channels? I don't know. Some other channels belong to real people, and it is a bit sad to see someone who has posted >1,000 videos, each of which gets only 10–20 views per year.

4.4 Channel Dynamics – Views per Day

As we know, using the public YouTube API, we can only get the number of views and subscribers at the current moment, and only the owner can get the historical data. As a workaround, I collected data for a week with a Raspberry Pi and Apache Airflow. Now, it's time to see what we can get.

The processing in this case is a bit more tricky. I need to get every channel, sort its data by the timestamp, and calculate differential values:

channels_data = []

channel_id = ...
df_channel_data = df_channels[df_channels["channelId"] == channel_id][["timestamp", "views", "subscribers", "videos"]].sort_values(by=['timestamp'], ascending=True).reset_index(drop=True).copy()
# Subtract the first row, so every value becomes "growth since the first measurement"
df_first_row = df_channel_data.iloc[[0]].values[0]
df_channel_data = df_channel_data.apply(lambda row: row - df_first_row, axis=1)
df_channel_data["channelId"] = channel_id
# The "timestamp" column now holds timedeltas; convert them to days
df_channel_data["days_diff"] = df_channel_data["timestamp"].map(lambda x: x.total_seconds()/(24*60*60), na_action=None)
df_channel_data["subject"] = subj_str   # the channel's search category
channels_data.append(df_channel_data)

Here, I use the apply method to calculate the difference between the first and other values in the dataframe. Then, I can draw the data with a lineplot:

sns.lineplot(data=pd.concat(channels_data),
             x="days_diff", y="views",
             hue="channelId", palette=palette, linestyle=linestyle,
             legend=False)

(the full code is longer; for clarity reasons, I keep only the essential parts)

As we already know, the distribution is skewed. The result for the top 50 channels looks like this:

Top 50 channel views per week, Image by author

As we can see, top channels can get several million views per day!

How are things at the right part of the distribution? In total, I collected 3,030 channels, and this is the same graph for 1,000 of them from the right side:

1000 YouTube channel views per week, Image by author

The results here are much less encouraging. Some channels got 50–100 new views per week, but most of the channels got only 10–20 views in total. The YouTube search output is limited to about 500 items, and I can guess that most YouTube users never scroll past the first 1–2 pages.

4.5 Channel Dynamics – Subscribers per Day

Let's see how the number of subscribers is changing. The code is the same, except that I used a "subscribers" column instead of "views".

The results are interesting. First, let's see the top 50 channels from my list:

New channel subscribers per week, Image by author

As we can see, top channels can get several thousand new subscribers per day! At the right part of the distribution, the results are not so exciting again but still interesting:

New channel subscribers per week, Image by author

One of the channels "suddenly" got 100 subscribers per day, but this value did not grow any further. Maybe the owner paid for promotion, or one of the videos went viral – who knows? Other channels got only 5–10 new subscribers per week.

4.6 Channel Dynamics – Videos per Day

It is also interesting to know how many videos per day are published by different channels. We can easily find the answer using the same code. First, let's see the number of new videos from the top 50 channels:

New videos per day, Image by author

Here are the 1000 channels from the right part of my list:

New videos per day, Image by author

Interestingly, the numbers are not drastically different. But the top channels apparently publish fewer videos, and they definitely prefer quality to quantity. They may make only one video per week, and each video may get >1M views. However, there are some YouTube channels that have 5,000+ videos in total and publish several videos per day. Anyway, none of these channels are at the top, which is interesting to think about.

A "spaghetti graph" can show us a general trend, but it's hard to read values from it. To get more precise data, we can draw a histogram for the top 50 channels:
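To draw such a histogram, each channel's time series can be reduced to a single "new videos per week" number; a minimal sketch (the aggregation details are my assumption, and the original code may differ):

# Estimate "new videos per week" per channel from the collected time series
df_rate = df_channels.sort_values(by=["timestamp"]).groupby("channelId").agg(
    views=("views", "last"),
    videos_diff=("videos", lambda s: s.iloc[-1] - s.iloc[0]),
    days=("timestamp", lambda s: (s.iloc[-1] - s.iloc[0]).total_seconds() / 86400),
).reset_index()
df_rate = df_rate[df_rate["days"] > 0]
df_rate["videos_per_week"] = 7 * df_rate["videos_diff"] / df_rate["days"]

# Keep only the top 50 channels by total views and draw the histogram
df_top50 = df_rate.sort_values(by=["views"], ascending=False).head(50)
sns.histplot(data=df_top50, x="videos_per_week", bins=20)
plt.xlabel("New videos per week")
plt.show()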

New videos per week, Image by author

As we can see, some channels publish more than one video per day, but the majority of the top channels make only one video per week or even fewer. Obviously, there is no universal formula that fits all genres, and videos about cats or smartphone and camera reviews may require very different amounts of preparation time. Readers are welcome to filter channels by category and do a more detailed analysis on their own.

5. Bonus: Anomaly Detection

Finally, a small bonus for readers who were patient enough to read this far. Let's apply an anomaly detection algorithm and see if we can find some unusual YouTube channels. I will be using the unsupervised IsolationForest algorithm for that. The algorithm itself is based on binary decision trees. At every step, the tree branches using a random feature and a random threshold until each point is completely isolated or the maximum depth is reached. After that, an "anomaly score" is assigned to each point according to the depth of the tree required to reach that point.

I will use the number of views and subscribers per video as metrics. I also set the contamination value to 0.05; this is our desired proportion of outliers.

from sklearn.ensemble import IsolationForest

df_channels_ = df_channels.sort_values(by=['videos'], ascending=False).drop_duplicates(subset=["channelId"]).reset_index(drop=True).copy()
df_channels_ = df_channels_[df_channels_["videos"] > 10]
df_channels_["subscribers_per_video"] = df_channels_["subscribers"]/df_channels_["videos"]
df_channels_["views_per_video"] = df_channels_["views"]/df_channels_["videos"]
df_channels_[["subscribers_per_video", "views_per_video"]] = df_channels_[["subscribers_per_video", "views_per_video"]].apply(pd.to_numeric)

X = df_channels_[["subscribers_per_video", "views_per_video"]]
model = IsolationForest(contamination=0.05, random_state=42).fit(X)
df_channels_['anomaly_scores'] = model.decision_function(X)
df_channels_['anomaly'] = model.predict(X)

# Anomaly: Outlier (-1) or an inlier (1)
# Anomaly_scores: The lower the score, the more abnormal is the sample
display(df_channels_.sort_values(by=['anomaly_scores'], ascending=True)[:30])

Let's sort the channels by anomaly score. The result looks like this:

In first place in our "anomaly rating", we see a channel from the "Cats" category, which indeed has a high number of subscribers per video. I watched this channel; I am not a fan of cat videos, but technically it was indeed good. This was also probably the first time I saw a video with 193M views (I must admit that no video about math or machine learning will ever get to this point ;). The second channel in my "rating" was about makeup. I am absolutely not an expert in that area and was going to skip it, but one video still got my attention: the author asked ChatGPT to write a makeup procedure. I had never thought about using AI for makeup, though it is interesting to see how AI affects more and more areas of our lives.

Sometimes it is easy to guess why an item has a high anomaly score, but if the number of features is large, it can be complicated. In such cases, we can use the SHAP library to visualize the results:

import shap

X = df_channels_[["subscribers_per_video", "views_per_video"]]
y_pred = model.predict(X)
explainer = shap.Explainer(model.predict, X)
shap_values = explainer(X)

shap.initjs()

The Explainer uses Shapley values to explain different machine learning models, and it can work with IsolationForest as well. After initialization, we can check different items in our list. Let's examine the first one:

shap.plots.waterfall(shap_values[786])

The result looks like this:

Shapley Explainer results, Image by author

In this case, we can see that both metrics are unusually high. In another example, the views_per_video parameter looks normal, but the subscribers_per_video value is high:

Shapley Explainer results, Image by author

Conclusion

In this article, I explained how to get YouTube channel data using the YouTube Data API and the python-youtube library. This approach allows us to make YouTube search requests for different categories and get interesting statistical insights about YouTube channels.

I suppose every reader of this story has watched at least one YouTube video today or yesterday. According to demandsage.com, YouTube is the second-biggest search engine after Google, with 2.7B active users in 2023. It is a part of our modern society and everyday life. Thus, from cultural and research perspectives, it is interesting to know which categories are the most popular and how many views and subscribers different channels can get. In this article, I used "neutral" categories like "Cats" or "Dogs", but the same approach can be used for collecting data about politics, war, medicine, conspiracy theories, or any other topic. Last but not least, for many content creators, YouTube is an important source of income, and it can be crucial to know how many views or subscribers different categories can get. So, I encourage you, as a reader, to do the same tests on the topics you are interested in. Anyway, statistics is a science about us.

In the second part of this story, I will focus on individual videos. We will see how often different YouTube channels publish videos and how many views these videos can get:

Exploratory Data Analysis: What Do We Know About YouTube Channels (Part 2)

Those who are interested in social data analysis are also welcome to read my other articles.

If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors. If you want to get the full source code for this and my next posts, feel free to visit my Patreon page.

Thanks for reading.

Tags: Data Science Data Visualization Deep Dives Programming Python
