Stars of the 2024 Paris Olympics

Author:Murphy  |  View: 24078  |  Time: 2025-03-23 11:51:03

We just witnessed a wonderful three weeks of sports as the 2024 Olympic Games unfolded in Paris, with millions of people watching the streams and rooting for their favorites. Along the way, household names crushed it again, and the next generation of star athletes emerged.

As a curious data scientist, I started to wonder how the collective attention behind the Games evolves: which sports are the most popular, and which athletes are on the rise? To put data behind these questions, I collected Wikipedia profiles and page view counts, and then compared the popularity of different sports as well as of top athletes. Below, please find my results and all the Python code needed to reproduce them.

All images were created by the author.

Olympic Sports on Wikipedia

First, let's visit the Wikipedia page of the 2024 Summer Olympics and download it using the requests library. Then, I use the BeautifulSoup package to extract information from the HTML, namely the list of all summer sports and the links to their Wikipedia pages. To locate the sports, I rely on a manual observation: the listing starts with Artistic swimming and ends with Wrestling.

Now, let's see how to perform these steps and extract both the name and the Wikipedia URL of each sport at this year's Summer Games:

# libraries needed for the scraping
import requests
from bs4 import BeautifulSoup
import re

# the source urls
url = 'https://en.wikipedia.org/wiki/2024_Summer_Olympics'

# Send a request to the website and get the HTML content
response = requests.get(url)
html_content = response.content

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
# Extract summer Sports
is_sport = False
sports_urls = {}

for res in soup.find_all('a', href=True):

    res_text = str(res)
    if 'Artistic swimming' in res_text:
        is_sport = True

    if is_sport:
        url = 'https://en.wikipedia.org' + res['href']
        print(res.text, ' -- ', url)
        sports_urls[res.text] = url

    if 'Wrestling' in res_text:
        break

This code block prints the name and the Wikipedia URL of each sport's 2024 Olympic Games page, and stores them in the dictionary sports_urls.
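The start-and-stop logic of the loop can be tried out without any network access. The following is a minimal sketch on a hand-made list of links; the helper name slice_between and the toy link list are illustrative assumptions, not part of the original scraper. It also uses urljoin from the standard library to turn the site-relative href values into absolute URLs.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org"

def slice_between(links, first, last):
    """Collect {text: absolute URL} for links from `first` through `last`, inclusive."""
    collected = {}
    inside = False
    for text, href in links:
        if text == first:
            inside = True          # the first marker switches collection on
        if inside:
            collected[text] = urljoin(BASE_URL, href)
        if inside and text == last:
            break                  # the last marker ends the scan
    return collected

# Toy stand-in for the (text, href) pairs found on the page (hypothetical order)
toy_links = [
    ("Paris", "/wiki/Paris"),
    ("Artistic swimming", "/wiki/Artistic_swimming_at_the_2024_Summer_Olympics"),
    ("Judo", "/wiki/Judo_at_the_2024_Summer_Olympics"),
    ("Wrestling", "/wiki/Wrestling_at_the_2024_Summer_Olympics"),
    ("France", "/wiki/France"),
]

sports = slice_between(toy_links, "Artistic swimming", "Wrestling")
for name, link in sports.items():
    print(name, " -- ", link)
```

Links before the first marker and after the last one are ignored, which is the property the manual observation about the listing relies on.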

The Overall Popularity of Different Sports

After we have the complete list of this year's Olympic sports as well as their current Wikipedia profiles, let's figure out their popularity. We can do so by using the mwviews library, which provides a flexible API to download view count information of Wikipedia sites on various time scales and resolutions.

I set the end of the data collection period to the end of the Games, while to be able to collect a proper baseline, I started it two months earlier. Additionally, I used the daily resolution to collect the information.
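As a small illustration of how such a window can be expressed, here is a sketch that builds the two date strings in the YYYYMMDD format. It assumes the window ends on the closing day of the Games (11 August 2024) and starts on the same day of the month two months earlier; the variable names are my own.

```python
from datetime import date

# Assumption: the window ends on the closing day of the Games
end = date(2024, 8, 11)

# Same day of the month, two months earlier
# (fine for August; a window crossing a year boundary would need extra care)
start = end.replace(month=end.month - 2)

start_str = start.strftime("%Y%m%d")
end_str = end.strftime("%Y%m%d")

# Number of daily data points if both endpoints are included
n_days = (end - start).days + 1

print(start_str, end_str, n_days)
```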

# init the API
from mwviews.api import PageviewsClient
import pandas as pd

p = PageviewsClient(user_agent="<your-email@example.com> Sport analysis")  # placeholder contact address, replace with your own
domain = 'en'

# download the data
sports_data = {}
sports_count = {}

for sport, url in sports_urls.items():

    page = url.split('wiki/')[-1]
    data = []
    for a,b in p.article_views(domain + '.wikipedia', [page], granularity='daily', start='20240611', end='20240811').items():
        data.append({'date' : a, 'count' : b[page]})

    df = pd.DataFrame(data)
    print(sport, ' -- ', sum(df['count']))
    df['page'] = page
    sports_data[sport] = df
    sports_count[sport] = sum(df['count'])

This code block downloads the daily view counts of each sport's Wikipedia page and stores them in the dictionary sports_data, while printing the total view count of each sport over this period.
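To make the reshaping step easier to follow, here is a minimal sketch on toy data. It assumes, as the loop above does, that the response maps each date to a dictionary of page title and view count; the numbers are made up, and treating a missing value (None) as zero is my own assumption rather than something the original code does.

```python
from datetime import date

PAGE = "Judo_at_the_2024_Summer_Olympics"

# Toy stand-in for the response shape: date -> {page title: views}
toy_response = {
    date(2024, 7, 27): {PAGE: None},   # a day without data (assumed to appear as None)
    date(2024, 7, 26): {PAGE: 1200},
    date(2024, 7, 28): {PAGE: 3400},
}

def to_records(response, page):
    """Flatten the nested mapping into one record per day, sorted by date."""
    return [
        {"date": day, "count": views.get(page) or 0}
        for day, views in sorted(response.items())
    ]

records = to_records(toy_response, PAGE)
total = sum(record["count"] for record in records)

print(records[0]["date"], total)
```

A list of records like this can be passed straight to pd.DataFrame, which is what the loop above does with its data list.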

Let's visualize this view count information on a bar chart and create an overall popularity comparison of this year's Olympic sports:

import matplotlib.pyplot as plt
import numpy as np

# Sorting the data by values in descending order
sorted_sports_data = dict(sorted(sports_count.items(), key=lambda item: item[1], reverse=True))

# Extracting the keys and values
sports = list(sorted_sports_data.keys())
values = list(sorted_sports_data.values())

# Creating the bar chart
fig, ax = plt.subplots(figsize=(10, 8))
colors = plt.cm.Set1(np.linspace(0, 1, len(sports)))  # Using a colormap appropriate for Olympics colors

bars = ax.barh(sports, values, color=colors)
ax.set_xlabel('Total Wikipedia page views')
ax.set_title('Wikipedia Popularity of Olympic Sports')

# Inverting the y-axis to show the highest value on top
ax.invert_yaxis()

plt.show()

Sports' Popularity Time Series

While the overall comparison shows clear winners, such as football, athletics, and basketball, the temporal evolution of these sports can be just as interesting and revealing for fans. So let's take advantage of the daily resolution of our data, and visualize the time series of each sport:

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns

# Use seaborn style for better aesthetics
sns.set(style='whitegrid')

# Create a larger figure for better readability
f, ax = plt.subplots(1, 1, figsize=(12, 8))

# Use a colormap for the Olympic Games
olympic_colors = sns.color_palette("Set3", n_colors=len(sports_data))

# Plot each sport with a corresponding color from the colormap
for (sport, data), color in zip(sports_data.items(), olympic_colors):
    ax.plot(data['date'], data['count'], label=sport, color=color)

# Adjust the legend to avoid overlapping with the plot
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=2)

# Set plot titles and labels
ax.set_title('Daily Wikipedia Page Views of Olympic Sports', fontsize=16)
ax.set_xlabel('Date', fontsize=14)
ax.set_ylabel('Daily page views', fontsize=14)

# Rotate x-axis labels for better readability
plt.xticks(rotation=45)

# Show the plot
plt.tight_layout()
plt.show()

This graph shows very interesting patterns that seem to hold for each sport. As soon as a sport kicked off, its popularity spiked very quickly; however, after a brief saturation, Wikipedia's interest faded as soon as the competitions were over. Let's have a closer look at the top 10.

df_sport_ranking = pd.DataFrame(sports_count.items(), columns = ['sport', 'Wiki view count']).sort_values(by = 'Wiki view count', ascending = False)
top10 = set(df_sport_ranking.head(10).sport)

sns.set(style='whitegrid')

f, ax = plt.subplots(1, 1, figsize=(12, 8))

# Use a colormap for the Olympic Games
olympic_colors = sns.color_palette("Set3", n_colors=len(sports_data))

# Plot each sport with a corresponding color from the colormap
for (sport, data), color in zip(sports_data.items(), olympic_colors):
    if sport in top10:
        ax.plot(data['date'], data['count'], label=sport, color=color)

# Adjust the legend to avoid overlapping with the plot
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=2)

# Set plot titles and labels
ax.set_title('Daily Wikipedia Page Views of the Top 10 Olympic Sports', fontsize=16)
ax.set_xlabel('Date', fontsize=14)
ax.set_ylabel('Daily page views', fontsize=14)

# Rotate x-axis labels for better readability
plt.xticks(rotation=45)

# Show the plot
plt.tight_layout()
plt.show()
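The spike-and-fade pattern seen in the time series can also be put into numbers. The sketch below is one possible way to do so, not part of the original analysis: it compares the peak day of a series with the median of the days before the opening ceremony on 26 July 2024. The function name spike_summary and the toy view counts are assumptions for illustration.

```python
from datetime import date
from statistics import median

GAMES_START = date(2024, 7, 26)  # opening ceremony

# Toy daily series of (date, views); the numbers are made up
toy_series = [
    (date(2024, 7, 23), 100),
    (date(2024, 7, 24), 120),
    (date(2024, 7, 25), 110),
    (date(2024, 7, 26), 400),
    (date(2024, 7, 27), 2200),
    (date(2024, 7, 28), 900),
]

def spike_summary(series, games_start):
    """Return the peak day and the ratio of the peak to the pre-Games median."""
    baseline = median(count for day, count in series if day < games_start)
    peak_day, peak_count = max(series, key=lambda item: item[1])
    return peak_day, peak_count / baseline

peak_day, ratio = spike_summary(toy_series, GAMES_START)
print(peak_day, ratio)
```

With the real data, the same function could be applied to the date and count columns of each DataFrame in sports_data.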

Medal Winners

Moving on with the popularity and success analysis of the Paris 2024 Games, let's get the information on medal winners. Luckily, Wikipedia has a summary page on that as well, which I will collect and process in the following code block. The code block will return the lists of athletes who won gold, silver, and bronze medals.

medal_url = 'https://en.wikipedia.org/wiki/List_of_2024_Summer_Olympics_medal_winners'

response = requests.get(medal_url)
html_content = response.content
soup = BeautifulSoup(html_content, 'html.parser')

def get_url(text):
    # collect the links of a table cell, skipping any link whose markup contains '2024'
    soup_text = BeautifulSoup(str(text), 'html.parser')
    athlete_links = soup_text.find_all('a', href=True)
    athlete_links = [a for a in athlete_links if '2024' not in str(a)]
    return athlete_links

def contains_numbers(string):
    # True if the string contains at least one digit
    return bool(re.search(r'\d', string))

def add_medalists(medal_list, medal_html):
    # append (name, Wikipedia URL) pairs of the athletes linked in a table cell
    for athlete_link in get_url(medal_html):
        medal_list.append((athlete_link.text, 'https://en.wikipedia.org' + athlete_link['href']))

# Find all the tables with medal winners, reusing the page parsed above
tables = soup.find_all('table', class_='wikitable')

# Initialize lists to hold the names and URLs of the medal winners
golds = []
silvers = []
bronzes = []

# Iterate over each table and collect the medal winners row by row
for table in tables:

    rows = table.find_all('tr')
    for row in rows:
        cells = row.find_all('td')

        # rows with four cells: event, gold, silver, bronze
        if len(cells)==4:
            event, gold, silver, bronze = cells
            add_medalists(golds, gold)
            add_medalists(silvers, silver)
            add_medalists(bronzes, bronze)

        # rows with seven cells: the medalists sit in every second cell
        if len(cells)==7:
            gold, silver, bronze = list(cells)[1::2]
            add_medalists(golds, gold)
            add_medalists(silvers, silver)
            add_medalists(bronzes, bronze)

        # rows with six cells are not parsed, only printed for manual inspection
        if len(cells)==6:
            print([c.text for c in cells])
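The row handling above depends on the number of cells in a table row. Here is a minimal sketch of that dispatch on plain Python lists, so the slicing can be checked in isolation. The helper name split_medal_cells is hypothetical, and what the skipped cells of a seven-cell row contain is not stated in this article, so they are simply labelled 'other' in the toy rows.

```python
def split_medal_cells(cells):
    """Return (gold, silver, bronze) cells for a supported row layout, else None."""
    if len(cells) == 4:
        # event, gold, silver, bronze
        _event, gold, silver, bronze = cells
        return gold, silver, bronze
    if len(cells) == 7:
        # medalists are taken from positions 1, 3 and 5
        gold, silver, bronze = cells[1::2]
        return gold, silver, bronze
    return None  # any other layout is left for manual inspection

# Toy rows; real rows would hold parsed table cells instead of strings
row_of_four = ["event", "G", "S", "B"]
row_of_seven = ["event", "G", "other", "S", "other", "B", "other"]
row_of_six = ["a", "b", "c", "d", "e", "f"]

print(split_medal_cells(row_of_four))
print(split_medal_cells(row_of_seven))
print(split_medal_cells(row_of_six))
```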

The resulting number of medals, and a quick example for validation:

# Total number of medal-winners
print('Number of gold medalists: ', len(golds))
print('Number of silver medalists: ', len(silvers))
print('Number of bronze medalists: ', len(bronzes))
print()

# Double-checking the 3 golds and 1 silver of Biles
print('Golds:')
for (name, link) in golds:
    if 'Simone B' in name:
        print(link)

print('Silvers:')
for (name, link) in silvers:
    if 'Simone B' in name:
        print(link)

The output:

Medal Winners' Popularity

Finally, let's put all of the previous parts together and draw up the popularity time series of medal-winning athletes. First, let's see who were the most popular during the Games.

athletes_links = {}

for athlete, link in golds: athletes_links[athlete] = link
for athlete, link in silvers: athletes_links[athlete] = link
for athlete, link in bronzes: athletes_links[athlete] = link

print()
print('Number of medal winning athletes: ', len(athletes_links))
print()

From this code cell: Number of medal winning athletes: 1959
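Because the three medal lists are merged into a dictionary keyed by name, an athlete with several medals is counted only once. The sketch below illustrates this on toy data; the second athlete and the URLs are placeholders, and the Counter is an optional extra of mine for keeping the medal tally that the dictionary discards.

```python
from collections import Counter

# Toy medal lists of (name, link) pairs; the links are placeholders
toy_golds = [
    ("Simone Biles", "https://en.wikipedia.org/wiki/Simone_Biles"),
    ("Simone Biles", "https://en.wikipedia.org/wiki/Simone_Biles"),
    ("Athlete B", "https://en.wikipedia.org/wiki/Athlete_B"),  # hypothetical athlete
]
toy_silvers = [
    ("Simone Biles", "https://en.wikipedia.org/wiki/Simone_Biles"),
]

# Merging into a dictionary keeps one entry per name
toy_links = {}
for name, link in toy_golds + toy_silvers:
    toy_links[name] = link

# A Counter keeps track of how many medals each name appears with
medal_tally = Counter(name for name, _ in toy_golds + toy_silvers)

print(len(toy_links), medal_tally["Simone Biles"])
```

Note that keying by name would also merge two different athletes who happen to share a name; keying by the link instead would avoid that, at the cost of a less readable key.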

atheletes_data = {}
atheletes_count = {}

for idx, (athlete, url) in enumerate(athletes_links.items()):

    if idx % 100 == 0:
        print(idx)

    try:

        page = url.split('wiki/')[-1]
        data = []
        for a,b in p.article_views(domain + '.wikipedia', [page], granularity='daily', start='20240611', end='20240811').items():
            data.append({'date' : a, 'count' : b[page]})

        df = pd.DataFrame(data)
        #print(athlete, ' -- ', sum(df['count']))
        df['page'] = page
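        # store the daily series and the total view count per athlete
        atheletes_data[athlete] = df
        atheletes_count[athlete] = sum(df['count'])

    except Exception:
        # skip athletes whose page views cannot be retrieved
        pass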

print('Number of medal-winning athletes with measurable Wiki popularity: ', len(atheletes_data))
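The try/except in the loop above skips athletes whose page views cannot be downloaded. If you would like to know which pages were skipped and why, one option is to record the failures instead of discarding them. The following is a sketch with a fake download function; collect, fake_fetch and the page names are all illustrative assumptions.

```python
def collect(pages, fetch):
    """Call fetch for every page, keeping results and failures apart."""
    results, failures = {}, {}
    for page in pages:
        try:
            results[page] = fetch(page)
        except Exception as exc:  # broad on purpose: any failed page is just recorded
            failures[page] = repr(exc)
    return results, failures

def fake_fetch(page):
    """Stand-in for a real download; fails for one hypothetical page."""
    if page == "Missing_Page":
        raise KeyError(page)
    return 42

results, failures = collect(["Some_Athlete", "Missing_Page"], fake_fetch)
print(len(results), "downloaded,", len(failures), "failed")
```

The failures dictionary can then be inspected after the run, for example to retry pages that failed for temporary reasons.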

After collecting the popularity profile of more than 1500 athletes, let's see the top 20:

df_athlete_ranking = pd.DataFrame(atheletes_count.items(), columns = ['athlete', 'Wiki view count']).set_index('athlete').sort_values(by = 'Wiki view count', ascending = False)
df_athlete_ranking.head(20)

Let's zoom in further and plot the top 98, which also nicely highlights the timing of the different events. This plot is more of an illustration, as I removed the axes; however, it offers an interesting overview and allows you to pick your favourite athlete.
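The plotting code for this overview is not included here, so the following is only a sketch of the preparation such a grid of small plots might need: picking the top N athletes by total views and working out the grid dimensions. The function names, the choice of 7 columns and the toy counts are my assumptions.

```python
import heapq
import math

def top_n(counts, n):
    """Return the n (name, total views) pairs with the highest totals."""
    return heapq.nlargest(n, counts.items(), key=lambda item: item[1])

def grid_shape(n_panels, n_cols):
    """Rows and columns needed to fit n_panels small plots."""
    return math.ceil(n_panels / n_cols), n_cols

# Toy totals; with the real data this would be the dictionary of athlete view counts
toy_counts = {"Athlete A": 500, "Athlete B": 1500, "Athlete C": 900}

ranking = top_n(toy_counts, 2)
n_rows, n_cols = grid_shape(98, 7)

print(ranking, n_rows, n_cols)
```

The resulting shape can be handed to plt.subplots(n_rows, n_cols), and calling ax.axis('off') on each panel hides the axes.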

Conclusion

This brief piece aimed to show how to collect collective popularity information from Wikipedia and apply it to compare various entities – in this case, Olympic sports and athletes. While the topic is timely and exciting, these methods can be further applied in mapping knowledge of any sort on Wikipedia, from famous persons to trending scientific fields, and be harnessed from academic research to market research.
