Build a Better Bar Chart with This Trick

Author:Murphy | View: 21356 | Time: 2025-03-23 12:55:23

Part of an "Age of Congress" scatter plot (all images by the author)

Whenever I need inspiration for effective visualizations, I browse The Economist, the Visual Capitalist, or The Washington Post. During one of these forays, I ran across an interesting infographic – similar to the one shown above – that plotted the age of each member of the US Congress against their generational cohort.

My first impression was that this was a horizontal bar chart, but closer inspection revealed that each bar was composed of multiple markers, making it a scatter plot. Each marker represented one member of Congress.

In this Quick Success Data Science project, we'll recreate this attractive chart using Python, pandas, and seaborn. Along the way, we'll unlock a cornucopia of marker types you may not know exist.

Dataset

Because the United States has Age of Candidacy laws, the birthdays of members of Congress are part of the public record. You can find them in multiple places, including the Biographical Directory of the United States Congress and Wikipedia.

For convenience, I've already compiled a CSV file of the names of the current members of Congress, along with their birthdays, branch of government, and party, and stored it in this Gist.

The Code

The following code was written in JupyterLab and is described cell by cell.

Importing Libraries

from collections import defaultdict  # For counting members by age.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import patches  # For drawing boxes on the plot.
import pandas as pd
import seaborn as sns

Assigning Constants for the Generational Data

We'll annotate the plot so that generational cohorts, such as Baby Boomers and Gen X, are highlighted. The following code calculates the current age spans for each cohort and includes lists for generation names and highlight colors. Because we want to treat these lists as constants, we'll capitalize the names and use an underscore as a prefix.

# Prepare generational data for plotting as boxes on chart:
CURRENT_YEAR = 2023
_GEN_NAMES = ['Silent', 'Boomers', 'Gen X', 'Millennials', 'Gen Z']
_GEN_START_YR = [1928, 1946, 1965, 1981, 1997]
_GEN_END_YR = [1945, 1964, 1980, 1996, 2012]
_GEN_START_AGE = [CURRENT_YEAR - x for x in _GEN_END_YR]
_GEN_END_AGE = [CURRENT_YEAR - x for x in _GEN_START_YR]
_GEN_COLORS = ['lightgray', 'white', 'lightgray', 'white', 'lightgray']
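
As a quick sanity check on the list comprehensions above, the following sketch recomputes the age spans and pairs each cohort name with its range. It redefines the constants so that it runs on its own. The GEN_SPANS name is an illustrative helper and is not used elsewhere in this project.

```python
CURRENT_YEAR = 2023
_GEN_NAMES = ['Silent', 'Boomers', 'Gen X', 'Millennials', 'Gen Z']
_GEN_START_YR = [1928, 1946, 1965, 1981, 1997]
_GEN_END_YR = [1945, 1964, 1980, 1996, 2012]

# The youngest members of a cohort were born in its END year, so the
# cohort's starting (lowest) age is derived from _GEN_END_YR.
_GEN_START_AGE = [CURRENT_YEAR - x for x in _GEN_END_YR]
_GEN_END_AGE = [CURRENT_YEAR - x for x in _GEN_START_YR]

# Hypothetical helper: map each cohort to its (youngest, oldest) ages.
GEN_SPANS = dict(zip(_GEN_NAMES, zip(_GEN_START_AGE, _GEN_END_AGE)))

for name, (youngest, oldest) in GEN_SPANS.items():
    print(f'{name}: {youngest} to {oldest}')
```

With CURRENT_YEAR set to 2023, Baby Boomers span ages 59 to 77 and Gen X spans 43 to 58. These are the y-values where the shaded boxes will begin and end.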

Converting Birthdays into Ages

To calculate each member's age, we must first convert a reference date (8/25/2023) and the DataFrame's "Birthday" column to datetime format using pandas' to_datetime() function.

Now that we have compatible, "date aware" formats, we can generate an "Age" column by subtracting the two values, extracting the number of days, and then converting the days to years by dividing by 365.25.

# Load the data:
df = pd.read_csv('https://bit.ly/3EdQrai')

# Assign the current date:
current_date = pd.to_datetime('8/25/2023')

# Convert "Birthday" column to datetime:
df['Birthday'] = pd.to_datetime(df['Birthday'])

# Make a new "Age" column in years:
df['Age'] = ((current_date - df['Birthday']).dt.days) / 365.25
df['Age'] = df['Age'].astype(int)

df.head(3)
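
Dividing by 365.25 is an approximation, and it can land just under a whole number when a birthday falls on the reference date. If exact ages matter, one alternative is to compare calendar fields directly. The sketch below is not the approach used in the rest of this article. It runs on a small made-up DataFrame, and the names and birthdays in it are invented for illustration.

```python
import pandas as pd

ref = pd.Timestamp('2023-08-25')

# Invented sample data, not taken from the congressional CSV file.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Birthday': pd.to_datetime(['1960-08-25', '1960-08-26', '1980-01-01']),
})

# The approximation used in this article: elapsed days / 365.25.
approx_age = ((ref - toy['Birthday']).dt.days / 365.25).astype(int)

# Exact alternative: subtract the years, then subtract one more if the
# birthday has not yet occurred in the reference year.
month = toy['Birthday'].dt.month
day = toy['Birthday'].dt.day
birthday_pending = (month > ref.month) | ((month == ref.month) & (day > ref.day))
exact_age = ref.year - toy['Birthday'].dt.year - birthday_pending.astype(int)

print(pd.DataFrame({'approx': approx_age, 'exact': exact_age}))
```

Person A turns 63 on the reference date, but the 365.25 shortcut reports 62 because 63 calendar years here contain 23,010 days, slightly fewer than 63 x 365.25. The other two rows agree under both methods.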

Counting the Ages of the Members

We'll ultimately want to group the members by party and branch of government. That means we'll need to generate four separate plots. (We'll include the three independents with the Democrats, with whom they caucus.)

Unlike with a simple bar chart, we'll need to know more than just the total number of, say, Republican senators with an age of 57 years. Because we want to plot a separate mark for each member in a specific age category, we need a running total. This way, we can use (count, age) values as the (x, y) coordinates in our scatterplot. So, the first Republican senator with an age of 57 will be assigned a "1" in a count column, the second senator will be assigned a "2," and so on.

To manage this, we'll first set up four DataFrame columns to hold the counts, then make four corresponding dictionaries to record the initial counts. We'll use the collections module's [defaultdict()](https://docs.python.org/3/library/collections.html#defaultdict-objects) container, rather than a standard dictionary, as it will provide a default value for a key that doesn't exist, rather than raising an annoying KeyError.
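
To see the running-total idea in isolation, here is a minimal sketch that applies defaultdict(int) to a plain list of ages. The ages list is made up for illustration.

```python
from collections import defaultdict

ages = [57, 62, 57, 57, 62]  # Invented sample ages.

running = defaultdict(int)  # Missing keys start at 0 instead of raising KeyError.
counts = []
for age in ages:
    running[age] += 1            # Bump the tally for this age.
    counts.append(running[age])  # Record this member's position in the "bar."

print(counts)  # [1, 1, 2, 3, 2]
```

Each (count, age) pair, such as (3, 57) for the third 57-year-old, becomes one marker in the scatter plot.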

Next, we'll iterate through our DataFrame, filtering on the "Branch" and "Party" columns. Each time we update a dictionary, we'll write its new value to the matching count column. This allows us to keep a running count of matching ages.

Note that we use negative values for the Democrat counts, as we want them to plot to the left of a central axis, while Republican values plot to the right.

# Initialize count columns:
df['R count house'] = 0
df['D count house'] = 0
df['R count senate'] = 0
df['D count senate'] = 0

# Create dictionaries with default values of 0:
r_count_h_dict = defaultdict(int)
d_count_h_dict = defaultdict(int)
r_count_s_dict = defaultdict(int)
d_count_s_dict = defaultdict(int)

# Iterate through the DataFrame and update counts:
for index, row in df.iterrows():
    age = row['Age']
    if row['Branch'] == 'House':
        if row['Party'] == 'R':
            r_count_h_dict[age] += 1
            df.at[index, 'R count house'] = r_count_h_dict[age]
        elif row['Party'] == 'D':
            d_count_h_dict[age] -= 1
            df.at[index, 'D count house'] = d_count_h_dict[age]
    elif row['Branch'] == 'Senate':
        if row['Party'] == 'R':
            r_count_s_dict[age] += 1
            df.at[index, 'R count senate'] = r_count_s_dict[age]
        elif row['Party'] == 'D':
            d_count_s_dict[age] -= 1
            df.at[index, 'D count senate'] = d_count_s_dict[age]
        elif row['Party'] == 'I':
            d_count_s_dict[age] -= 1
            df.at[index, 'D count senate'] = d_count_s_dict[age]

df.head(3)
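
If you prefer to avoid an explicit loop, pandas can produce the same kind of running count with groupby() and cumcount(). The sketch below is an alternative to the code above, not a drop-in replacement. It uses an invented toy DataFrame, writes to a single hypothetical "Count" column rather than four separate ones, and folds independents into the Democratic side in both chambers.

```python
import pandas as pd

# Invented sample rows that use the same column names as the CSV file.
toy = pd.DataFrame({
    'Branch': ['Senate', 'Senate', 'Senate', 'Senate', 'House'],
    'Party': ['R', 'R', 'D', 'I', 'R'],
    'Age': [57, 57, 57, 57, 57],
})

# Hypothetical "Side" column: treat independents as Democrats.
toy['Side'] = toy['Party'].replace({'I': 'D'})

# cumcount() numbers the rows within each group starting at 0, so add 1.
running = toy.groupby(['Branch', 'Side', 'Age']).cumcount() + 1

# Keep Republican counts positive and flip the others to negative.
toy['Count'] = running.where(toy['Side'] == 'R', -running)

print(toy)
```

With a single signed column, you would filter by "Branch" before plotting and color the markers by "Side" rather than selecting a different count column for each call.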

Masking Zero Counts

We don't want to plot zeroes, so we'll use a mask to convert these values to NaN (Not-a-Number) values in our DataFrame.

# Filter out zero values:
mask = df != 0

# Apply the mask to the DataFrame:
df = df[mask]

df.head(3)
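
The effect of the mask is easier to see on a small example. The following sketch uses an invented three-row DataFrame with shortened, hypothetical column names.

```python
import pandas as pd

# Invented sample data with illustrative column names.
toy = pd.DataFrame({
    'Age': [57, 57, 61],
    'R count': [1, 0, 2],
    'D count': [0, -1, 0],
})

# Boolean DataFrame: True wherever a cell is nonzero.
mask = toy != 0

# Indexing with the mask keeps the True cells and fills the rest with NaN.
masked = toy[mask]

print(masked)
```

Note that the comparison is applied to every column, so a genuine zero in any other numeric column would also become NaN. That isn't a problem for this dataset, as no member has an age of zero.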

Defining a Function to Make the Plot

As mentioned previously, we'll make four plots. To avoid repeating code, we'll encapsulate the plotting instructions into a reusable function.

The function will take as arguments a DataFrame, a matplotlib axes object, the column to use as the x-coordinate, a color, and a title. We'll turn off most of seaborn's default settings, such as axis ticks and labels, so that our plot is as clean and sparse as possible.

An important component of this plot is the rectangle used as a marker for each congressional member (marker='$\u25AC$'). This marker isn't part of the standard matplotlib collection but is part of the STIX font symbols. You can find a listing of these alternative markers here.

def make_plot(data, ax, x, color, title):
    """Make a custom seaborn scatterplot with annotations."""
    sns.scatterplot(data=data, 
                    x=x, 
                    y='Age', 
                    marker='$\u25AC$',
                    color=color, 
                    edgecolor=color, 
                    ax=ax, 
                    legend=False)

    # Set the border positions and visibility:
    ax.spines.left.set_position('zero')
    ax.spines.right.set_color('none')
    ax.spines.top.set_color('none')
    ax.spines.bottom.set_color('none')

    # Set x and y limits, ticks, labels, and title:
    ax.set_xlim(-15, 15)
    ax.set_ylim(25, 100)
    ax.tick_params(bottom=False)
    ax.set(xticklabels=[])
    ax.set(yticklabels=[])
    ax.set_xlabel('')
    ax.set_ylabel('')
    ax.set_title(title)

    # Manually annotate the y-axis along the right border:
    ax.text(x=12.5, y=96, s='Age')
    ax.set_yticks(np.arange(30, 101, 10))
    ylabels = [30, 40, 50, 60, 70, 80, 90]
    for label in ylabels:
        ax.text(x=13, y=label, s=str(label))

    # Add shading and annotation for each generation:
    for name, start_age, end_age, gcolor in zip(_GEN_NAMES, _GEN_START_AGE,
                                                _GEN_END_AGE, _GEN_COLORS):
        rect = patches.Rectangle((-15, start_age), 
                                 width=30, 
                                 height=end_age - start_age, 
                                 facecolor=gcolor, 
                                 alpha=0.3)
        rect.set_zorder(0)  # Move shading below other elements.
        ax.add_patch(rect)
        ax.text(x=-15, y=end_age - 2, s=name)

    plt.tight_layout()
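
To try the rectangular marker on its own, you can pass it to a plain matplotlib scatter() call. The sketch below is a minimal, standalone example with made-up coordinates. The RECT constant is a name introduced here for illustration, and the Agg backend line is included only so the code runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # Non-interactive backend; remove this line in Jupyter.
import matplotlib.pyplot as plt

RECT = '\u25AC'  # Python resolves this escape to one character (a black rectangle).

fig, ax = plt.subplots(figsize=(4, 2))

# Bottom row: the built-in square marker, for comparison.
standard = ax.scatter([1, 2, 3], [1, 1, 1], marker='s', s=200, color='gray')

# Top row: wrapping the character in dollar signs tells matplotlib to render
# it as mathtext, which is how arbitrary symbols can be used as markers.
custom = ax.scatter([1, 2, 3], [2, 2, 2], marker=f'${RECT}$', s=200,
                    color='firebrick')

ax.set_ylim(0, 3)
fig.savefig('marker_demo.png', dpi=100)
```

The same pattern should work with other Unicode symbols, provided the font that matplotlib uses for mathtext includes the glyph.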

Plotting the Figure

The following code sets up the figure and calls the make_plot() function four times. It finishes by adding a supertitle and a custom legend.

# Make the figure and call the plotting function:
fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, figsize=(8, 5))
make_plot(df, ax0, 'D count house', 'blue', 'House')
make_plot(df, ax0, 'R count house', 'firebrick', 'House')
make_plot(df, ax1, 'D count senate', 'blue', 'Senate')
make_plot(df, ax1, 'R count senate', 'firebrick', 'Senate')

# Add figure title and custom legend:
fig.suptitle('Age of US Congress 2023')
ax0.text(x=-15, y=17, s='$\u25AC$ Democrat & Independent', color='blue')
ax0.text(x=1.7, y=17, s='$\u25AC$ Republican', color='firebrick');

# Optional line to save figure:
# plt.savefig('age_of_congress.png', bbox_inches='tight', dpi=600)

Conclusion

The best infographics tell stories with a clean, eye-catching style. Just as really well-written Python code requires few to no comments, great infographics don't require a lot of labels or annotations.

In this project, we used pandas to load and prepare the data and seaborn to generate a scatter plot that mimics a bar chart. A key feature of this plot was the use of a STIX font symbol for the rectangular markers.

For datasets with many low-count values, this scatter plot approach is more visually pleasing than a standard bar chart where many of the bars will be short in length. Additionally, representing each member with a distinct marker "personalizes" the data more than showing a single bar for multiple members.