Grow a Treemap with Python and Plotly Express

Author: Murphy  |  View: 26908  |  Time: 2025-03-23 19:00:48
Photo by Robert Murray on Unsplash

Hierarchical data is a data model where items are linked to each other in parent-child relationships, forming a tree structure. Some obvious examples are family trees and corporate organization charts.

A treemap is a diagram that represents hierarchical data using nested rectangles. The area of each rectangle corresponds to its numerical value. Treemaps have been around for about 30 years. An early application was to visualize hard drive usage, as demonstrated in the figure below.

Allocation of hard disk space visualized with a treemap (Carnivore1973 via Wikimedia Commons)
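To make the "area corresponds to value" idea concrete, here is a minimal sketch of the slice-and-dice layout, the simplest way to tile a treemap. It is an illustration only and not Plotly's default tiling. The `budget` dictionary, its numbers, and the function names are made up for this example.

```python
def total(node):
    """Sum the leaf values under a node (a number or a nested dict)."""
    if isinstance(node, dict):
        return sum(total(child) for child in node.values())
    return node


def slice_and_dice(tree, x, y, width, height, vertical=True, rects=None):
    """Give each leaf a rectangle (x, y, width, height) with area proportional
    to its value, alternating the split direction at each level."""
    if rects is None:
        rects = {}
    whole = total(tree)
    offset = 0.0  # fraction of the parent rectangle already used
    for name, child in tree.items():
        share = total(child) / whole
        if vertical:   # carve the parent into side-by-side strips
            cx, cy, cw, ch = x + offset * width, y, share * width, height
        else:          # carve the parent into stacked strips
            cx, cy, cw, ch = x, y + offset * height, width, share * height
        offset += share
        if isinstance(child, dict):
            slice_and_dice(child, cx, cy, cw, ch, not vertical, rects)
        else:
            rects[name] = (cx, cy, cw, ch)
    return rects


# Hypothetical two-level hierarchy (values in arbitrary units):
budget = {'Housing': {'Rent': 60, 'Utilities': 20}, 'Food': 20}
rects = slice_and_dice(budget, 0, 0, 100, 100)

for name, (rx, ry, rw, rh) in rects.items():
    print(f'{name}: area = {rw * rh:.0f}')
```

Each leaf's area on the 100 x 100 canvas is its value times 100, so the nesting shows the hierarchy while the areas show the proportions.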

Treemaps let you capture both the value of individual categories and the structure of the hierarchy. They're useful for:

  • Displaying hierarchical data when the number of categories overwhelms a bar chart.
  • Highlighting proportions between individual categories and the whole.
  • Distinguishing categories using different sizes and colors.
  • Spotting patterns, primary contributors, and outliers.
  • Bringing a fresh look to data visualization.

In this Quick Success Data Science project, we'll use Python to create a treemap to help people budget for expenses. We'll first use the tabula-py library to turn a web-based Bureau of Labor Statistics PDF into a pandas DataFrame. Then we'll use the Plotly Express library to turn the DataFrame into an attractive and interactive area-based visualization.


Consumer Expenditure Surveys

With the pandemic and subsequent inflationary surge, consumer spending has drawn a lot of attention. Individuals need to know how to budget effectively, and policymakers need to understand which sectors are placing the greatest financial burden on potential voters.

To help track income and spending, the Census Bureau uses Consumer Expenditure Surveys to collect information on US consumers' expenditures, income, and demographic characteristics. The Bureau of Labor Statistics (BLS) then compiles these statistics into annual reports published in September.

These BLS reports are in the public domain and are useful to both policymakers and individuals. Young people just starting out can use the tables as a starting point for preparing budgets and savings plans. Older people can use the same tables to help plan their retirement. You can view a complete table here and part of one in the figure below.

The first lines of the 2021 BLS Consumer Expenditure Survey Table 1300 (from the Bureau of Labor Statistics)

Consumer Expenditure tables lump expenditures into 14 different types, as listed below:

Explanation of expenditure types (from the Bureau of Labor Statistics)

The cash contributions category includes support for college students living away from home, alimony and child support payments, and personal cash gifts, such as those for birthdays or weddings. It also includes contributions to religious, charitable, or political organizations.


The Plotly Express and Tabula Libraries

Plotly Express is a higher-level version of the Plotly graphing library. It lets you easily produce attractive figures with a lot of built-in functionality.

A weakness of treemaps is that small rectangles may not be labeled, or the labels may be illegible. Plotly Express helps to overcome this limitation by providing an interactive "hover window" that appears when the cursor pauses over a rectangle. This popup window contains detailed information that would be impractical to display directly on the diagram.

The _tabula-py_ library is a Python wrapper of tabula-java, which enables table extraction from a PDF. The extracted data can then be converted into a list of DataFrames, or a CSV, TSV, or JSON file.

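You can install Plotly and tabula-py with pip or conda. Plotly Express ships inside the plotly package (as plotly.express), so it needs no separate install. Because tabula-py wraps tabula-java, it also needs a Java runtime on your system. Here's a conda installation example, assuming the conda-forge channel:

conda install -c conda-forge plotly tabula-py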


The Process

Our job will be to convert the BLS "Table 1300" PDF into a pandas DataFrame and then into a treemap. While tabula-py provides multiple format conversions for PDFs, we'll take the CSV route here:

  1. Use tabula-py to read the PDF from the web and convert it to a CSV.
  2. Use pandas to convert the CSV to a DataFrame.
  3. Use pandas to clean and prepare the DataFrame for plotting.
  4. Use Plotly Express to plot the treemap.

Importing Libraries

The following snippet imports the libraries we'll need, sets up Jupyter Notebook to show every row of a DataFrame rather than a truncated view, and displays float values in DataFrames with two decimal places.

import string

import pandas as pd
import plotly.express as px
import tabula 

# Permit display of entire DataFrame and set decimal precision:
pd.set_option('display.max_rows', None)
pd.set_option('display.float_format', '{:.2f}'.format)
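
Note that these options change only how values are displayed, not the stored data. Here is a minimal sketch of that distinction, using a made-up one-column DataFrame and `pd.option_context` to apply the float format temporarily:

```python
import pandas as pd

# Hypothetical data, for illustration only:
demo = pd.DataFrame({'share': [3.14159, 2.0]})

# Apply the two-decimal display format only inside this block:
with pd.option_context('display.float_format', '{:.2f}'.format):
    rounded_view = demo.to_string()

print(rounded_view)            # formatted text representation
print(demo['share'].iloc[0])   # the underlying value is unchanged
```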

Converting the PDF Table into a DataFrame

To convert the web-based PDF into a DataFrame, we'll first use tabula-py to convert it into a CSV file and then use pandas to convert the CSV file into a DataFrame.

# Convert web-based PDF into a CSV file:
URL = ('https://www.bls.gov/cex/tables/calendar-year/'
       'aggregate-group-share/reference-person-age-ranges-2021.pdf')
tabula.convert_into(URL, "output.csv", output_format="csv", pages='all')

# Convert CSV file into a DataFrame and inspect:
df = pd.read_csv('output.csv', header=1)
df.head()
The head of the initial DataFrame (image by author)
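
As an aside, tabula-py can also skip the CSV file and return DataFrames directly. This is a different route from the one used in this project. The sketch below assumes a recent tabula-py version, in which `tabula.read_pdf()` returns a list of DataFrames. The helper name `pdf_tables_to_frames` is made up for this example, and the resulting frames would need their own header handling and cleanup.

```python
def pdf_tables_to_frames(pdf_source, pages='all'):
    """Return the tables found in a PDF (local path or URL) as a list
    of pandas DataFrames, one per detected table."""
    # Imported here so the function can be defined even on a machine
    # where tabula-py or Java is not yet installed.
    import tabula
    return tabula.read_pdf(pdf_source, pages=pages)


# Example use (needs Java, tabula-py, and network access):
# frames = pdf_tables_to_frames(URL)
# print(len(frames), frames[0].shape)
```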

Wrangling the Data

Now we need to clean the DataFrame. Row 4 contains the number of consumer units, in thousands, for each age bracket, stored in columns 3–9. We'll need the number of consumer units for each bracket to determine average expenditures later. The following code extracts the unit information, removes punctuation, converts the results to integers, and multiplies by 1,000.

units = df.iloc[4, 3:10].apply(lambda x: int(x.translate(
                             str.maketrans('', '', string.punctuation))) * 1000)
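
To see what the translate step does in isolation, here is a small sketch on a single string. The input `'21,024'` is an assumed raw cell value, chosen to match the 25-34 result shown later, and `to_units` is a hypothetical helper name.

```python
import string


def to_units(raw):
    """Strip punctuation from a string such as '21,024', convert it to an
    integer, and scale from thousands to individual units."""
    remove_punctuation = str.maketrans('', '', string.punctuation)
    return int(raw.translate(remove_punctuation)) * 1000


print(to_units('21,024'))
```

Keep in mind that string.punctuation includes the period, so this approach is only safe for whole-number strings. A value like '1.5' would silently become 15.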

Next, we create a dictionary that maps each age bracket (named after the DataFrame column headers) to the number of consumer units in that bracket.

# Define age brackets based on DataFrame column headers and create a dictionary:
brackets = ['25-34', '35-44', '45-54', '55-64', '65 years', '65-74', '75 years']
num_units_per_age = dict(zip(brackets, units))

print(num_units_per_age)
{'25-34': 21024000, '35-44': 22921000, '45-54': 22276000, '55-64': 24751000, '65 years': 36016000, '65-74': 21479000, '75 years': 14537000}

Preparing an Age Bracket for Plotting

We'll want to plot a single age bracket at a time. The following code designates the bracket (for 25–34-year-olds), produces a new DataFrame for this bracket, does some more data cleaning and renaming, and calculates the percentage share and average expenditure for each expenditure type.

The expenditure type names, such as "Food" and "Housing," were found by scrolling through Table 1300. These represent high-level summaries of much more granular expenditure details. The age bracket column represents the percent of the aggregate spending attributed to that age group.
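
The arithmetic behind the average expenditure can be sketched with a single made-up row. The aggregate and share values below are hypothetical. The sketch assumes, as the factor of 1,000,000 in the project code implies, that aggregate figures are in millions of dollars. The number of consumer units is the 25-34 value extracted earlier.

```python
# Hypothetical inputs for one expenditure type:
aggregate_millions = 1_000_000    # all consumer units, in millions of dollars
bracket_share_percent = 15.0      # share attributed to the age bracket
consumer_units = 21_024_000       # consumer units in the 25-34 bracket

# Spending by the bracket, still in millions of dollars:
bracket_millions = aggregate_millions * bracket_share_percent / 100

# Average spending per consumer unit, in whole dollars:
average_dollars = int(bracket_millions * 1_000_000 / consumer_units)

print(bracket_millions, average_dollars)
```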

# Designate age bracket to examine (must match column header name): 
AGE = '25-34'

# Prepare DataFrame (copy the slice to avoid a SettingWithCopyWarning):
df = df.iloc[4:, :].copy()
df.columns = ['Expenditure Type', 'Aggregate'] + list(df.columns[2:])
df = df[['Expenditure Type', 'Aggregate', AGE]]

# Strip punctuation and leading and trailing whitespace from columns:
df['Expenditure Type'] = df['Expenditure Type'].str.replace(
                                           r'[^\w\s]', '', regex=True).str.strip()
df['Aggregate'] = df['Aggregate'].str.replace(r'[^\w\s]', '', regex=True)

# Make top-level 'Expenditure Type' the index and keep only selected rows:
df = df.set_index('Expenditure Type', drop=True)
df = df.loc[['Food', 'Housing', 'Transportation', 'Healthcare', 'Education',
             'Reading', 'Alcoholic beverages', 'Apparel and services',
             'Entertainment', 'Personal care products and services', 
             'Tobacco products and smoking supplies', 'Miscellaneous', 
             'Cash contributions', 'Personal insurance and pensions'], :]

# Rename index entries for shorter treemap labels:
df.rename(index={'Alcoholic beverages': 'Alcohol', 
                 'Apparel and services': 'Apparel',
                 'Personal care products and services': 'P-Care',
                 'Tobacco products and smoking supplies': 'Smoking',
                 'Miscellaneous': 'Misc.',
                 'Cash contributions': 'Cash Contr.',
                 'Personal insurance and pensions': 'Insurance'}, inplace=True)

# Calculate percent share and average expenditure (in $) for each type:
df['Expenditure'] = df['Aggregate'].astype(float) * df[AGE].astype(float) / 100
df['Expenditure Percent'] = (df['Expenditure'] / df['Expenditure'].sum()) * 100
df['Ave Expenditure'] = (df['Expenditure'] * 1000000) / num_units_per_age[AGE]
df['Ave Expenditure'] = df['Ave Expenditure'].astype(int)
df.head(15)
The DataFrame ready for plotting (image by author)
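
The punctuation-stripping step relies on a regular expression that removes every character that is neither a word character (\w) nor whitespace (\s). Here is a minimal sketch of its effect. The sample strings are made up to resemble table cells with dot leaders, dollar signs, and thousands separators.

```python
import pandas as pd

# Hypothetical raw cell values:
raw = pd.Series(['  Food .....', 'Housing ...', '$1,234,567'])

# Drop everything except word characters and whitespace, then trim:
cleaned = raw.str.replace(r'[^\w\s]', '', regex=True).str.strip()

print(cleaned.tolist())
```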

Plotting the Treemap

Plotly Express comes with over 30 functions for creating entire figures at once. The treemap() function takes the DataFrame, a path argument that lists the hierarchy levels from the root down (here, a constant root labeled "Expenditure Type" followed by the expenditure types in df.index), and the column names to use for the rectangle values and colors. We'll also specify a color scale ("portland") and dimensions for the figure.

# Create treemap:
fig = px.treemap(df, path=[px.Constant('Expenditure Type'), df.index],
                 values='Expenditure Percent',
                 color='Ave Expenditure',
                 color_continuous_scale='portland',
                 width=900, height=500)

# Update layout:
title = (f'Average Expenditure for {AGE} year-olds '
         '(Bureau of Labor Statistics 2021)')
fig.update_layout(title=title, margin=dict(t=40, l=10, r=10, b=25))
The treemap for average expenditures for 25–34-year-olds (image by author)

You can see how this figure could be useful for financial planners trying to explain budgeting to a client. Compared to several pages of text, it's a lot easier on the eyes. And with Plotly Express, you can drill down to see the detailed values by simply hovering your cursor over a rectangle, as demonstrated for the healthcare sector below.

Example of the hover feature showing details for healthcare spending (image by author)

Although it won't change the layout of the treemap (both columns are essentially proportional to the same expenditure values), you can make the color bar show "Expenditure Percent" by swapping the arguments for the values and color parameters, like so:

fig = px.treemap(df, path=[px.Constant('Expenditure Type'), df.index],
                 values='Ave Expenditure',
                 color='Expenditure Percent',
                 color_continuous_scale='portland',
                 width=900, height=500)
The treemap for average expenditure percent for 25–34-year-olds (image by author)
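
One way to see why swapping the two columns leaves the rectangle sizes unchanged is to normalize each column so it sums to 1. Here is a minimal sketch with made-up expenditure values. Only the number of consumer units comes from the data extracted earlier.

```python
# Hypothetical bracket expenditures, in millions of dollars:
expenditures = [300.0, 150.0, 50.0]
consumer_units = 21_024_000   # 25-34 bracket

percent = [e / sum(expenditures) * 100 for e in expenditures]
average = [e * 1_000_000 / consumer_units for e in expenditures]


def normalize(values):
    """Rescale values so they sum to 1 (the relative rectangle areas)."""
    whole = sum(values)
    return [v / whole for v in values]


print(normalize(percent))
print(normalize(average))
```

Both lists hold the same fractions, so the relative areas match. In the project code, the average is truncated to an integer, which can introduce tiny differences.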

Based on the previous treemap, it's apparent that 25–34-year-olds devote the largest shares of their spending to housing, transportation, and insurance. But what about seniors?

If you change the AGE variable to '75 years' and rerun the code, you'll see that healthcare has become much more important, as have cash contributions, which probably take the form of charitable donations and gifts to family. The color bar also indicates that expenditures are generally lower than for the 25–34-year-old bracket.

The treemap for average expenditures for 75+ year-olds (image by author)

One final observation about treemaps: because the rectangles are sorted by value, you can distinguish between categories with similar values. In the previous figure, "Food" and "Transportation" are similar in size, but you can be sure that "Food" is larger because of its position in the sorted layout.

Caveat Emptor!

The values in the BLS table represent nationwide averages, so local variations will occur. For example, housing and transportation costs will undoubtedly be higher in large urban areas than in small rural towns. Additionally, all the expenditure estimates are subject to error. Therefore, this data should be used directionally, rather than absolutely, for any financial planning purposes.


Summary

The tabula-py module greatly simplifies the process of converting PDF tables into useful formats like CSV. Likewise, Plotly Express, paired with a pandas DataFrame, makes it easy to generate interactive treemaps. With a treemap, you can easily visualize – and communicate – relationships in hierarchical data.

Treemaps are interesting, but they aren't the only game in town. To see the same basic data presented with Sankey diagrams, check out this post in the Visual Capitalist.


Thanks!

Thanks for reading. If you enjoyed this article, then check out my books, Impractical Python Projects and Real-world Python, for more coding ideas. And follow me to see more Quick Success Data Science projects in the future.

Tags: Budgeting Plotly Express Python Programming Tabula Py Treemap