Unlock Your Full Potential as a Business Analyst With the Powerful 5-Step Causal Impact Framework
In a business context, leadership is often interested in the impact of a decision or event on a KPI of interest. As a performance analyst, I spend most of my time answering some variant of this question: "What was the impact of {news, government announcement, special event…} on Country X's performance?". Intuitively, we could answer this question if we had a way of knowing what would have happened if the news, announcement, or special event had never occurred.
This is the essence of Causal Inference, and some very talented people are working hard to make causal inference frameworks available for us to use.
The Google CausalImpact library is one of those frameworks. Developed by Google to help make better marketing budget decisions, it can help us quantify the impact of any event or intervention on a time series of interest. It may sound scary, but it's actually quite intuitive.
As business analysts, we should leverage these tools in our day-to-day lives; here are 5 easy steps you can take to implement your first Causal Impact analysis.
Step 1: Install and Import Packages
For this guide, we will be using Python.
We will start by installing the Google Causal Impact package.
pip install tfcausalimpact
You can find more information about this package on GitHub: https://github.com/WillianFuks/tfcausalimpact
To run a Causal Impact analysis, you only need 4 packages.
from causalimpact import CausalImpact
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Step 2: Import Data and Define Pre/Post Periods
We can think of the Causal Impact framework as a time series problem.
On a specific date, we observe an event (news, an announcement, etc.) and track how our measure of interest changes after this event compared to some baseline. You can think of your baseline as a control group.
To perform a Causal Impact analysis, we will need the following:
- Date of the event to study
- Time series of our variable of interest for the impacted unit
- Time series of our variable of interest for multiple other units that were not impacted by the event
For my example, I am using a made-up scenario as follows:
- Event: Global conference happened on July 9th, 2022, in Barcelona
- Variable of Interest: Passenger revenue for an airline
- Impacted Unit: Barcelona
- Control group: Other European Markets
- Question: What was the impact of the Conference on Passenger revenue in Barcelona?
When choosing the time frame, keep the post-period as short as possible, and make the pre-period longer than your post-period.
Please note that the data is artificial for illustration only.
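If you want to follow along without a CSV, you can simulate a comparable toy dataset. The market names, the shared-trend structure, and the size of the post-event lift below are my own assumptions for illustration, not the article's actual data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
dates = pd.date_range("2022-03-10", "2022-07-13", freq="D")

# A common European demand trend plus market-specific noise,
# so the control markets correlate with Barcelona in the pre-period
trend = np.cumsum(rng.normal(0, 1, len(dates)))
markets = ["Barcelona", "Madrid", "Paris", "Rome", "Lisbon"]
data = pd.DataFrame(
    {m: 100 + trend + rng.normal(0, 0.5, len(dates)) for m in markets},
    index=dates,
)
data.index.name = "Date"

# Add a lift to Barcelona only, starting at the (hypothetical) event window
data.loc["2022-07-01":, "Barcelona"] += 25

print(data.shape)  # → (126, 5)
```

This gives one row per day from the start of the pre-period to the end of the post-period, with the impacted unit in the first column.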
# Define the dates
training_strat = "2022-03-10"
training_end = "2022-06-30"
treatment_start = "2022-07-01"
treatment_end = "2022-07-13"
# Import the data - for this example I am using a CSV
data = pd.read_csv("example.csv", index_col='Date', parse_dates=True)
Step 3: Creating a Control Group
This step is the most critical part.
To estimate the impact of any event or treatment, we would need to observe our unit of interest (example: Barcelona passenger revenue) both under the treatment and under no treatment. This is known as the fundamental problem of causal inference.
We can never observe our unit of interest under two mutually exclusive circumstances. The solution is to create a counterfactual scenario.
You can think of a counterfactual as a hypothetical scenario where the event/treatment did not occur.
Our control group will help us create this scenario. To decide which markets to include in our control group, we need to determine which markets have predictive power for our market of interest (Barcelona). We will use correlation to establish this predictive power.
This step needs to be done using the PRE-TREATMENT period only.
# get the training data only
df_training = data[data.index <= pd.to_datetime(training_end)]
When checking correlation using time series, we need our data to be stationary, meaning without a trend or seasonal components, to avoid finding spurious correlations.
# Check for stationarity (the ADF test returns the p-value at index 1)
from statsmodels.tsa.stattools import adfuller
test = adfuller(x=df_training['Barcelona'])[1]
print(test)
if test < 0.05:
    print("Data is stationary")
else:
    print("Time series is not stationary")
However, most time series are not stationary; fortunately, we can use a method called differencing to make them stationary.
# Differencing: pct_change computes the % change from the previous date
differencing = df_training.pct_change().dropna(thresh=1, axis=1).dropna()
differencing.head()
# Retest on the differenced data
test = adfuller(x=differencing['Barcelona'])[1]
print(test)
if test < 0.05:
    print("Data is stationary")
else:
    print("Time series is not stationary")
Now, we can use this new time series to establish the correlation between markets.
# Compute absolute correlations with Barcelona
market_cor = differencing.corr().abs()['Barcelona']
# Create a list of markets with a correlation coefficient >= 0.3
markets_to_keep = list(market_cor[market_cor >= 0.3].index)
# Keep only markets on the above list
final_data = data.drop(columns=[col for col in data if col not in markets_to_keep])
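The selection logic can be checked on a tiny hand-made example (the correlation values below are toy numbers, not real output):

```python
import pandas as pd

# Hypothetical absolute correlations of each market with Barcelona
market_cor = pd.Series(
    {"Barcelona": 1.0, "Madrid": 0.55, "Paris": 0.41, "Rome": 0.12, "Lisbon": 0.28}
)

# Keep markets whose absolute correlation with Barcelona is at least 0.3
markets_to_keep = list(market_cor[market_cor >= 0.3].index)
print(markets_to_keep)  # → ['Barcelona', 'Madrid', 'Paris']
```

Note that the market of interest itself always passes the filter (its self-correlation is 1.0), which is what we want: CausalImpact expects the response variable to be the first column of the data you pass it.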
Step 4: Implementing CausalImpact
Now, we are ready to implement CausalImpact.
In a nutshell, CausalImpact will use our control group to learn to predict the Passenger Revenue in Barcelona during the Pre-period. The model will use this to predict the counterfactual post-period scenario where the Conference did not happen.
The Delta between what actually happened and the counterfactual scenario is the impact of the conference.
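The delta logic is simple enough to sketch with toy numbers (these are illustrative values, not the Bayesian model's actual outputs):

```python
import pandas as pd

# Hypothetical post-period series: what actually happened vs. the
# model's counterfactual prediction of a world without the conference
actual = pd.Series([410, 420, 415, 430])
counterfactual = pd.Series([180, 175, 185, 190])

pointwise_effect = actual - counterfactual  # impact on each day
cumulative_effect = pointwise_effect.sum()  # total impact over the post-period
print(cumulative_effect)  # → 945
```

CausalImpact does this subtraction for you and, because the counterfactual is a Bayesian prediction, also attaches credible intervals to both the pointwise and the cumulative effect.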
# Prepare pre and post periods
pre_period = [training_strat, training_end]
post_period = [treatment_start, treatment_end]
# Fit CausalImpact
impact = CausalImpact(data=final_data,
                      pre_period=pre_period,
                      post_period=post_period)
Step 5: Interpreting Results & Validation
Google CausalImpact makes it very easy to visualize and summarize the results.
You can start by plotting the impact.
impact.plot()
print(impact.summary())
Posterior Inference {Causal Impact}
                           Average                 Cumulative
Actual                     397631.92               5169215.0
Prediction (s.d.)          177248.77 (6914.43)     2304234.0 (89887.6)
95% CI                     [163527.02, 190631.09]  [2125851.11, 2478204.1]

Absolute effect (s.d.)     220383.16 (6914.43)     2864981.0 (89887.6)
95% CI                     [207000.83, 234104.91]  [2691010.9, 3043363.89]

Relative effect (s.d.)     124.34% (3.9%)          124.34% (3.9%)
95% CI                     [116.79%, 132.08%]      [116.79%, 132.08%]

Posterior tail-area probability p: 0.0
Posterior prob. of a causal effect: 100.0%

For more details run the command: print(impact.summary('report'))
Depending on what you are measuring, you might be interested in average impact or cumulative impact. In our case, we are interested in the cumulative impact over the period of the conference.
Based on the analysis, the conference contributed roughly +$2.9M in incremental passenger revenue for Barcelona (the cumulative absolute effect of 2,864,981).
For a more comprehensive summary, you can use the below command.
print(impact.summary('report'))
Analysis report {CausalImpact}
During the post-intervention period, the response variable had
an average value of approx. 397631.92. By contrast, in the absence of an
intervention, we would have expected an average response of 177248.77.
The 95% interval of this counterfactual prediction is [163527.02, 190631.09].
Subtracting this prediction from the observed response yields
an estimate of the causal effect the intervention had on the
response variable. This effect is 220383.16 with a 95% interval of
[207000.83, 234104.91]. For a discussion of the significance of this effect,
see below.
Summing up the individual data points during the post-intervention
period (which can only sometimes be meaningfully interpreted), the
response variable had an overall value of 5169215.0.
By contrast, had the intervention not taken place, we would have expected
a sum of 2304234.0. The 95% interval of this prediction is [2125851.11, 2478204.1].
The above results are given in terms of absolute numbers. In relative
terms, the response variable showed an increase of +124.34%. The 95%
interval of this percentage is [116.79%, 132.08%].
This means that the positive effect observed during the intervention
period is statistically significant and unlikely to be due to random
fluctuations. It should be noted, however, that the question of whether
this increase also bears substantive significance can only be answered
by comparing the absolute effect (220383.16) to the original goal
of the underlying intervention.
The probability of obtaining this effect by chance is very small
(Bayesian one-sided tail-area probability p = 0.0).
This means the causal effect can be considered statistically
significant.
Unlike supervised machine learning models, a Causal Impact analysis has no ground-truth accuracy metric, which can make validation a bit tricky.
However, there are 3 things you can do to validate your results.
- Ensure the confidence interval in the pre-period is not too broad; a wide interval indicates that your control group is not predictive enough.
- Ensure the confidence interval of the estimated impact does not contain 0.
- Use refutation tests: for example, rerun the same analysis with the event date moved to any day in the pre-period; the estimated impact should then be 0.
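As a sketch, the first two checks can be encoded as simple assertions on the interval bounds the model reports. The function name, the relative-width threshold, and the idea of combining the two checks are my own convention, not part of the library:

```python
def effect_is_credible(interval_width, prediction_mean,
                       effect_lower, effect_upper,
                       max_relative_width=0.5):
    """Basic sanity checks on a Causal Impact result.

    - The prediction interval should not be too wide relative to the
      predicted mean; otherwise the control group predicts the target poorly.
    - The interval of the estimated effect should exclude 0.
    """
    narrow_enough = interval_width / abs(prediction_mean) <= max_relative_width
    excludes_zero = effect_lower > 0 or effect_upper < 0
    return narrow_enough and excludes_zero

# Using the numbers from the summary above:
print(effect_is_credible(
    interval_width=190631.09 - 163527.02,  # width of the prediction interval
    prediction_mean=177248.77,
    effect_lower=207000.83,
    effect_upper=234104.91,
))  # → True: the interval is narrow and the effect excludes 0
```

The refutation test in the third bullet cannot be reduced to a one-liner: it means rerunning the full analysis with a placebo event date and checking that the resulting effect interval does contain 0.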
This is one of the most common analyses I do at work; once you get familiar with the 5-step framework, you too can use it and be on your way to becoming a business analyst rockstar.
References
[1] Brodersen, K. H., Gallusser, F., Koehler, J., Remy, N., & Scott, S. L. (2015). Inferring causal impact using Bayesian structural time-series models. Annals of Applied Statistics, 9(1), 247-274.
[2] Molak, A. (2023). Causal Inference and Discovery in Python: Unlock the secrets of modern causal machine learning with DoWhy, EconML, PyTorch and more. Packt Publishing.
[3] Inferring the effect of an event using CausalImpact, Kay Brodersen: https://research.google/pubs/pub41854/