Analyzing Geo-based Campaigns with Synthetic Control


Navigating the complexities of location-based testing can be challenging.

Luckily, GeoLift is here to streamline the process.

Developed by Meta's Marketing Science team, this open-source library is designed for conducting geo-based experiments. It enables us to:

  1. Effortlessly select test markets, and
  2. Generate a Synthetic Control market to benchmark against your chosen test locations

While the GeoLift documentation offers a comprehensive guide on selecting test markets, this article will focus specifically on how to create and utilize a synthetic control market.

Let's dive in.

Synthetic control method (SCM)

Figure: side-by-side comparison of the difference-in-differences method and the synthetic control method (SCM). Generated by the author using ChatGPT x Wolfram.

According to Wikipedia, SCM

is designed to estimate the potential outcomes for a treatment group had the treatment not been applied. Unlike the difference in differences approach, SCM adjusts for time-varying confounders by weighting the control group to closely resemble the treatment group prior to the intervention

Simply put, SCM constructs a "parallel universe" that lets you observe what would have happened had the campaign not launched, enabling a clear comparison and a clean measurement of the campaign's true incremental effect.
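
Concretely, the synthetic control is a weighted average of the donor (control) markets, with weights chosen so that the weighted pre-period series tracks the treated market as closely as possible before the intervention. Here is a minimal sketch of that idea in Python on simulated data. The constrained least squares below mirrors the textbook SCM formulation; GeoLift itself uses the more sophisticated Augmented Synthetic Control Method.

import numpy as np
from scipy.optimize import minimize

# simulated pre-period KPI: 90 days x 5 donor markets (illustrative only)
rng = np.random.default_rng(42)
Y_donors = rng.normal(100, 10, size=(90, 5))
true_w = np.array([0.5, 0.3, 0.2, 0.0, 0.0])
Y_treated = Y_donors @ true_w + rng.normal(0, 1, size=90)

# squared distance between the treated series and the weighted donors
def pre_period_gap(w):
    return np.sum((Y_treated - Y_donors @ w) ** 2)

J = Y_donors.shape[1]
res = minimize(pre_period_gap,
               x0=np.full(J, 1 / J),  # start from equal weights
               bounds=[(0, 1)] * J,   # weights are non-negative
               constraints={"type": "eq", "fun": lambda w: w.sum() - 1})  # and sum to 1

synthetic = Y_donors @ res.x  # the "parallel universe" series
print(np.round(res.x, 3))     # recovers roughly [0.5, 0.3, 0.2, 0.0, 0.0]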

Streamlined guide to creating synthetic control

1. Process data

2. Generate weights

3. Visualize synthetic control

4. Measure lift

The first two steps are detailed in the GeoLift documentation. I'll include the code and output here for consistency.

We'll execute the first two steps in R and the final two in Python.

If you're primarily a Python user, don't worry – the R code is straightforward and easy to follow. I promise it will be a breeze!

1. Process data

What your data should look like (a quick validation sketch follows this list):

  • Daily or weekly granularity
  • At least 4–5x the test duration of pre-campaign historical data
  • 20 or more geo-units
  • More on best practices
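
Before running GeoLift, a quick pandas check against these requirements can save a debugging round trip. A minimal sketch (the file path, column names, and the 14-day test duration are assumptions for illustration):

import pandas as pd

df = pd.read_csv('market_data.csv', parse_dates=['date'])  # hypothetical export

test_duration_days = 14  # assumed planned test length
n_geos = df['location'].nunique()
n_days = df['date'].nunique()

assert n_geos >= 20, f"only {n_geos} geo-units; GeoLift recommends 20+"
assert n_days >= 4 * test_duration_days, f"only {n_days} days of history; aim for 4-5x the test duration"

# daily granularity: every location should have one row per day
rows_per_geo = df.groupby('location')['date'].nunique()
assert rows_per_geo.eq(n_days).all(), "some locations have missing days"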

Install packages:

  • Install R and RStudio here
  • Install GeoLift package here

Load data:

  • Load the GeoLift data included in the GeoLift package into R. The data is simulated, covering 40 US cities across 90 days from 2021-01-01 to 2021-03-31.
library(GeoLift)

data(GeoLift_PreTest)
  • Check data
> head(GeoLift_PreTest)
  location    Y       date
1 new york 3300 2021-01-01
2 new york 3202 2021-01-02
3 new york 4138 2021-01-03
4 new york 3716 2021-01-04
5 new york 3270 2021-01-05
6 new york 3260 2021-01-06
  • Format data
GeoTestData_PreTest <- GeoDataRead(data = GeoLift_PreTest,
                                   date_id = "date",
                                   location_id = "location",
                                   Y_id = "Y",
                                   X = c(), #empty list as we have no covariates
                                   format = "yyyy-mm-dd",
                                   summary = TRUE)

##################################
#####       Summary       #####
##################################
* Raw Number of Locations: 40
* Time Periods: 90
* Final Number of Locations (Complete): 40
  • Y_id is your key performance indicator, such as revenue or active user count.
  • X = c() allows you to include optional covariates that dynamically correlate with your objective metric, such as user retention, rather than static metrics like city population or median income.

2. Generate weights

Let's say our experiment is going to be in Austin, Texas. Our objective is to create a synthetic control that closely mirrors Austin's conditions before the experiment begins.

Create weights:

  • Specify Austin as test city
weights <- GetWeights(Y_id = "Y",
                      location_id = "location",
                      time_id = "time",
                      data = GeoTestData_PreTest,
                      locations = c("austin"),
                      pretreatment_end_time = 90,
                      fixed_effects = TRUE)
  • pretreatment_end_time = 90 uses the 90 days of historical data before the test begins to synthesize a control city. Adjust this duration as needed to fit your experiment.
  • To test in multiple cities, specify them in locations, for example locations = c("austin", "dallas").

To exclude any markets from control:

exclude_markets <- c("honolulu","washington")
GeoTestData_PreTest_Excl <- subset(GeoTestData_PreTest, !location %in% exclude_markets)

Then feed GeoTestData_PreTest_Excl into the GetWeights function above.

Display top weights:

> head(dplyr::arrange(weights, desc(weight)))
     location     weight
1  cincinnati 0.35232541
2     detroit 0.27955009
3    honolulu 0.12960818
4 minneapolis 0.10951033
5    portland 0.06265098
6 san antonio 0.01844960

Export the weights and market data to CSV:

write.csv(weights, "/Users/mandyliu/Documents/R/geolift_weights.csv", row.names=FALSE)
write.csv(GeoLift_PreTest, "/Users/mandyliu/Documents/R/market_data.csv", row.names=FALSE)

3. Visualize synthetic control

Now that we have the weights file, we'll create a synthetic control geo to compare against our test market, Austin.

Set up:

  • Import packages in Python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime, timedelta, date
  • Read data
df_weights = pd.read_csv('/Users/mandyliu/Documents/R/geolift_weights.csv')  
df_markets = pd.read_csv('/Users/mandyliu/Documents/R/market_data.csv')
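
Before combining anything, a quick sanity check on the exported weights helps. They should sum to roughly 1; note that GeoLift's augmented estimator can produce small negative weights, so treat the sign check as a soft one:

# soft sanity checks on the exported weights
print(df_weights['weight'].sum())  # expect a value close to 1.0
print(df_weights['weight'].min())  # large negative values deserve a closer look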

Combine the weights with the control markets to create a single synthetic control:

# convert to pandas datetime
df_markets['date']=pd.to_datetime(df_markets['date'],format = '%Y-%m-%d')

# combine control markets with weights
df_markets_weights = df_markets.merge(df_weights, on='location')
df_markets_weights['weighted_Y'] = df_markets_weights['Y'] * df_markets_weights['weight']

# sum weighted_Y by date to create a single synthetic control city
df_syn_control = (df_markets_weights.groupby('date', as_index=False)['weighted_Y']
                  .sum()
                  .rename(columns={'weighted_Y': 'Y'}))
df_syn_control['location'] = 'syn_control'

# append Austin data to syn control
df_markets_austin = df_markets[df_markets['location']=='austin'].reset_index(drop = True)
df_syn_control_austin = pd.concat([df_markets_austin,df_syn_control],ignore_index = True)

Visualize Austin vs. synthetic control:

sns.lineplot(data=df_syn_control_austin, x="date", y="Y",hue = 'location',palette=['purple', 'pink'])
plt.xticks(rotation=45)

Check correlation:

df_corr = df_syn_control[['date','Y']].merge(df_markets_austin[['date','Y']],on = 'date')
df_corr['Y_x'].corr(df_corr['Y_y'])

The correlation between Austin and the synthetic control stands at 0.950 – solid!

4. Measure lift

There is a slight twist.

Although Austin and its control show high correlation, the graph reveals that the control consistently registers higher values.

To facilitate a straightforward comparison, we can apply a multiplier to align the pre-launch values.

Create multiplier:

# create multiplier
austin_Y = df_markets_austin.loc[df_markets_austin['date']==df_markets_austin['date'].max(),'Y'].iloc[0]
syn_control_Y = df_syn_control.loc[df_syn_control['date']==df_syn_control['date'].max(),'Y'].iloc[0]

M = austin_Y / syn_control_Y

# M = 0.9556415990151904
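
One caveat: M is computed from a single day (the last pre-launch observation), which makes it sensitive to day-to-day noise. A sketch of a slightly more robust variant that averages the final pre-launch week instead (the 7-day window is an arbitrary choice):

# average the last 7 pre-launch days instead of relying on a single day
austin_mean = df_markets_austin.sort_values('date')['Y'].tail(7).mean()
control_mean = df_syn_control.sort_values('date')['Y'].tail(7).mean()
M_robust = austin_mean / control_mean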

Simulate post-launch data:

  • Create two weeks of data following the launch date of 2021-04-01
# create a list of dates
date_list = []
start_date = df_markets_austin['date'].max()

k = 15
for day in range(1,k):
    date = start_date + timedelta(days=day)
    date_list.append(date)

# create fake austin data post-launch
data_austin_new = {
    "date":date_list, 
    "Y":   df_markets_austin.tail(14)['Y'].values*1.2, #assuming a 20% lift
    "location":  ['austin'] * 14
}
df_austin_new = pd.DataFrame(data_austin_new)
df_markets_austin_test = pd.concat([df_markets_austin,df_austin_new])

# create fake synthetic control data post-launch
data_syn_control_new = {
    "date":date_list, 
    "Y":   df_syn_control.tail(14)['Y'].values, 
    "location":  ['syn_control'] * 14
}
df_syn_control_new = pd.DataFrame(data_syn_control_new)
df_syn_control_test = pd.concat([df_syn_control,df_syn_control_new])

#adjust synthetic control with multiplier M
df_syn_control_adj = df_syn_control_test.copy()
df_syn_control_adj['Y'] = df_syn_control_adj['Y']*M

# combine austin and adjusted control data
df_syn_control_austin_adj = pd.concat([df_markets_austin_test,df_syn_control_adj]
                                      ,ignore_index = True)

Visualize Austin vs. adjusted control:

ax = sns.lineplot(data=df_syn_control_austin_adj, x="date", y="Y",hue = 'location',palette=['purple', 'pink'])
ax.axvline(x = date(2021,3,31),c='b', linestyle = "dashed")
plt.xticks(rotation=45)

We've created a compelling graph that clearly illustrates the difference before and after launch!
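
The graph tells the qualitative story; to put a number on it, sum each series over the post-launch window and compare. A minimal sketch using the dataframes built above (with the simulated 20% lift, the result should land near 20%, up to noise and the multiplier adjustment):

launch_date = pd.Timestamp(2021, 4, 1)

post_austin = df_markets_austin_test.loc[df_markets_austin_test['date'] >= launch_date, 'Y'].sum()
post_control = df_syn_control_adj.loc[df_syn_control_adj['date'] >= launch_date, 'Y'].sum()

incremental = post_austin - post_control     # incremental units driven by the campaign
lift_pct = 100 * incremental / post_control  # lift relative to the counterfactual
print(f"incremental: {incremental:,.0f}, lift: {lift_pct:.1f}%")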

Take-away

While real-life data may be more complex, comparing the test market against its synthetic control lets us observe incremental lift over time.

Since we use historical data to calibrate the control group to closely match the treatment group before the intervention, the synthetic control method often yields more reliable and interpretable results than the difference-in-differences approach.

Resources

  • Github link for everything in this note
  • Meta GeoLift blog
  • Meta GeoLift github
  • GeoLift user Facebook group

You might also enjoy this:

3 Painful Mistakes I Made as a Junior Data Scientist

Connect with me: Twitter/X | LinkedIn
