Optimizing Marketing Campaigns with Budgeted Multi-Armed Bandits
Let's dive right into a running example. Suppose a bank or a telecom company launches a new product/plan for existing customers. To promote this product, the company creates several call templates (scripts) for its sales representatives. The goal is to effectively convince customers to buy the new product or sign up for the new plan.
Here's how the campaign works:
- Call script creation: The marketing team develops multiple versions of the call script, each with a different approach to promoting the new product or plan.
- Agent calls: Sales agents use these scripts to call a subset of customers. Each customer interaction uses one of the predefined scripts.
- Data collection: During the calls, the company collects data on customer responses, such as interest shown, questions asked, and, ultimately, conversion rates (i.e., how many customers buy the new product or sign up for the new plan).
- Real-time analysis: The company analyzes this data in real time to evaluate the effectiveness of each script. This analysis helps determine which scripts are more successful at converting customers to the new plan/product.
- Strategy update: Based on the ongoing analysis, the company dynamically adjusts how often each script is used. Scripts with higher conversion rates are used more often, ensuring that the campaign becomes increasingly effective over time.
Next, we show how to model a simple version of this campaign using the conventional multi-armed bandit problem. As we add more details to make this model more realistic, we demonstrate how existing solutions and their simple adaptations fall short. We then present a new budgeted multi-armed bandit algorithm from our paper accepted for the KDD 2024 conference that performs very well at this task. We also provide links to the code and a short video summarizing the paper.
In this story, I use "we" because I am writing it together with Marco Heyden (LinkedIn, GitHub), the author of the algorithm idea and of our paper [1]. All subsequent plots were created by us with the code from this Jupyter notebook.
Multi-armed Bandit Solution
Our scenario is similar to a multi-armed bandit (MAB) problem. Imagine a player in a casino facing a slot machine ("bandit") with multiple arms, each with an unknown payoff (reward) distribution. The player's goal is to maximize their total winnings by deciding which arms to play and how often to play each one. The challenge lies in balancing exploration, trying different arms to gather information about their rewards, and exploitation, using the information gathered to play the arm with the highest known reward.
In our example, each call script acts like a slot machine arm, where the reward is the success of the script. The reward is 1 if the customer signs up for the new plan or buys the new product and 0 otherwise. For instance, three call scripts with conversion rates of 0.1, 0.3, and 0.7 have successes following a Bernoulli distribution with expected values of 0.1, 0.3, and 0.7, respectively. The figure below illustrates the cumulative rewards for different strategies. The purple line represents always using script 1 with a conversion rate of 0.1, while the green line represents always using script 3 with a conversion rate of 0.7. These lines define the range of probable rewards. The light blue line shows the cumulative reward for a strategy that randomly selects a script for each call. In a realistic environment where only estimates of conversion rates are available, the cumulative reward of a good strategy should be close to the green line and at least above the light blue line.
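To make the setup tangible, here is a minimal simulation sketch (our own illustrative code, not the notebook linked above) that draws Bernoulli rewards for the three scripts and accumulates them for the always-script-1, always-script-3, and uniformly random strategies; the conversion rates are the illustrative values from the text.

```python
import numpy as np

rng = np.random.default_rng(42)
conversion_rates = [0.1, 0.3, 0.7]  # scripts 1, 2, 3
n_calls = 1_000

def cumulative_reward(choose_arm):
    """Simulate n_calls calls and return the running total of conversions."""
    total, history = 0, []
    for t in range(n_calls):
        arm = choose_arm(t)
        total += rng.binomial(1, conversion_rates[arm])  # 1 if the customer converts
        history.append(total)
    return history

always_script_1 = cumulative_reward(lambda t: 0)                  # purple line
always_script_3 = cumulative_reward(lambda t: 2)                  # green line
random_script = cumulative_reward(lambda t: rng.integers(0, 3))   # light blue line
```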

A popular strategy for the multi-armed bandit problem is the Upper Confidence Bound (UCB) algorithm [2]. It assigns an upper confidence bound to the expected reward of each arm (call script) and plays the arm with the highest upper confidence bound. In this way, the algorithm actively explores actions with high uncertainty while exploiting actions with high known rewards. Mathematically, the UCB algorithm plays the arm i that solves

\arg\max_i \left( r_i(t) + \sqrt{2 \ln t / N_i(t)} \right)    (1)

where:
- rᵢ(t) is the empirical mean reward of arm i up to time t.
- Nᵢ(t) is the number of times arm i has been played up to time t.
- t is the total number of plays so far.
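As a rough illustration of how this rule can be implemented (a simplified sketch, not the code from our notebook), each arm is played once for initialization and afterwards the arm with the largest bound from formula (1) is selected:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
conversion_rates = np.array([0.1, 0.3, 0.7])
n_arms, n_calls = 3, 1_000

counts = np.ones(n_arms)                                # N_i(t): play each arm once to initialize
sums = rng.binomial(1, conversion_rates).astype(float)  # cumulative reward per arm

for t in range(n_arms + 1, n_calls + 1):
    ucb = sums / counts + np.sqrt(2 * math.log(t) / counts)  # upper confidence bounds, formula (1)
    arm = int(np.argmax(ucb))
    sums[arm] += rng.binomial(1, conversion_rates[arm])
    counts[arm] += 1
```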
The white line in the figure below shows this strategy in action for our example.

This upper bound is based on the Chernoff-Hoeffding bound, assuming a payoff distribution with support in [0,1], which is exactly our case. For reward distributions with a different support [aᵢʳ, bᵢʳ], where aᵢʳ and bᵢʳ are finite, the UCB should be scaled accordingly:

UCB_i^r(t) = r_i(t) + (b_i^r - a_i^r) \sqrt{2 \ln t / N_i(t)}    (2)
The Importance of the Budget
So far we have focused on maximizing the sum of rewards after a given number of calls. However, it is unrealistic to expect calls corresponding to different scripts to have the same duration. If the marketing team's capacity does not allow them to reach all customers within a given time budget (e.g., several months), it would be more practical to maximize the cumulative reward for a given call duration rather than the number of calls.
In our example, assume that the call durations (= costs) for scripts 1, 2, and 3 are constant (we will relax this assumption later) and equal to 1, 2, and 8 minutes, respectively. If we now plot the results against the total call duration instead of the number of calls, the strategy that always uses script 2 becomes the best choice, while the strategy that always uses script 3 becomes the worst. The UCB strategy above, which only considers conversion rates, now performs much worse than a random strategy.
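To see why, compare the expected conversions per minute of calling: script 1 yields 0.1/1 = 0.100, script 2 yields 0.3/2 = 0.150, and script 3 only 0.7/8 ≈ 0.088, so script 2 makes the best use of a fixed time budget.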

It is not difficult to fix the UCB strategy by normalizing the reward estimates rᵢ(t) by their cost counterparts cᵢ and adjusting the UCB as in formula (2) above:

UCB_i(t) = \left( r_i(t) + \sqrt{2 \ln t / N_i(t)} \right) / c_i    (3)
The UCB strategy updated in this way works quite well again, as shown in the figure below.
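For completeness, a minimal sketch of the modified selection step (our illustration, assuming the fixed per-script durations of 1, 2, and 8 minutes from above):

```python
import math
import numpy as np

costs = np.array([1.0, 2.0, 8.0])  # call durations in minutes for scripts 1, 2, 3

def cost_aware_ucb(sums, counts, t):
    """Pick the arm with the highest upper confidence bound on conversions per minute."""
    ucb = (sums / counts + np.sqrt(2 * math.log(t) / counts)) / costs  # formula (3)
    return int(np.argmax(ucb))
```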

Random Call Duration
Observe that the assumption of a fixed call duration is also unrealistic. When both rewards and costs are random, there are several ways to extend the UCB strategy to this case, such as:
Reductive: By treating the reward-to-cost ratio vᵢ = rᵢ/cᵢ as a single random variable and pulling the arm with the highest upper confidence bound UCBᵢᵛ for it:

UCB_i^v(t) = v_i(t) + (b_i^v - a_i^v) \sqrt{2 \ln t / N_i(t)}    (4)
United: By ignoring the variation in costs and using their empirical estimates cᵢ(t) to scale the reward bound UCBᵢʳ, similarly to formula (3):

UCB_i(t) = \left( r_i(t) + \sqrt{2 \ln t / N_i(t)} \right) / c_i(t)    (5)
Composite: By pulling the arm that maximizes the ratio UCBᵢʳ/LCBᵢᶜ of the upper reward bound to the lower cost bound:

UCB_i^r(t) / LCB_i^c(t) = \frac{r_i(t) + \sqrt{2 \ln t / N_i(t)}}{c_i(t) - (b_i^c - a_i^c) \sqrt{2 \ln t / N_i(t)}}    (6)
In (6) we assume that the rewards come from [0,1], as before, to keep the formula a bit simpler.
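To make the differences concrete, here is a minimal sketch of the per-arm scores the three variants compute (our own illustration; the names ratio_range and cost_range stand for the support widths used for scaling, as in formula (2), and are our naming rather than the paper's):

```python
import math
import numpy as np

def reductive_score(ratio_mean, counts, t, ratio_range):
    """Formula (4): UCB on the reward/cost ratio treated as a single random variable."""
    return ratio_mean + ratio_range * np.sqrt(2 * math.log(t) / counts)

def united_score(reward_mean, cost_mean, counts, t):
    """Formula (5): UCB on rewards (support [0, 1]) divided by the estimated mean cost."""
    return (reward_mean + np.sqrt(2 * math.log(t) / counts)) / cost_mean

def composite_score(reward_mean, cost_mean, counts, t, cost_range):
    """Formula (6): upper bound on rewards divided by a lower bound on costs."""
    ucb_r = reward_mean + np.sqrt(2 * math.log(t) / counts)
    lcb_c = np.maximum(cost_mean - cost_range * np.sqrt(2 * math.log(t) / counts), 1e-12)  # clipped to stay positive
    return ucb_r / lcb_c
```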
All the above strategies have problems, being either excessively optimistic, too pessimistic, or just optimizing the wrong quantity.
The Reductive strategy (4) seeks to maximize the expectation of the reward-cost ratio