From Probabilistic to Predictive: Methods for Mastering Customer Lifetime Value

My iPad and I are back with more scrappy diagrams, in this, the final installment of my guide (for marketers and data scientists alike) to all things Customer Lifetime Value.

Welcome, once again, to my article series, "Customer Lifetime Value: the good, the bad, and everything the other CLV blog posts forgot to tell you." It's all based on my experience leading CLV research in a data science team in the e-commerce domain, and it's everything I wish I'd known from the start:

  • Part one discussed how to gain actionable insights from historic CLV analysis.
  • Part two covered real-world use cases for CLV prediction.
  • Part three walked through methods for modelling historic CLV, including practical pros and cons for each.

This progression from use case examples to practical application brings us to today's post on CLV prediction: which methods are available, and what can marketers and data scientists expect from each, when trying to apply them to their own data? We'll look at probabilistic versus machine learning approaches, some pros and cons of each, and finish up with some thoughts on how to embark on your own CLV journey.

But first, let's remind ourselves why we're here…

The "why" of CLV prediction…

Last post focussed on analysing past data to investigate the spending habits of different portions of your customer base (known as cohorts). We wanted to answer questions like "how much is an average customer worth to me after 6 months?" and, "how do the different cohort groups differ in what they buy?" Now, we're interested in estimating future CLV, and not only on a customer group level, but for individual consumers.

Part two discussed the many reasons you might want to do this. Much of the motivation stems from automated customer management: Reliable, timely CLV predictions can help you understand and better serve your customer base, nudge customers along a "loyalty journey", and even decide which customers to "fire". CLV insights can also help you anticipate revenue, and even make better decisions about which inventory to maintain. Check out that post for more ideas, as well as part one, which is full of questions to help you discover the "why" of CLV for your own organisation.

And now, the how…

All of that sounds great, right? But how can it be achieved? Two groups of techniques can help: probabilistic models, and machine learning algorithms. Let's examine each in turn.

Probabilistic Models for CLV Prediction

The goal of probabilistic models for CLV prediction is to learn certain characteristics of our customer base's historic purchasing data, and then use those learned patterns to make predictions about future spending. Specifically, we want to learn probability distributions for customers' purchase frequencies, purchase value, and churn rate, since all these factors combine to generate any given customer's likely future CLV.

While there are a number of probabilistic models available, the "Beta-Geometric Negative Binomial Distribution" model (or "BG-NBD" for short), is the best known and most frequently applied. Understanding it will help you understand the probabilistic approach in general, so to help you do that, I'm going to take a deep dive now, but mark the most crucial concepts in bold. Feel free to skim over the bold parts first, and then re-read for the details.

The BG-NBD model uses the Beta, Geometric, and Negative Binomial Distributions to learn about typical purchase frequencies and churn rates among your customers:

  • A Geometric distribution models the "buy till you die" process: the idea that after every purchase, a customer effectively tosses a coin: keep shopping… or "die"? Of course, we don't expect them to literally die. Rather, we assume that at some point a customer will either decide to stop shopping with us, or they'll simply forget all about us, and from that day they'll cease to be a customer, whether they even realise it themselves or not. The Geometric distribution describes how many purchases a customer makes before that happens.
  • A Beta distribution models how that per-purchase "death" (churn) probability varies across your customer base: some customers are far more likely than others to drop off after any given purchase.
  • A Negative Binomial distribution models the total number of purchases a customer makes over time. It does this by combining the assumptions of multiple distributions: while "alive", each customer purchases at a steady rate (a Poisson process, with Exponentially distributed gaps between purchases), and those purchase rates vary from customer to customer according to a Gamma distribution.
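If it helps to see that generative story written down, here's a tiny simulation of it in Python. The parameter values below are invented purely for illustration; in reality, fitting the BG-NBD model means learning them from your own transaction data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative parameters only: a fitted BG-NBD model learns these from data.
r, alpha = 0.25, 4.0   # Gamma heterogeneity in purchase rates across customers
a, b = 0.8, 2.5        # Beta heterogeneity in the per-purchase churn probability
T = 365                # observation window, in days

def simulate_customer() -> int:
    lam = rng.gamma(shape=r, scale=1 / alpha)  # this customer's purchase rate (per day)
    p = rng.beta(a, b)                         # chance they "die" after any given purchase
    t, purchases = 0.0, 0
    while True:
        t += rng.exponential(1 / lam)          # Exponential wait until the next purchase
        if t > T:
            break                              # observation window ended while still "alive"
        purchases += 1
        if rng.random() < p:                   # the "buy till you die" coin toss
            break
    return purchases

counts = [simulate_customer() for _ in range(10_000)]
print("Average purchases per simulated customer:", np.mean(counts))
```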

Phew, that's a lot of talk about distributions. If you'd like to learn more, here's an excellent article. But it's also enough if you just understand the point of what we're trying to do: We want to use these distributions to estimate the likelihood that any given customer is "alive" at any given time, and how many future purchases they're likely to make. Then we just need to factor in spending, and we'll have an estimated future CLV. But how?

There are two ways to do this. The simplest is just to take historic average transaction value:
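Roughly speaking, that means something like the following (a sketch in my own notation, not a formula from any particular library): if p is the modelled probability that a customer is still alive, m̄ is their historic average transaction value, and n is the number of future transactions we forecast, then

$$\widehat{\mathrm{CLV}} \;\approx\; \sum_{k=1}^{n} p \cdot \bar{m} \;=\; n \cdot p \cdot \bar{m}$$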

This relies on the simple assumption that the ‘alive' probability and purchase value will stay fairly constant over the next n transactions. But of course, this is unlikely: p usually changes after every purchase, generally getting higher as the customer becomes more loyal. This is clear in the graph below: the blue line is the probability of a customer being ‘alive', and the red lines show purchases; with each purchase, the blue slope becomes flatter, as higher loyalty means the customer is more likely to stay ‘alive'.

Repeat purchases (red lines) generally increase P-Alive, the probability that a customer is ‘alive' (blue lines). Source: Author provided based on Lifetimes package (and yes, this one was hard to draw!)
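For anyone who wants to produce this kind of chart for their own customers, here's a minimal sketch using the Lifetimes package mentioned in the caption. The file name, column names, and customer ID are placeholders, so adapt them to your own data:

```python
import pandas as pd
from lifetimes import BetaGeoFitter
from lifetimes.utils import summary_data_from_transaction_data
from lifetimes.plotting import plot_history_alive

# Hypothetical transaction log: one row per purchase, with customer ID and date.
transactions = pd.read_csv("transactions.csv", parse_dates=["order_date"])

# Convert raw transactions into the frequency / recency / T summary the model expects.
summary = summary_data_from_transaction_data(
    transactions, customer_id_col="customer_id", datetime_col="order_date"
)

# Fit the BG-NBD model on the whole customer base.
bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(summary["frequency"], summary["recency"], summary["T"])

# Plot p(alive) over time for a single (hypothetical) customer, with purchases marked.
one_customer = transactions[transactions["customer_id"] == "C-123"]
plot_history_alive(bgf, t=365, transactions=one_customer, datetime_col="order_date")
```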

The natural variability between customers is further driven by seasonality, global events, and all manner of other factors. So a better way to factor in purchasing value, and to capture variability in shopping patterns in general, is to include yet another probability distribution, called "Gamma". Here's how it works:

  • Your customer base will include everything from loyal, high-frequency buyers to infrequent, churn-prone buyers. The Gamma distribution represents how many of each kind of shopper you have, assigning different weights to different buying behaviours.
  • The "Gamma-Gamma model" uses two layers of Gamma distributions. The first assumes that the transaction values of each individual customer vary around that customer's own average, following a gamma distribution. The second layer assumes that the scale of this individual gamma distribution itself varies from customer to customer according to another gamma distribution, reflecting the variation in spending habits across the entire customer base.

The Gamma-Gamma model is often combined with the BG-NBD model to predict future CLV in monetary terms. Sounds great (if not exactly simple), right? So what are the practical implications of this method?
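Before answering that, here's roughly how the combined pipeline tends to look in code, continuing the Lifetimes sketch from above (so it reuses the transactions DataFrame and the fitted bgf model). The revenue column name, 12-month horizon and discount rate are all assumptions to adjust for your own case:

```python
from lifetimes import GammaGammaFitter
from lifetimes.utils import summary_data_from_transaction_data

# Rebuild the summary, this time including monetary value (assumed "revenue" column).
summary_mv = summary_data_from_transaction_data(
    transactions,
    customer_id_col="customer_id",
    datetime_col="order_date",
    monetary_value_col="revenue",
)

# Gamma-Gamma is fitted only on returning customers with positive average spend.
returning = summary_mv[(summary_mv["frequency"] > 0) & (summary_mv["monetary_value"] > 0)]

ggf = GammaGammaFitter(penalizer_coef=0.001)
ggf.fit(returning["frequency"], returning["monetary_value"])

# Combine BG-NBD (expected purchases and churn) with Gamma-Gamma (expected spend)
# into a discounted CLV estimate per customer over the next 12 months.
clv = ggf.customer_lifetime_value(
    bgf,                               # the BetaGeoFitter fitted earlier
    returning["frequency"],
    returning["recency"],
    returning["T"],
    returning["monetary_value"],
    time=12,                           # months
    discount_rate=0.01,                # monthly discount rate; an assumption
)
print(clv.sort_values(ascending=False).head())
```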


Pros and Cons of the Probabilistic Model

On the positive side:

  • It's tried-and-tested: This is an old, established technique, which has been successfully applied to diverse retail domains.
  • It's forward-looking: You can start making predictions into the future, and taking actions accordingly, to steer your business towards higher average CLV for all customers.
  • It makes churn explicit: One of the biggest ways to increase average CLV is to decrease churn rates. This technique explicitly models churn, allowing you to react to reduce it.
  • It ‘makes sense': The model parameters have intuitive interpretations, so you can explore the learned distributions to better understand your customer base's behaviour.

These advantages come with some hurdles, however:

  • It only works for non-contractual, continuous shoppers; that is, shoppers who don't have a recurring contract, and who can shop at any time. It may not be well suited for non-contractual, discrete buyers, such as those who buy a newspaper every weekend, without actually having a subscription.
  • It can be computationally intensive to fit all these distributions, especially with large datasets.
  • It's not a time-series model: Time series models are classes of probabilistic and machine learning models designed to learn about seasonalities and trends. The BG-NBD model does not natively include such features, although we try to capture some of their influence through the Gamma-Gamma component. Thus, instead of relying 100% on BG-NBD to forecast consumer spending, it might be desirable to do some dedicated time series modelling as well. This, of course, brings additional complexity and effort.
  • It's not typically profit-focussed: I've spoken a lot about the importance of thinking of the ‘V' in CLV in terms of profit, not just dollar transaction value. For example, a frequent buyer who returns many items, thus having a high transaction value but also causing the company significant shipping costs, should actually be considered a low CLV customer. Unfortunately, the BG-NBD model isn't explicitly designed to model transaction profit. You could try to incorporate it by ditching the Gamma-Gamma component and using a simple formula featuring average transaction profit instead: roughly, the expected number of future transactions multiplied by the customer's average historic profit per transaction.
  • Calculating margin isn't easy, though (as part three made very clear). You may wish to investigate variants of this model which try to handle this, such as the Pareto-NBD family and its extensions, some of which explicitly model the relationship between the number of transactions and their average profitability. I've found these to be less well supported by coding libraries and best practices, however, so the learning curve for implementation is likely to be steeper.
  • It won't help with first-time buyers: If you have customers with only one purchase, the BG-NBD model won't know whether they've already ‘died,' or are just going to be infrequent buyers going forward. In fact, customers with only one purchase will be rated definitely ‘alive,' as shown by the bright yellow bar in the plot below. Of course, this is unrealistic: Maybe their one purchase was such a bad experience that they'll never be back. Or maybe they bought a Porsche and they won't need another one any time soon. To help you figure this out, you may wish to combine your probabilistic model insights with a historical analysis of how many one-time customers you have, or how long a typical pause between first and second purchases is.
A typical CLV analysis graph of p(alive) based on Recency and Frequency. Long-term customers (high Recency) who purchase frequently are likely alive, which is reasonable. Yet, customers with only one purchase (represented as zero repeat purchases on the Frequency axis), are rated with p(alive) = 1 (definitely alive), which is unrealistic. Source: Author provided based on Lifetimes package

CLV Prediction with Machine Learning

We saw that probabilistic methods aim to learn distributions of individual features like customer spending rate, and then combine those learned distributions to make estimations. Machine Learning algorithms take a similar approach: here our goal is to learn relational patterns between features in some data, and then use those learned patterns to make predictions.

There are even more Machine Learning algorithms and architectures to choose from than there were with probabilistic approaches. So once again, I'll try to make the general and crucial concepts clear using one particularly well-known method: the RFM approach, where RFM stands for Recency, Frequency, and Monetary Value.

Let's start by clarifying the idea of learning patterns between features in data. It's obvious that an individual who shopped recently (Recency), shops often (Frequency), and spends a lot (Monetary Value), might be a high CLV customer. But how exactly do these three features combine to predict future CLV? Does recency trump frequency, meaning that if a customer used to shop with you often (good Frequency) but hasn't at all lately (poor Recency), they've churned, making their future CLV effectively $0? This seems plausible, but what kind of Recency values typically mark the point of no return? How does this value change, for high versus low frequency shoppers? We need to quantify exactly the strength and direction of influence each of these features has, in order to use them for making predictions.

To do this, we:

  • Take a dataset of customer purchases and divide it into pre- and post-threshold periods: for example, the pre-period could be the first 9 months of the year, and the post-period the final three.
  • Calculate the R, F, M features (and potentially others as well) for each individual customer in the pre-threshold period, and calculate the sum of their spending (Monetary Value) in the post-threshold period.
  • Train a machine learning algorithm on the pre-threshold features, and use it to make predictions about the post-threshold Monetary Value (MV), as if that data were the future. Of course, it's not really the future; it's still historic data, which means we have the true values and can compare them to what we predicted. Based on how wrong the predictions are, the algorithm keeps adjusting until its predictions get as close as possible to the true values (these steps are sketched in code just below).
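Here's a minimal sketch of those three steps using pandas and scikit-learn. The file name, column names, and the threshold date are assumptions, and the gradient-boosted regressor is just one possible choice of algorithm:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Hypothetical purchase data: one row per transaction.
transactions = pd.read_csv("transactions.csv", parse_dates=["order_date"])

threshold = pd.Timestamp("2024-10-01")  # end of the 9-month pre-period (assumed)
pre = transactions[transactions["order_date"] < threshold]
post = transactions[transactions["order_date"] >= threshold]

# R, F, M features per customer, calculated on the pre-threshold period only.
features = pre.groupby("customer_id").agg(
    recency=("order_date", lambda d: (threshold - d.max()).days),  # days since last purchase
    frequency=("order_date", "count"),                             # number of purchases
    monetary=("revenue", "mean"),                                  # average spend per purchase
)

# Label: total spend in the post-threshold period (zero if the customer didn't return).
label = post.groupby("customer_id")["revenue"].sum()
data = features.join(label.rename("future_spend")).fillna({"future_spend": 0.0})

X, y = data[["recency", "frequency", "monetary"]], data["future_spend"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)
print("Mean absolute error:", mean_absolute_error(y_test, model.predict(X_test)))
```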

Since we have the Monetary Value labels for the post-threshold period, we could try to predict those per customer (in machine learning terms, this is known as a regression problem). We could then rank customers, split them into some number of groups (based on collaboration with marketing or customer service experts, of course), and the business could develop tailored marketing or customer service strategies per group. In particular, customers with very low predicted CLV could be identified as churn risks, and given special treatment accordingly.

This might sound like a great plan, but it's not perfect: First, predicting exact values can be tricky, especially if the training data is small. Moreover, if you're going to rank and cluster customers based on predicted CLV anyway (and I'll come back to why you would do this in a moment), then why not try to predict the cluster directly? This would make the task a classification problem, and not only could it be easier to solve, but the outputs would be directly actionable: Customer A is predicted to land in the top tier CLV bucket; customer B is predicted to land in the bottom tier; we immediately know which campaign or strategy to funnel them to.

So how do we turn our regression problem into a classification one? When we calculate the post-threshold spending (MV) per customer, we need to cluster those values, assign them labels (such as low-, medium- and high-CLV), and train our classifier to predict those labels, instead of the underlying values. The only open question is: how to cluster the post-threshold MV values? The answer can be as simple as ranking and splitting into quantiles, such as the top 10%, middle 30%, and remaining 60%. Or, you could use a clustering algorithm: another type of machine learning algorithm which can discover clusters of values within a dataset. Whichever you choose should be based on collaboration with domain experts and those who intend to act upon the results of the project. The marketing team, for example, could help you decide how many quantiles or clusters would make sense in terms of developing targeted advertising campaigns to suit them.
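Continuing the sketch above, the switch to classification can be just a few extra lines. The 60/30/10 quantile split and the gradient-boosted classifier are, once again, assumptions to agree on with your domain experts:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Bin the post-threshold spend (from the earlier sketch's `data` frame) into tiers.
# Ranking first lets qcut split the many tied values (e.g. customers who spent zero).
data["clv_tier"] = pd.qcut(
    data["future_spend"].rank(method="first"),
    q=[0.0, 0.6, 0.9, 1.0],
    labels=["low", "medium", "high"],
)

X = data[["recency", "frequency", "monetary"]]
y = data["clv_tier"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = GradientBoostingClassifier(random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```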

Before we get to the pros and cons of ML approaches for predicting CLV, I'd like to clarify a couple of points from what we just saw.

  • First, I mentioned that you would probably rank and cluster customers after predicting their future CLV. You might be wondering, why go to the trouble? The thing is, it would technically be possible to skip this step, and instead create tailored strategies at the individual customer level, based on each customer's predicted future spending. However, such an approach is only feasible if it's fully data-driven and automated, and that's a huge undertaking in and of itself. Most companies just starting with CLV prediction won't be at that level yet.
  • Secondly, data practitioners should be aware of another approach, which involves clustering the R, F and MV input features calculated in the pre-threshold period, and using the cluster labels as input features, instead of the raw values. This might bring additional benefits like explainability: For example, it could be nice to explain to stakeholders that the trained model had quantified how customers in the best R, F and MV clusters produce the best future MV predictions. But of course, figuring out a good clustering strategy for each feature adds additional complexities, and will require much additional experimentation.
  • Third, and on the subject of input features, don't feel limited to Recency, Frequency and Monetary Value. Virtually any piece of information about customers could prove useful for understanding their shopping habits and predicting future spending. So think creatively, and ask your marketing and customer service teams for ideas: demographic information, acquisition channel (did this customer first sign up in-store or online, for instance), rewards program membership tier, emails clicked, number of returns, and many more, could all prove useful to a machine learning model.

Pros and Cons of Machine Learning Approaches

On the positive side:

  • It's forward-looking: Just as with the probabilistic models, machine learning approaches allow us to start making – and acting upon – estimations about the future.
  • It's versatile: You can potentially gain more accurate results by experimenting with which features you provide the model, enabling it to capture nuanced patterns in the input data which are good predictors of future spending.
  • It can unlock further insights: Machine learning models can detect patterns far too complex for humans to notice. Data scientists can apply explainability techniques to dig into what the model has learned about how each specific feature influences CLV, which can be immensely useful for marketers. For example, if having a high ‘Frequency' value turns out to be a strong predictor of high future spending, the company could invest extra effort in keeping themselves front of mind for customers, and making the purchase process as enjoyable as possible so customers keep coming back. If Monetary Value turned out to be a more useful feature for the model, the company might instead concentrate on cross- and upsell techniques, or other ways to entice customers to spend bigger.

On the negative side:

  • It's harder to get right: Machine learning projects always add a certain degree of complexity, and using ML for CLV prediction is no different. Given all the different algorithms and training paradigms out there, the different features which could be used, and the different strategies for clustering the output predictions, data scientists have a lot to think about. Plus, if you want CLV predictions on a recurring basis, you'll need a plan to deploy, monitor, debug, and periodically re-train the model. Personally, I see this as a rewarding challenge: This is what makes our job as data scientists interesting! But it is something to bear in mind, especially when explaining to stakeholders the feasibility and expected timeline of taking an ML approach.
  • It doesn't explicitly model customer churn: This is one downside compared to the probabilistic model. The good news is, you can model churn yourself, using a dedicated prediction model. The bad news (which is not so bad, if you take my attitude to Data Science challenges), is that it'll come with all the same extra complexities I just listed above.
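As a rough illustration of that last point, a dedicated churn model can reuse the RFM features from the earlier sketches. Defining ‘churned' as ‘no purchases in the post-threshold period' is a simplification you'd want to refine with your own domain knowledge:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# `data` comes from the earlier RFM sketch; churn = no spend in the post-period.
data["churned"] = (data["future_spend"] == 0).astype(int)

X = data[["recency", "frequency", "monetary"]]
y = data["churned"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

churn_model = GradientBoostingClassifier(random_state=42)
churn_model.fit(X_train, y_train)
print("ROC AUC:", roc_auc_score(y_test, churn_model.predict_proba(X_test)[:, 1]))
```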

Summing up this CLV Series

And now, over 10,000 words later, it's time for me to wrap up this complete guide to Customer Lifetime Value. The first half of the series focussed on how to use CLV information in your business: Part one discussed how to gain actionable insights from historic CLV analysis, while part two covered real-world use cases for CLV prediction. The second half was all about practical data science techniques: Part three talked about methods for modelling historic CLV, including practical pros and cons for each, and today's post focused on CLV prediction: which statistical and Machine Learning methods are available, and what can marketers and data scientists expect when trying to apply them to their own data?

I structured the posts in this way as a reminder to practitioners not to jump straight to the most complicated machine learning algorithm you've got the compute power to run. It may be wiser to start with a historic analysis to understand the story so far, and to form hypotheses about what affects customer spending. Your Marketing team can already take actions from such information, and you may then want to move on to more complex, potentially more accurate techniques for making CLV predictions.

And that's it from me! Thanks to all of you who've been devouring these posts. I've seen your follows and highlights, and I'm thrilled that this has been useful for you. I apologise that this last post was such a long time coming: I've been busy editing and co-writing a handbook of data science and AI, which has been a huge effort, but something I'm very proud of. Feel free to connect on LinkedIn or X if you'd like to stay updated on that. Otherwise, I hope to see you in one of my future posts on data science, marketing, Natural Language Processing, and working in tech.

Tags: Customer Lifetime Value Data Science Deep Dives Ecommerce Marketing
