Ways of Improving Your RFM Model in BigQuery

Author: Murphy  |  Views: 20732  |  Time: 2025-03-22 22:18:49

The RFM (Recency, Frequency, Monetary) model, with its simplicity and ease of implementation, remains a great tool for customer relationship management, offering valuable insights into customer behaviour.

Building on the groundwork from my previous article "How to Create an RFM Model in BigQuery", in this article, we will explore ways of improving the model.

Here's what we'll cover in this article:


Created by Author

So, you've got your RFM model up and running in BigQuery, sorting your customers into groups like Champions, Potential Loyalists, At Risk of Losing and so on. It's a great start, but we can kick it up a notch.

While breaking down your customers into these groups tells a nice story, adding what I like to call a Customer Score can reinforce the model with a single, intuitive metric.


Why and What is a Customer Score?

Created via DALL-E

Think back to the last time you decided to buy a mobile phone. There are tons of options out there, and trying to figure out which one has the best camera, battery life, or screen while also staying within your budget can feel like too much.

That's where tech review sites come in handy. Take DxOMark or UL Benchmarks, for example. They break down all those complicated rankings into simple scores.

It makes comparing smartphones nice and easy without having to get into the nitty-gritty of how those scores came about.

This is pretty much what the Customer Score is. Just like those handy scores make picking a phone easier, the Customer Score simplifies understanding your customers by boiling down all their actions into one easy-to-get number.

How to Create a Customer Score

Let's make a Customer Score, building on what we did with the RFM model last time.

WITH rfm_groupings AS (
SELECT
    customer_id,
    NTILE(5) OVER (ORDER BY days_since_last_visit desc) AS recency_rank,
    NTILE(5) OVER (ORDER BY total_visits) AS frequency_rank,
    NTILE(5) OVER (ORDER BY total_spend) AS monetary_rank,
FROM
    (
    SELECT
        customer_id,
        DATE_DIFF(current_date() - 1, max(order_date), day) as days_since_last_visit,
        COUNT(order_date) AS total_visits,
        SUM(total_spend) as total_spend,
    FROM
        `medium_examples.fake_transactions`
    GROUP BY
        customer_id
    )
)

SELECT
    *
FROM
    rfm_groupings

We grouped customers by how recently they've shopped, how often, and how much they've spent. Now, we're going to get even more detailed by breaking it down into 100 segments, not just five. So, instead of saying someone's in the top 40% for spending (4th quintile), we can also say they're in the 66th percentile.

    NTILE(100) OVER (ORDER BY days_since_last_visit desc) AS recency_rank_detailed,
    NTILE(100) OVER (ORDER BY total_visits) AS frequency_rank_detailed,
    NTILE(100) OVER (ORDER BY total_spend) AS monetary_rank_detailed,
Quintile vs Percentile Comparison

Then, we mix these percentiles with some custom weights that reflect what's important to us/the business. For example, in my setup, I've decided that how recently a customer shopped is 25% of their score, how often they shop is 40%, and how much they spend is 35%. This way, we're giving extra props to our loyal frequent customers.

cast
(
    ceil
    (
        (recency_rank_detailed * 0.25) + 
        (frequency_rank_detailed * 0.40) + 
        (monetary_rank_detailed * 0.35)
    )
as int64
) as customer_score

Now, our customers aren't just split into broad categories; they get their own score that shows us, at a glance, who's really more valuable to our brand.

Plus, this Customer Score is super handy for checking out how changes (like A/B tests) actually affect customer behaviour. Seeing the scores before and after a test can help reinforce which version is really winning.
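If you'd like to sanity-check the weighting maths outside BigQuery, here's the same calculation as a small Python sketch. The ranks below are hypothetical percentile values, not taken from any real dataset:

```python
import math

# Weights from the article: recency 25%, frequency 40%, monetary 35%.
WEIGHTS = {"recency": 0.25, "frequency": 0.40, "monetary": 0.35}

def customer_score(recency_rank, frequency_rank, monetary_rank):
    """Weighted blend of 1-100 percentile ranks, rounded up to a whole score."""
    blended = (
        recency_rank * WEIGHTS["recency"]
        + frequency_rank * WEIGHTS["frequency"]
        + monetary_rank * WEIGHTS["monetary"]
    )
    return math.ceil(blended)

# A hypothetical customer: 80th percentile recency, 90th frequency, 70th monetary.
print(customer_score(80, 90, 70))  # 80*0.25 + 90*0.40 + 70*0.35 = 80.5 -> 81
```

Because frequency carries the biggest weight, two customers with identical spend can land noticeably different scores depending on how often they shop.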

WITH rfm_groupings AS (
SELECT
    customer_id,
    NTILE(5) OVER (ORDER BY days_since_last_visit desc) AS recency_rank,
    NTILE(5) OVER (ORDER BY total_transactions) AS frequency_rank,
    NTILE(5) OVER (ORDER BY total_spend) AS monetary_rank,
    NTILE(100) OVER (ORDER BY days_since_last_visit desc) AS recency_rank_detailed,
    NTILE(100) OVER (ORDER BY total_transactions) AS frequency_rank_detailed,
    NTILE(100) OVER (ORDER BY total_spend) AS monetary_rank_detailed,
FROM
    (
    SELECT
        customer_id,
        DATE_DIFF(current_date() - 1, max(order_date), day) as days_since_last_visit,
        COUNT(order_date) AS total_transactions,
        SUM(total_spend) as total_spend,
    FROM
        `medium_examples.fake_transactions`
    GROUP BY
        customer_id
    )
)

SELECT
    customer_id,
    concat(recency_rank, frequency_rank, monetary_rank) as rfm_segment,
    cast
    (
        ceil
        (
            (recency_rank_detailed * 0.25) + 
            (frequency_rank_detailed * 0.40) + 
            (monetary_rank_detailed * 0.35)
        )
    as int64
    ) as customer_score
FROM
    rfm_groupings

Switching to a Different Banding Approach

Our model right now uses a function called NTILE to split customers into groups, called quintiles. But here's the thing: if lots of customers are super similar (or our dataset is just not that varied), you might see customers hopping between groups every time you run the model.

It's kinda like being right on the border between two counties. One small step could technically flip you from one side to the other, making it look like you're jumping back and forth between counties without really moving much at all.

Created via DALL-E

This can throw us off track, making us think people's habits are changing when they're actually staying pretty much the same. Let's look into some different ways to group our customers, beyond the usual NTILE method we've been using.


Fixed Values for Grouping

First up, we can think about setting specific cut-offs for our groups. This has its ups and downs. Below is an example using frequency.

CASE
    WHEN total_transactions <= 1 THEN 1
    WHEN total_transactions <= 3 THEN 2
    WHEN total_transactions <= 5 THEN 3
    WHEN total_transactions <= 10 THEN 4
    WHEN total_transactions > 10 THEN 5
END AS frequency_rank

Pros:

  • Clear-Cut: You know exactly what each group means, making things straightforward and consistent, unlike the NTILE method which can change every time you run it.
  • Tailor-Made: You can set your groups to match what you know about your customers. Say, if you notice most people buy once or twice, you can make that its own group.

Cons:

  • Unbalanced Groups: If you set your own groups, you might end up with a lot of people in some and hardly anyone in others. It's not necessarily bad, but it's something to keep in mind.
  • Keep It Fresh: Your groups might need tweaking as time goes on and your business or customers change. What used to be a VIP customer might not cut it anymore.
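Here's a quick Python mirror of that CASE expression, handy for spot-checking the boundaries (the cut-offs are the same ones used in the SQL above):

```python
def frequency_rank(total_transactions):
    """Mirror of the fixed-value CASE expression: map a transaction count to a 1-5 rank."""
    if total_transactions <= 1:
        return 1
    elif total_transactions <= 3:
        return 2
    elif total_transactions <= 5:
        return 3
    elif total_transactions <= 10:
        return 4
    return 5

# One-off buyers land in rank 1; anyone with more than 10 orders is rank 5.
print([frequency_rank(n) for n in [1, 2, 4, 7, 12]])  # [1, 2, 3, 4, 5]
```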

Getting Finer with NTILE

Another idea – use NTILE to cut your customer base into even smaller slices, like deciles (10 groups) or percentiles (100 groups).

This method makes those borderline cases less of an issue because you're dealing with much smaller differences between groups. It's like zooming in for a closer look.

But, here's the catch. The more you break down your groups, the trickier it can get to segment. Suddenly, you've got loads of segments to manage, and it can be tough to figure out how to target or understand them all.

NTILE(5) OVER (ORDER BY total_visits) AS frequency_rank_quintile,
NTILE(10) OVER (ORDER BY total_visits) AS frequency_rank_decile,
NTILE(100) OVER (ORDER BY total_visits) AS frequency_rank_percentile,
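As a mental model of what NTILE is doing, here's a simplified Python sketch of the bucket assignment. It follows the standard SQL rule (which BigQuery uses) that when rows don't divide evenly, the leftover rows go one each to the earliest buckets:

```python
def ntile(n, row_count):
    """Return the 1-based NTILE bucket for each ordered row position."""
    base_size, remainder = divmod(row_count, n)
    buckets = []
    for bucket in range(1, n + 1):
        # The first `remainder` buckets each hold one extra row.
        size = base_size + (1 if bucket <= remainder else 0)
        buckets.extend([bucket] * size)
    return buckets

# 10 customers into 4 buckets: buckets 1 and 2 absorb the two leftover rows.
print(ntile(4, 10))  # [1, 1, 1, 2, 2, 2, 3, 3, 4, 4]
```

This also shows why borderline customers can hop buckets: add or remove a single row and every boundary can shift by one position.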

Both fixed values and NTILE have their perks. It really boils down to what works best for you, how your data looks, and what your goals are. Finding the right balance between detail and simplicity is key to making sure your customer groups really work for your marketing plan.


Adding Milestones to the Model

Let's refine our RFM model by incorporating a field I've termed "Milestones." This focuses on the length of time that has elapsed since a customer's first purchase. It's not just about how frequently they shop or their spending, but rather how long they've been with your company.

Created via DALL-E

This allows us to more reliably gauge which of our customers are new, loyal in the long term or inactive/lapsed. Here are my milestones:

  • Newcomers (1 Month): Customers who made their first purchase within the last month. They're the newest additions to our company.
  • Nesting (1–3 Months): Relatively new and still getting to know us.
  • Established (3–12 Months): They've had some time to engage with our brand.
  • Multi-Year (1–3 Years): They have a longer history with us.
  • Long-Term (3+ Years): They represent our longest-standing relationships.

CASE 
    WHEN time_since_first_order <= 30 THEN "Newcomers (1 Month)"
    WHEN time_since_first_order <= 90 THEN "Nesting (1 - 3 Months)"
    WHEN time_since_first_order <= 365 THEN "Established (3 - 12 Months)"
    WHEN time_since_first_order <= (365 * 3) THEN "Multi-Year (1 - 3 Years)"
    WHEN time_since_first_order > (365 * 3) THEN "Long-Term (3+ Years)"
END AS milestones,

Why add this to our model? Well, while it can be insightful to see who's buying what right now, I believe knowing how long they've been shopping with us gives us another layer of insight.

Sometimes, with RFM, we guess who's new by seeing who recently made a purchase but hasn't bought much (R = 4–5, F = 1). But, to be honest, that method might not always hit the mark.

Bringing a "milestones" aspect into the RFM model makes things clearer. It's not just about spotting the latest purchases; it's about seeing our history with each customer. It helps us see the full story, showing us how our relationship with customers has grown over time, and to tailor our marketing strategy accordingly.
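If you want to spot-check the milestone boundaries, here's the same bucketing in Python (the dates below are made up purely for illustration):

```python
from datetime import date

def milestone(first_order: date, as_of: date) -> str:
    """Mirror of the milestones CASE: bucket a customer by days since their first order."""
    days = (as_of - first_order).days
    if days <= 30:
        return "Newcomers (1 Month)"
    elif days <= 90:
        return "Nesting (1 - 3 Months)"
    elif days <= 365:
        return "Established (3 - 12 Months)"
    elif days <= 365 * 3:
        return "Multi-Year (1 - 3 Years)"
    return "Long-Term (3+ Years)"

print(milestone(date(2024, 1, 10), date(2024, 2, 1)))  # 22 days -> Newcomers (1 Month)
print(milestone(date(2020, 2, 1), date(2024, 2, 1)))   # ~4 years -> Long-Term (3+ Years)
```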


Expanding the Model to Include Email Engagement

Why not take a closer look at how customers interact with your emails? Clicking one of your emails means they're probably more into what you're offering than those who don't click at all.

With Apple's Mail Privacy Protection messing with open rates, we're focusing on email clicks to gauge interest. But here's the thing: not everyone gets the same number of emails. Some people get 15, some only 3, so we need to be fair in how we judge who's really engaged.

Photo by Brett Jordan on Unsplash

We start by looking at the click rate, which is a step in the right direction. But there's still a bit of a challenge.

For example, if someone clicks 1 out of 1 email, that's a 100% click rate – sounds great, right? But then you have someone else who clicks on 8 out of 10 emails. They've got an 80% click rate and therefore sit lower on the leaderboard compared to the first customer, but they're keeping up with us more consistently.

To level the playing field, we can introduce a scaling system that takes into account not just how often they click, but also how many emails they've had the chance to click on. We're using logarithmic scaling here, which means we reward lots of clicks without unfairly treating those who had fewer emails to click on.

CEIL(
  (
    LOG(1 + emails_clicked, 10)
    *
    (emails_clicked / NULLIF(emails_received, 0))
  ) * 100
) AS weighted_engagement_score

Here are a few examples of what the engagement score could look like.

The formula looks at how often you click on emails and how many you get. So, if you're clicking a lot out of a bunch of emails, like 16 out of 19, you'll get a high score (104) because it shows you're really into what we're sending. But, if you click just once because you only got one email, your score (31) won't be as high.
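To see exactly how those numbers come about, here's the same scoring in Python. One note: the scores quoted above (104 and 31) come out of a base-10 logarithm, and BigQuery's single-argument LOG is the natural log, so in SQL the base needs to be passed explicitly as LOG(1 + emails_clicked, 10):

```python
import math

def weighted_engagement_score(emails_clicked, emails_received):
    """Log-scaled click engagement: rewards absolute clicks, weighted by click rate."""
    if emails_received == 0:
        return None  # mirrors NULLIF(emails_received, 0) in the SQL
    click_rate = emails_clicked / emails_received
    return math.ceil(math.log10(1 + emails_clicked) * click_rate * 100)

print(weighted_engagement_score(16, 19))  # 104: lots of clicks AND a high rate
print(weighted_engagement_score(1, 1))    # 31: perfect rate, but only one opportunity
```

Notice that the consistent 8-of-10 clicker (score 77) now outranks the single-email clicker, which is exactly the fairness we were after.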

This helps us spot the real email enthusiasts – people who not only engage with your email marketing but also have plenty of opportunities to do so. It's about balancing quantity with enthusiasm, making our model both fair and insightful.

By adding this score to our existing RFM model, using the same NTILE function, we get a new and improved RFM-E model.


Building a Monthly RFM Model Plus Backdating

Ever wondered how your customers' shopping habits change over time? Well, I think it's super important to keep an eye on this by taking snapshots of where each customer stands in your RFM model every month.

Interested? Here's how to get that set up.

Photo by Isaac Smith on Unsplash

Getting this set up is simpler than it sounds. First, we grab the current month using date_trunc(current_date(), month). To keep our query neat, we store this date in a variable and use it throughout our query. This way, we're making sure our snapshot captures the customer segments just as they are at the month's start.

Here's the setup in my query. I also included the customer score and milestones as highlighted earlier in the article.

DECLARE month_snapshot DATE;
SET month_snapshot = date_trunc(current_date(), month);

WITH rfm_groupings AS (
SELECT
    customer_id,
    milestones,
    NTILE(5) OVER (ORDER BY days_since_last_visit desc) AS recency_rank,
    NTILE(5) OVER (ORDER BY total_transactions) AS frequency_rank,
    NTILE(5) OVER (ORDER BY total_spend) AS monetary_rank,
    NTILE(100) OVER (ORDER BY days_since_last_visit desc) AS recency_rank_detailed,
    NTILE(100) OVER (ORDER BY total_transactions) AS frequency_rank_detailed,
    NTILE(100) OVER (ORDER BY total_spend) AS monetary_rank_detailed,
FROM
    (
    SELECT
        customer_id,
        CASE 
            WHEN DATE_DIFF(month_snapshot, min(order_date), day) <= 30 then "Newcomers (1 Month)"
            WHEN DATE_DIFF(month_snapshot, min(order_date), day) <= 90 then "Nesting (1 - 3 Months)"
            WHEN DATE_DIFF(month_snapshot, min(order_date), day) <= 365 then "Established (3 - 12 Months)"
            WHEN DATE_DIFF(month_snapshot, min(order_date), day) <= (365 * 3) then "Multi-Year (1 - 3 Years)"
            WHEN DATE_DIFF(month_snapshot, min(order_date), day) > (365 * 3) then "Long-Term (3+ Years)"
        END AS milestones,
        DATE_DIFF(month_snapshot, max(order_date), day) as days_since_last_visit,
        COUNT(order_date) AS total_transactions,
        SUM(total_spend) as total_spend,
    FROM
        `medium_examples.fake_transactions`
    WHERE
        order_date < month_snapshot
    GROUP BY
        customer_id
    )
)
SELECT
    month_snapshot,
    customer_id,
    milestones,
    concat(recency_rank, frequency_rank, monetary_rank) as rfm_segment,
    cast
    (
        ceil
        (
            (recency_rank_detailed * 0.25) + 
            (frequency_rank_detailed * 0.40) + 
            (monetary_rank_detailed * 0.35)
        )
    as int64
    ) as customer_score
FROM
    rfm_groupings

The Result

Once you run this query, you can save the results in a table. I highly recommend setting up partitioning and clustering.


CREATE TABLE medium_examples.rfm_monthly_snapshot
PARTITION BY month_snapshot
CLUSTER BY milestones, rfm_segment
AS
(
  -- the monthly snapshot query from above goes here
)

By creating your RFM snapshot table with partitioning and clustering, you're making sure that as your table grows with each monthly snapshot, your queries can remain quick and cost-effective. This is particularly important because you're likely to query this table frequently to analyse customer behaviour changes over time.

With your table setup, now you need to update your query to insert into the table.


INSERT INTO medium_examples.rfm_monthly_snapshot
(
  -- the monthly snapshot query from above goes here
)

Next, automate this query to run at the start of each month (or a few days later if you're waiting on complete data). This way, without any extra effort, you keep getting fresh snapshots.


Photo by 2H Media on Unsplash

Backfilling Previous Months

If you've been collecting transaction data for a few years, you might want to catch up on creating monthly snapshots for all the months you've missed.

The straightforward way would be to set a date variable at the start of your query to mark the snapshot's month, like so:

DECLARE month_snapshot DATE;
SET month_snapshot = "2024-02-01";

This method works fine, but imagine doing it month by month going back years if you're way behind. Sounds like a chore, right?

Here's a better idea: wrap your query into a stored procedure. This lets you run the same query by simply calling the procedure, and as a bonus, you can move the month_snapshot variable into the procedure declaration.

You'll see why it's a bonus soon!

CREATE OR REPLACE PROCEDURE medium_examples.insert_rfm_monthly_snapshot(month_snapshot DATE)
BEGIN

-- delete the DECLARE and SET variable lines,
-- then paste the INSERT INTO query from above here
END

This allows us to call the procedure and specify the month we want to backfill from a single line of code.

CALL `medium_examples.insert_rfm_monthly_snapshot`("2024-02-01");

Here is my final code for the procedure.

CREATE OR REPLACE PROCEDURE medium_examples.insert_rfm_monthly_snapshot(month_snapshot DATE)
BEGIN

INSERT INTO medium_examples.rfm_monthly_snapshot
(
WITH rfm_groupings AS (
SELECT
    customer_id,
    milestones,
    NTILE(5) OVER (ORDER BY days_since_last_visit desc) AS recency_rank,
    NTILE(5) OVER (ORDER BY total_transactions) AS frequency_rank,
    NTILE(5) OVER (ORDER BY total_spend) AS monetary_rank,
    NTILE(100) OVER (ORDER BY days_since_last_visit desc) AS recency_rank_detailed,
    NTILE(100) OVER (ORDER BY total_transactions) AS frequency_rank_detailed,
    NTILE(100) OVER (ORDER BY total_spend) AS monetary_rank_detailed,
FROM
    (
    SELECT
        customer_id,
        CASE 
            WHEN DATE_DIFF(month_snapshot, min(order_date), day) <= 30 then "Newcomers (1 Month)"
            WHEN DATE_DIFF(month_snapshot, min(order_date), day) <= 90 then "Nesting (1 - 3 Months)"
            WHEN DATE_DIFF(month_snapshot, min(order_date), day) <= 365 then "Established (3 - 12 Months)"
            WHEN DATE_DIFF(month_snapshot, min(order_date), day) <= (365 * 3) then "Multi-Year (1 - 3 Years)"
            WHEN DATE_DIFF(month_snapshot, min(order_date), day) > (365 * 3) then "Long-Term (3+ Years)"
        END AS milestones,
        DATE_DIFF(month_snapshot, max(order_date), day) as days_since_last_visit,
        COUNT(order_date) AS total_transactions,
        SUM(total_spend) as total_spend,
    FROM
        `medium_examples.fake_transactions`
    WHERE
        order_date < month_snapshot
    GROUP BY
        customer_id
    )
)
SELECT
    month_snapshot,
    customer_id,
    milestones,
    concat(recency_rank, frequency_rank, monetary_rank) as rfm_segment,
    cast
    (
        ceil
        (
            (recency_rank_detailed * 0.25) + 
            (frequency_rank_detailed * 0.40) + 
            (monetary_rank_detailed * 0.35)
        )
    as int64
    ) as customer_score
FROM
    rfm_groupings
);
END

Looping Through the Past

But what if you need to create snapshots for several past months? No problem. Use BigQuery's GENERATE_DATE_ARRAY to make a list of all the months you need, then loop through them.

First, in a new query just to familiarise ourselves with how this works, we need to generate an array of every month we wish to backfill. In my case, my earliest transaction is "2017-01-01", therefore my backfilling array starts from "2017-02-01".

DECLARE MONTH_RANGE ARRAY<DATE>;
SET MONTH_RANGE = GENERATE_DATE_ARRAY("2017-02-01", "2024-02-01", INTERVAL 1 MONTH);
SELECT MONTH_RANGE;

Next, we build our FOR loop, telling the query to run through every item in the month_range array we created and carry out a task for each month it iterates over. For demonstration, we can simply select the value held by the loop variable.

DECLARE MONTH_RANGE ARRAY<DATE>;
SET MONTH_RANGE = GENERATE_DATE_ARRAY("2017-02-01", "2024-02-01", INTERVAL 1 MONTH);

FOR MONTH IN (SELECT month_snapshot FROM UNNEST(MONTH_RANGE) as month_snapshot)
DO
    SELECT MONTH.month_snapshot;
END FOR;

With our loop ready, we just swap out the SELECT statement for our procedure, feeding it the month picked up by the FOR LOOP.

DECLARE MONTH_RANGE ARRAY<DATE>;
SET MONTH_RANGE = GENERATE_DATE_ARRAY("2017-02-01", "2024-02-01", INTERVAL 1 MONTH);

FOR MONTH IN (SELECT month_snapshot FROM UNNEST(MONTH_RANGE) as month_snapshot)
DO
    CALL `medium_examples.insert_rfm_monthly_snapshot`(MONTH.month_snapshot);
END FOR;

This way, you're automating the process, making it a breeze to fill in all those missing months.

And don't forget about partitioning your data table. It's a smart move because it organises your data by month, so you're only working with the data you need at any given time, which can save you on costs.

Once your backfill is complete, you can easily see how a customer's behaviour changes over time. Even if your transactions are a bit random, like in my example, you'll start to see patterns and shifts in customer segments over the years. It's a powerful way to understand your customer base and how your relationship with them evolves.


Photo by Victoria Tronina on Unsplash

That brings this article to an end. I hope you found this helpful and if you have any questions, please feel free to comment, and I'll reply when I can.

I frequently write articles about BigQuery and Looker Studio. If you're interested, consider following me here on Medium for more!

All images, unless otherwise noted, are by the author.

Stay classy folks! Tom

Tags: BigQuery, Customer Value, Marketing, RFM, Segmentation
