Enhancing NPS Measurement with LLMs and Statistical Inference

Introduction
In business analytics, calculating the Net Promoter Score (NPS) typically involves manual data annotation by employees. One might be tempted to use a machine learning model to label the data instead, but machine labels alone do not carry the statistical guarantees we get from human-labeled data. Enter Prediction-Powered Inference (PPI), a recent statistical technique that combines human- and machine-labeled data to build confidence intervals that are both data efficient and theoretically valid.
This article explains the intuition behind PPI and why you would want to use it. We then walk through code that applies it to two metrics: NPS and customer recommendations.
Prediction-Powered Inference (PPI)
PPI is a statistical technique proposed by Angelopoulos et al. [1]. The goal is to enhance confidence intervals by combining human and machine labeled data. Let's walk through some steps to motivate its usefulness.
In our use case we want to estimate the true NPS given a set of customer reviews. Typically, an employee manually reads each review and assigns a score from 1 to 10, a reliable but slow method. When dealing with a large number of reviews, a more automatic method would be convenient.
To address this, we can leverage a machine learning model. A Large Language Model (LLM) is a good candidate because LLMs generalize well to new tasks. The model is prompted to read the review and output a score. This is convenient, but the model's predictions come with errors and imperfections. Before making a decision, we need to make sure our data is aligned with human judgment.
Considering the limitations of both approaches, what if we could combine them? We can with Prediction-Powered Inference (PPI)! PPI is a framework that leverages the theoretical guarantees of human-labeled data for confidence intervals and the efficiency of machine-labeled data. With PPI, we aim to benefit from the strengths of both techniques.
How it Works
The core of PPI rests on something called the rectifier, which accounts for the prediction error of our machine learning model. Using the rectifier, we can construct confidence intervals that combine both the human- and machine-labeled data.
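To make this concrete (this is my plain-language summary of the construction in [1], using the same names as the code below): with n human-labeled reviews and N reviews labeled only by the model f, the prediction-powered estimate of the mean is

    theta_tilde = mean of f's predictions on the N unlabeled reviews
    rectifier   = mean of (f's prediction - human label) over the n labeled reviews
    theta_PP    = theta_tilde - rectifier

The first term is what the model alone believes the mean is, the rectifier measures the model's average error where we can check it against humans, and subtracting it removes the model's bias from the estimate.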
Here's the algorithm proposed for constructing the confidence intervals:

This approach enjoys a simple code implementation. Here is a short snippet to make this work:
import numpy as np
from scipy.stats import norm

def pp_mean_iid_asymptotic(Y_labeled, Yhat_labeled, Yhat_unlabeled, alpha):
    n = Y_labeled.shape[0]                       # number of human-labeled examples
    N = Yhat_unlabeled.shape[0]                  # number of machine-labeled examples
    tildethetaf = Yhat_unlabeled.mean()          # mean of model predictions on unlabeled data
    rechat = (Yhat_labeled - Y_labeled).mean()   # rectifier (delta hat): average model error
    thetahatPP = tildethetaf - rechat            # prediction-powered estimator
    sigmaftilde = np.std(Yhat_unlabeled)         # imputed std dev
    sigmarec = np.std(Yhat_labeled - Y_labeled)  # rectifier std dev
    hw = norm.ppf(1 - alpha / 2) * np.sqrt(sigmaftilde**2 / N + sigmarec**2 / n)  # normal-approximation half-width
    return [thetahatPP - hw, thetahatPP + hw]    # confidence interval
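To see it in action, here is a tiny usage sketch with made-up arrays (the values are arbitrary and only meant to show the expected shapes):

# hypothetical toy data: 5 human-labeled reviews and 20 model-only predictions
Y_labeled = np.array([7, 9, 3, 8, 10])            # human scores
Yhat_labeled = np.array([6, 9, 4, 8, 9])          # model scores for the same reviews
Yhat_unlabeled = np.array([8, 7, 9, 2, 10] * 4)   # model scores for reviews no human read
ci = pp_mean_iid_asymptotic(Y_labeled, Yhat_labeled, Yhat_unlabeled, alpha=0.05)
print(ci)  # [lower, upper] bounds on the mean rating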
If you want to go deeper, I recommend this YouTube video with Clara Wong-Fannjiang (one of the authors) or reading the paper. These resources do a much better job of explaining the concepts than I can here.
What's important to understand is that PPI produces tighter confidence intervals than intervals built from human labels alone, and it has theoretical guarantees that prediction-only confidence intervals lack. Understanding this is enough to work through the coding exercise.
The Approach
The link to the full notebook can be found here. I'll walk through the key steps with some commentary. A lot of this code is credited to the authors of the PPI library.
In our example we are going to use PPI to estimate a mean value. It is also possible to estimate other parameters, such as quantiles and logistic/linear regression coefficients. After working through this example you can find more here.
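For instance, if you cared about a median rating rather than the mean, the library has a quantile version of the confidence interval. The sketch below assumes a ppi_quantile_ci(Y, Yhat, Yhat_unlabeled, q, alpha) signature; check the repo for the exact API before relying on it:

from ppi_py import ppi_quantile_ci  # assumed import; verify against the repo

# assumed signature: labeled Y, labeled predictions, unlabeled predictions, quantile q
median_ci = ppi_quantile_ci(Y_labeled, Yhat_labeled, Yhat_unlabeled, q=0.5, alpha=0.05)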
Setting things up
First, let's install the PPI library. For more information about the library, check out the repo here.
pip install ppi-python
For this example we will simulate data, since publicly available NPS datasets are hard to find, so I've written the following code to generate it. The approach is the same regardless of where the data comes from. We will create a DataFrame with the columns Overall_Rating (NPS scores) and Recommended (a boolean value).
import random
import numpy as np
import pandas as pd

def simulate_nps_scores(n, mu=3, mu2=9, std_dev=1):
    # simulate each mode of a bimodal distribution
    X1 = np.random.normal(mu, std_dev, n // 3)
    X2 = np.random.normal(mu2, std_dev, n // 3)
    X = np.concatenate([X1, X2])
    X3 = np.ones(n - X.shape[0])          # make the distribution 1-inflated
    X = np.concatenate([X, X3])
    X = np.clip(X, a_min=1, a_max=10)     # clip to the 1-10 range used for NPS
    return X

def simulate_recommended(mean, n):
    # each customer recommends with probability `mean`
    return np.array([1 if random.uniform(0, 1) <= mean else 0 for _ in range(n)])
Using these functions we can construct the DataFrame:
N = 20000
data = pd.DataFrame({
    'Overall_Rating': simulate_nps_scores(N),
    'Recommended': simulate_recommended(0.34, N)
})
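As a quick, optional sanity check on the simulated data, you can look at the ground-truth means we will later try to recover:

print(data['Overall_Rating'].mean())  # true mean rating we want to estimate
print(data['Recommended'].mean())     # true recommendation rate (should be near 0.34)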
Simulating LLM Predictions
To make the demo more flexible, I have provided two options for creating predictions. The first is to simulate prediction errors directly, which lets you experiment with PPI under different error models. Here's what that looks like:
# target_response, error_std_dev, and error_prob are set earlier in the notebook
if target_response == 'NPS':
    Y_total = data.Overall_Rating.to_numpy()
    # add Gaussian noise to each true score, then clip back to the 1-10 range
    Yhat_total = np.array([random.normalvariate(x, error_std_dev) for x in Y_total])
    Yhat_total = np.array([max(min(x, 10), 1) for x in Yhat_total])
elif target_response == 'recommended':
    Y_total = data.Recommended.to_numpy()
    # flip each label with probability error_prob
    Yhat_total = np.array([
        x if random.uniform(0, 1) >= error_prob else int(not x)
        for x in Y_total
    ])
else:
    raise Exception('Invalid target_response')
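Before running PPI, it can help to check how noisy these simulated predictions are. A quick, purely illustrative check:

print(np.mean(np.abs(Yhat_total - Y_total)))  # mean absolute error of the simulated "model"
print(np.mean(Yhat_total), np.mean(Y_total))  # prediction mean vs. true mean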
LLM Predictions
If you have data with customer reviews, then it is easy to score them with an LLM. Here are some prompts you can use to do this:
NPS_prompt_template = lambda review: f"""Given the following review please return the Net Promoter Score (NPS).
Return only the integer value from 1-10 and nothing else.
Review:
{review}
NPS:"""
recommended_prompt_template = lambda review: f"""Given the following review please determine if the customer would recommend the business.
Return only 'True' or 'False'.
Review: {review}
Recommended:"""
You may wish to use recommended over NPS, since a binary classification task is much easier for the model; on the other hand, some businesses prefer NPS because it is more of an industry standard. Choose whichever makes more sense for your problem.
With the LLM you can also check scores for different categories, such as specific products or services mentioned in the reviews that you want to measure. This flexibility is a great benefit of using an LLM, and by using PPI in our reporting we still account for its errors.
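For example, a category-specific prompt could look something like this (a hypothetical variant of the templates above, not something used in the notebook):

category_prompt_template = lambda review, category: f"""Given the following review please return the Net Promoter Score (NPS) for the customer's experience with {category} only.
If the review does not mention {category}, return 'N/A'.
Return only the integer value from 1-10 or 'N/A' and nothing else.
Review:
{review}
NPS:"""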
Run PPI
Now it's time to construct confidence intervals. Here is a snippet of the code that does the heavy lifting:
from tqdm import tqdm
from ppi_py import ppi_mean_ci, classical_mean_ci

# ns, num_trials, n_total, and alpha are defined earlier in the notebook
for i in tqdm(range(ns.shape[0])):
    for j in range(num_trials):
        # Prediction-Powered Inference
        n = ns[i]
        rand_idx = np.random.permutation(n_total)
        _Yhat = Yhat_total[rand_idx[:n]]
        _Y = Y_total[rand_idx[:n]]
        _Yhat_unlabeled = Yhat_total[rand_idx[n:]]  # predictions for points with no human label
        ppi_ci = ppi_mean_ci(_Y, _Yhat, _Yhat_unlabeled, alpha=alpha)
        classical_ci = classical_mean_ci(_Y, alpha=alpha)
What we are doing here is simulating PPI vs. classical confidence intervals for different numbers of human responses (n). This information is plotted below.

This diagram compares different values of n, the number of human labels used. PPI always has tighter confidence intervals than the intervals built from human-labeled data alone. This demonstrates the key value of PPI: even though the ML model is flawed, combining its predictions with human data still yields better confidence intervals than not using it at all.
We can see similar results for recommended below.

In this case we are looking at the true percentage of customers who recommend the business. Again, PPI gives us tighter confidence intervals than using the human-labeled data alone.
You'll also notice the confidence interval built from the machine predictions alone, in yellow. Because these predictions are not perfectly accurate, that interval is biased and does not cover the true value. This is why we need some human-labeled data and cannot rely on machine labels alone.
Decision Making
Now let's consider how many human labels are needed to make a decision for the PPI and classical approaches.
Let's start with NPS. Suppose we want to simulate how many human-labeled examples we need to reject the null hypothesis that NPS is less than or equal to 4. We can run the following code to find the minimum value:
from scipy.optimize import brentq

# num_experiments, list_rand_idx, null_hypothesis, alpha, and statistical_power
# are defined earlier in the notebook
def _to_invert_ppi(n):
    n = int(n)
    nulls_rejected = 0
    # Data setup
    for i in range(num_experiments):
        rand_idx = list_rand_idx[i]
        _Yhat = Yhat_total[rand_idx[:n]]
        _Y = Y_total[rand_idx[:n]]
        _Yhat_unlabeled = Yhat_total[rand_idx[n:]]
        ppi_ci = ppi_mean_ci(_Y, _Yhat, _Yhat_unlabeled, alpha=alpha)
        # reject the null when the lower CI bound clears the null-hypothesis value
        if ppi_ci[0] > null_hypothesis:
            nulls_rejected += 1
    return nulls_rejected / num_experiments - statistical_power

n_ppi = int(brentq(_to_invert_ppi, 100, 1000, xtol=1))
This simulates the minimum number of examples needed to reject the null hypothesis for PPI. The code for the classical example follows similarly. See the notebook for full details.
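For completeness, the classical counterpart could look something like the sketch below. This mirrors the structure above and is my paraphrase of the notebook's classical version, assuming the same helper names; the search bracket passed to brentq is arbitrary here:

def _to_invert_classical(n):
    n = int(n)
    nulls_rejected = 0
    for i in range(num_experiments):
        rand_idx = list_rand_idx[i]
        _Y = Y_total[rand_idx[:n]]
        classical_ci = classical_mean_ci(_Y, alpha=alpha)  # uses only the human labels
        if classical_ci[0] > null_hypothesis:
            nulls_rejected += 1
    return nulls_rejected / num_experiments - statistical_power

n_classical = int(brentq(_to_invert_classical, 100, 5000, xtol=1))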
Let's look at the output here:
The PPI test requires n=334 labeled data points to reject the null.
The classical test requires n=987 labeled data points to reject the null.
The interpretation is that PPI requires 653 fewer human-labeled observations to reject the null hypothesis than using only human-labeled examples.
We can repeat this process for recommended. The only change we make is the value of the null hypothesis. We run a test for the null hypothesis that the true percent of customers that recommend the business is less than or equal to 0.3.
The PPI test requires n=461 labeled data points to reject the null.
The classical test requires n=1000 labeled data points to reject the null.
Here we can see that the classical approach needs 539 more observations than PPI to draw a conclusion.
Are these results meaningful?
653 or 539 observations may not sound like many, but in the world of internal data labeling it is a lot. Suppose it's Friday afternoon and your boss asks you to determine the NPS from a group of surveys that just came in. To make this determination you need to manually label some observations.
Suppose you can label 4 comments per minute, or 240 comments per hour. If you use PPI, you would get to leave 2-3 hours earlier than if you used classical confidence intervals. Reducing mundane tasks like this has real benefits for employee happiness, and since the overhead is small, the approach is worth investing in.
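The back-of-the-envelope math behind that claim, with the labeling rate as an assumption:

labels_per_hour = 4 * 60                  # assumed labeling rate: 4 comments per minute
print((987 - 334) / labels_per_hour)      # NPS case: roughly 2.7 hours saved
print((1000 - 461) / labels_per_hour)     # recommended case: roughly 2.2 hours saved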
Conclusion
This was a quick overview of how to use PPI to solve basic statistical inference problems. We saw how to estimate a population mean from a sample dataset for two different types of variables. The approach yields meaningful time savings for very little extra work.
For more examples of how to use PPI, check out this examples folder from the repo. They cover many more interesting use cases. Happy coding!
Thank you for reading the article! If you have additional questions or something was unclear, leave a comment and I will get back to you. If you want to see more articles like this one, please follow me on Medium and on LinkedIn.
If you found a technical error in this article, please let me know ASAP! I strive to make sure the information I publish is as correct as possible, but no one is perfect.
References:
[1] Anastasios N. Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I. Jordan, & Tijana Zrnic. (2023). Prediction-Powered Inference.