Common A/B testing mistakes, Vol. 2

Author: Murphy

A year ago, I published an article discussing common mistakes in A/B testing. It turned out that many people are keenly interested in experimentation challenges and how to overcome them, so I've decided to publish an article on the next three common mistakes people make.

By avoiding these common mistakes, we can ensure that our experiments are reliable, valid, and informative, ultimately leading to better decision-making and more successful outcomes.


Multiplying the required sample size by the number of hypotheses

There is a well-known formula to calculate the sample size.

n = 2 · σ² · (z_{1-α/2} + z_{1-β})² / MDE²

It takes into account the variance of the metric, the significance level, the power of the test, and the MDE (minimum detectable effect).

However, when conducting multiple hypothesis testing, people often make the mistake of simply replacing the number "2" with the number of groups:

n = k · σ² · (z_{1-α/2} + z_{1-β})² / MDE²,  where k is the number of groups

Is this the right approach? Not exactly. An increased number of hypotheses inflates the type I error rate, so we need to control it by adjusting the significance level. The Bonferroni correction is commonly used for this purpose: its main idea is to divide the significance level by the number of hypotheses.

Every pairwise comparison between groups should be counted as a separate hypothesis, not every group.

Therefore, when there are 4 groups, for instance, the number of hypotheses is 6, which is the number of possible pairs of groups (4 choose 2).
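To make this concrete, here is a minimal sketch (using scipy, and assuming the 4-group example above) of how the Bonferroni correction changes the per-comparison significance level and the z-quantile that enters the sample size formula:

```python
from scipy.stats import norm

alpha = 0.05  # overall significance level
m = 6         # number of pairwise hypotheses for 4 groups

# Bonferroni: each of the m comparisons is tested at alpha / m
alpha_per_comparison = alpha / m                      # ~0.0083
z_uncorrected = norm.ppf(1 - alpha / 2)               # ~1.96
z_corrected = norm.ppf(1 - alpha_per_comparison / 2)  # ~2.64

print(alpha_per_comparison, z_uncorrected, z_corrected)
```

The larger quantile is what drives up the required sample size in the corrected formula below.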

Our correct formula is:

n = m · σ² · (z_{1-(α/m)/2} + z_{1-β})² / MDE²,  where m is the number of hypotheses

Let's compare the wrong approach and the right one.

For example, when the MDE is 0.1, the significance level is 0.05, the power is 0.8, and the variance is 1.5, the wrong approach would require 7064 samples, whereas the correct approach would require 10899 samples.

Concluding the A/B test after only 7064 samples could therefore result in an erroneous decision.
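Since the original formulas are shown as images, here is a minimal sketch that reproduces the numbers above. It assumes that both calculations use the number of pairwise hypotheses (m = 6) as the multiplier and that only the second one applies the Bonferroni correction; those assumptions are mine, but they match the figures of 7064 and 10899:

```python
from math import ceil
from scipy.stats import norm

def required_sample_size(variance, mde, alpha, power, multiplier):
    """n = multiplier * variance * (z_{1-alpha/2} + z_{1-beta})^2 / MDE^2"""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return ceil(multiplier * variance * (z_alpha + z_beta) ** 2 / mde ** 2)

variance, mde, alpha, power = 1.5, 0.1, 0.05, 0.8
m = 6  # pairwise hypotheses for 4 groups

# Wrong: multiplier applied, but alpha left at 0.05
print(required_sample_size(variance, mde, alpha, power, m))      # ~7064

# Right: multiplier applied and alpha divided by m (Bonferroni)
print(required_sample_size(variance, mde, alpha / m, power, m))  # ~10899
```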

Not running health checks

Many people rush into A/B tests without first performing health checks. Health checks ensure that the testing environment is stable and unbiased; if it is not, the test results may be invalid and unreliable.

A/A testing on historical data is an example of such a check. When running an A/A test, it's crucial to look at the distribution of p-values rather than at a single number, because any single run can show a difference between the control and treatment groups purely by chance.

The approach is the following:

  1. Select a sample size. Use the same formula and similar parameter values to those used in your real A/B tests.
  2. Create control and treatment groups. It's essential to use the same splitting algorithm used in the production system, just applied to historical data.
  3. Measure the results: calculate the desired metric for both groups.
  4. Analyze the results: compare the two groups with a statistical test to check that they are statistically similar, and record the p-value.
  5. Repeat steps 2–4 at least a thousand times.
  6. Examine the distribution of the obtained p-values. It should be uniform; if it is not, something in your setup is off and further analysis is required (a simulation sketch is shown below).
Uniform p-value distribution. Image by Author.
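Below is a minimal sketch of this procedure. The historical data here is synthetic and the t-test is just one possible choice; in a real health check you would load actual historical metric values and reuse your production splitting logic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-in for historical metric values; replace with real data.
historical_metric = rng.normal(loc=10.0, scale=1.5 ** 0.5, size=200_000)
sample_size = 7064  # step 1: from the sample size formula

p_values = []
for _ in range(1000):  # step 5: repeat the split many times
    # Step 2: split into "control" and "treatment" (no real treatment applied).
    idx = rng.choice(historical_metric.size, size=2 * sample_size, replace=False)
    control = historical_metric[idx[:sample_size]]
    treatment = historical_metric[idx[sample_size:]]
    # Steps 3-4: compute the metric and the p-value for the comparison.
    _, p_value = stats.ttest_ind(control, treatment)
    p_values.append(p_value)

# Step 6: the p-values should look roughly uniform on [0, 1].
# A Kolmogorov-Smirnov test against the uniform distribution is a quick check.
print(stats.kstest(p_values, "uniform"))
```

Plotting a histogram of the collected p-values is the more common way to eyeball uniformity; the KS test is just a compact numeric stand-in here.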

Being indifferent to negative results

Ignoring negative results can have serious consequences for a business's bottom line.

First and foremost, negative results provide valuable information about what doesn't work. It's easy to get excited about the positive results of an A/B test, but negative results are equally important: they can reveal flaws in a design or strategy, highlighting areas that need improvement or further exploration. If a business ignores negative results and simply goes with the option that performed better in the test, it may be missing out on opportunities to make meaningful improvements.

Moreover, negative results can be a warning sign that something isn't working as intended. For example, if an A/B test reveals that a new design change actually performs worse than the previous version, it could indicate a deeper issue with the design process or user experience. Ignoring negative results in this scenario could lead to a decline in user engagement, customer loyalty, and ultimately, revenue.


Thanks for reading and don't be afraid to make mistakes and learn. It's the only way to progress!

Tags: A/B Testing, Data Science, Experiment, Machine Learning, Statistics
