Is This the Solution to P-Hacking?

Author: Murphy

In scientific research, the manipulation of data and peeking at results have been problems for as long as the field has existed. Researchers often aim for a significant p-value to get published, which can lead to the temptation of stopping data collection early or manipulating the data. This practice, known as p-hacking, was the focus of my previous post. If researchers decide to deliberately change data values or fake complete datasets, there is not much we can do about it. However, for some instances of p-hacking, there might be a solution available!

In this post, we dive into the topic of safe testing. Safe tests have some strong advantages over the traditional way of hypothesis testing. For example, this method of testing allows you to combine the results of multiple studies. Another advantage is optional stopping: you can end the experiment at any time you like. To illustrate safe testing, we will use the R package safestats, developed by the researchers who proposed the theory. First, we will introduce e-values and explain the problem they can solve. E-values are already used by companies like Netflix and Amazon because of their benefits.

I will not delve into the proofs of the theory; instead, this post takes a more practical approach, showing how you can use e-values in your own tests. For proofs and a thorough explanation of safe testing, the original paper is a good resource.


An Introduction to E-values

In hypothesis testing, which you can brush up on here, you assess whether to retain the null hypothesis or to accept the alternative. Usually, the p-value is used for this: if the p-value is smaller than the predetermined significance level alpha, you reject the null hypothesis in favor of the alternative.

E-values function differently from p-values but are related. The easiest interpretation of e-values is this: suppose you are gambling against the null hypothesis. You invest $1, and your return is $E. If the e-value E is between 0 and 1, you lose your bet: there is no evidence against the null hypothesis. On the other hand, if the e-value is larger than 1, you win, and the null hypothesis loses the game. A modest E of 1.1 implies limited evidence against the null, whereas a substantial E, say 1000, denotes overwhelming evidence.
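
To make the betting analogy concrete, here is a minimal sketch in base R (not the safestats package); it computes an e-value as a simple likelihood ratio for a coin-flip experiment, where the null p = 0.5 and the alternative p = 0.7 are made-up values chosen purely for illustration.

# Betting interpretation of an e-value, illustrated with a coin.
# Null hypothesis: the coin is fair (p = 0.5); alternative: p = 0.7.
# The likelihood ratio of the observed data is a valid e-value: under
# the null its expected value is 1, so a large value means the $1 bet
# against the null paid off.
set.seed(42)
flips <- rbinom(20, size = 1, prob = 0.7)   # data generated under the alternative

lik_null <- prod(dbinom(flips, size = 1, prob = 0.5))
lik_alt  <- prod(dbinom(flips, size = 1, prob = 0.7))

e_value <- lik_alt / lik_null
e_value   # typically > 1 here: your $1 bet against the null grew to $E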

Some main points of e-values to be aware of:

  • An e-value can take any positive value, and you can use e-values as an alternative to p-values in hypothesis testing.
  • An e-value E is interpretable in terms of a traditional p-value p through the relation p = 1/E. Beware: it will not give you the same number as a standard p-value, but 1/E can always be treated as a (conservative) p-value.
  • In traditional tests, you have alpha, also known as the significance level; often this value is equal to 0.05. E-values work a bit differently: you can look at them as evidence against the null. The higher the e-value, the more evidence against the null.
  • At any point in time (!) you can stop data collection and draw a conclusion if you are using e-values. This works because e-values can be chained into an e-process, which ensures validity under optional stopping and allows the statistical evidence to be updated sequentially (see the sketch after this list).
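
The sketch below (plain base R with made-up numbers, not the safestats API) illustrates the last two bullets: the likelihood ratio of each new batch of data is multiplied into a running e-value, which stays valid no matter when you stop, and 1/E can always be read as a conservative p-value.

# A simple e-process for the coin example: multiply the likelihood ratio
# of every new batch into a running product. Under the null the product
# has expected value 1 at every sample size, so you may stop and report
# the evidence whenever you like.
set.seed(7)
running_e <- 1
for (batch in 1:5) {
  flips <- rbinom(10, size = 1, prob = 0.7)                      # new batch of data
  lr <- prod(dbinom(flips, 1, 0.7)) / prod(dbinom(flips, 1, 0.5))
  running_e <- running_e * lr                                    # update the evidence
  cat(sprintf("batch %d: running e-value = %.2f, p-value bound 1/E = %.4f\n",
              batch, running_e, min(1, 1 / running_e)))
}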

Fun fact: e-values are not as ‘new’ as you might think. The first paper on them was written in 1976, although the values were not called e-values at that time.

A researcher gambling against… a hypothesis?! Image created with Dall·E 3 by the author.

Why should I care about E-values?

That is a valid question. What is wrong with traditional p-values? Is there a need to replace them with e-values? Why learn something new if there is nothing wrong with the current way of testing?

Actually, there is something wrong with p-values. There is a ton of criticism of traditional p-values, and some statisticians (over 800 of them signed a call to do so) want to abandon statistical significance completely.

Let's illustrate why with a classic example.

Imagine you are a junior researcher at a pharmaceutical company. You need to test the efficacy of a medicine the company developed. You recruit test participants; half of them receive the medicine, while the other half takes a placebo. In advance, you determine how many participants you need to be able to draw conclusions, for example with a power calculation.
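
Such a sample-size calculation can be done in one line of base R; the standardized effect size of 0.5 below is an assumed, made-up value for illustration.

# How many participants per group do we need to detect a (hypothetical)
# standardized effect of 0.5 with 80% power at alpha = 0.05?
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)
# -> roughly 64 participants per group; peeking at the results before
#    reaching this number is where the trouble starts.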

The experiment starts, and you struggle a bit to find new participants. You are under time pressure, and your boss asks on a regular basis, "Do you have the results for me? We want to ship this product to the market!" Because of the pressure, you decide to peek at the results and calculate the p-value, although you haven't reached the minimum number of test candidates! Looking at the p-value, there are now two options:

  • The p-value is not significant. This means you cannot prove that the medicine works. Obviously, you don't share these results! You wait a bit longer, hoping the p-value will become significant…
  • Yes! You find a significant p-value! But what is your next step? Do you stop the experiment? Do you continue until you reach the correct number of test candidates? Do you share the results with your boss?

After you have looked at the data once, it's tempting to do it more often. You calculate the p-value, and sometimes it's significant, sometimes it isn't… It might seem innocent, but in fact you are sabotaging the process.

Significant or not? Image created with Dall·E 3 by the author.

Why is it wrong to look at the data and the corresponding p-value a few times before the experiment has officially ended? One simple and intuitive reason: if you would have acted differently depending on the result (e.g., stopping the experiment as soon as you find a significant p-value), you are messing with the process.

From a theoretical perspective: you violate the Type I error guarantee. The Type I error guarantee refers to how certain you can be that you will not mistakenly reject a true null hypothesis (i.e., find a significant result where there is none). It's like a promise about how often you'll cry wolf when there's no wolf around. The risk of this happening is at most alpha, but only for a single test! If you look at the data more often, you cannot trust this guarantee anymore: the risk of a Type I error becomes higher.
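
A small simulation (base R, made-up numbers) makes this inflation visible: both groups are drawn from the same distribution, so the null hypothesis is true, yet a researcher who peeks after every few participants rejects far more often than the nominal 5%.

# Simulate a study where the medicine has NO effect (the null is true) and
# the researcher peeks at the p-value after every 10 participants per group.
set.seed(123)
n_max <- 100
peeks <- seq(10, n_max, by = 10)
false_positive <- replicate(2000, {
  treatment <- rnorm(n_max)   # no real difference between the groups
  placebo   <- rnorm(n_max)
  any(sapply(peeks, function(n) t.test(treatment[1:n], placebo[1:n])$p.value < 0.05))
})
mean(false_positive)   # well above the promised 0.05 (roughly 0.19 in this setup)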

This relates to the multiple comparisons problem. If you run multiple independent tests of the same hypothesis, you should correct the value of alpha to keep the risk of a Type I error low. There are different ways of doing this, like the Bonferroni correction, Tukey's range test, or Scheffé's method.
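
As a quick check of the numbers in the figure below, and a minimal illustration of the Bonferroni correction, here is a base R sketch (the three p-values are hypothetical):

# Family-wise error rate for m independent tests at alpha = 0.05
alpha <- 0.05
m <- c(1, 10, 60)
1 - (1 - alpha)^m            # 0.05, ~0.40, ~0.95

# Bonferroni simply scales the p-values (equivalent to testing at alpha / m)
p_values <- c(0.004, 0.020, 0.030)            # hypothetical results of 3 tests
p.adjust(p_values, method = "bonferroni")     # 0.012, 0.060, 0.090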

The family-wise error rate for multiple independent tests. For one test it is equal to alpha. Note that for 10 tests, the error rate has increased to 40%, and for 60 tests, it's 95%. Image by author.

To summarize: p-values can be used, but it is tempting for researchers to look at the data before the planned sample size is reached. This is wrong and increases the risk of a Type I error. To guarantee the quality and robustness of an experiment, e-values are the better alternative. Because of the characteristics of e-values, you don't need to doubt these experiments (or at least doubt them less; a researcher can always decide to fabricate data).
