‘P-Hacking’ Lets Scientists Massage Results. This Method Could Nix That Loophole

Photo credit: Jonathan Kitchen - Getty Images

The pursuit of science is designed to search for significance in a maze of data. At least, that’s how it’s supposed to work.

By some accounts, that facade began to shatter in 2010 when a social psychologist from Cornell University, Daryl Bem, published a 10-year analysis in the prestigious Journal of Personality and Social Psychology, demonstrating with widely accepted statistical methods that extrasensory perception (ESP), basically the “sixth sense,” was an observable phenomenon. Bem’s peers couldn’t replicate the paper’s results, quickly blaming what we now call “p-hacking,” a process of massaging and overanalyzing your data in search of statistically significant—and publishable—results.

To support or refute a hypothesis, researchers aim to establish statistical significance by recording a “p-value” of less than 0.05, explains Benjamin Baer, a postdoctoral researcher and statistician at the University of Rochester whose recent work addresses this issue. The “p” in p-value stands for probability: it measures how likely you would be to see a result at least as extreme as the one you observed if the null hypothesis were true, that is, if chance alone were at work.

For example, if you wanted to test whether all roses are red, you would count the red roses and the roses of other colors in a sample and run a hypothesis test comparing those counts. If the test spits out a p-value of less than 0.05, you have statistically significant grounds to claim that only red roses exist, even though evidence outside your sample of flowers suggests otherwise.
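To make that concrete, here is a minimal sketch of this kind of hypothesis test in Python, using scipy’s binomial test. The sample counts and the null proportion of 0.5 are hypothetical and are only meant to illustrate how the 0.05 cutoff works, not to reproduce any analysis from the article.

    # Toy illustration of a hypothesis test and its p-value (requires scipy).
    from scipy.stats import binomtest

    red, other = 19, 1               # hypothetical sample: 19 red roses, 1 of another color
    n = red + other

    # Null hypothesis: red and non-red roses are equally common (proportion 0.5).
    result = binomtest(red, n=n, p=0.5, alternative="greater")
    print(f"p-value = {result.pvalue:.6f}")

    # A p-value below 0.05 is called "statistically significant," but it only
    # says the sample is unlikely under the null hypothesis; it does not prove
    # that roses of other colors don't exist.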

Misusing p-values to support the idea that ESP exists may be relatively harmless, but when this practice is used in medical trials, it can have much deadlier results, says Baer. “I think the big risk is that the wrong decision can be made,” he explains. “There’s this big debate happening across science and statistics, trying to figure out how to make sure that this process can happen more smoothly and that decisions are actually based on what they should be.”

Baer was the first author of a paper published at the end of 2021 in the journal PNAS, written with his former Cornell mentor, professor of statistics Martin Wells, that looked into how new statistics could improve the use of p-values. The metric they examined, called the fragility index, is designed to supplement and improve p-values.

This measure captures how sensitive a data set is to some of its data points flipping from a positive to a negative result, for example, if a patient who had been recorded as positively impacted by a drug actually felt no impact. If changing only a few of these data points is enough to demote a result from statistically significant to not, the result is considered fragile.
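Conceptually, the original version of the index can be computed by flipping outcomes one at a time and re-running a significance test until the result stops being significant. Below is a rough Python sketch of that procedure for a 2x2 trial table, following the commonly described Fisher’s-exact-test approach; the trial counts are hypothetical, and this is not the generalized index from Baer and Wells’s paper.

    # Rough sketch of a Walsh-style fragility index for a 2x2 table (requires scipy).
    from scipy.stats import fisher_exact

    def fragility_index(events_a, total_a, events_b, total_b, alpha=0.05):
        """Minimum number of outcome flips (non-event -> event, in the group
        with fewer events) needed to push the p-value to alpha or above."""
        ea, eb = events_a, events_b
        flips = 0
        while True:
            table = [[ea, total_a - ea], [eb, total_b - eb]]
            _, p = fisher_exact(table)
            if p >= alpha:
                return flips             # result is no longer significant
            if ea <= eb and ea < total_a:
                ea += 1                  # flip one patient in the smaller-event group
            elif eb < total_b:
                eb += 1
            else:
                return None              # no outcomes left to flip
            flips += 1

    # Hypothetical trial: 10/100 events on treatment vs. 25/100 on control.
    print(fragility_index(10, 100, 25, 100))

A small fragility index means only a handful of changed outcomes would erase the statistical significance, which is what flags a result as fragile.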

Photo credit: PM

In 2014, physician Michael Walsh originally proposed the fragility index in the Journal of Clinical Epidemiology. In that paper, he and his colleagues applied the fragility index to just under 400 randomized controlled trials with statistically significant results and found that one in four had low fragility scores, meaning their findings may not actually be very reliable or robust.

However, the fragility index has yet to pick up much steam in medical trials. Some critics of the approach have emerged, like Rickey Carter from the Mayo Clinic, who says it’s too similar to p-values without offering enough improvement. “The irony is the fragility index was a p-hacking approach,” Carter says.

To answer previous criticism, Baer, Wells, and colleagues focused on improving two main elements of the fragility index: counting only sufficiently likely modifications to the data, and generalizing the approach to work beyond binary 2x2 tables (which record positive or negative results for the control and experimental groups).
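As a loose illustration of what moving beyond 2x2 tables could look like, the same flip-and-retest idea can be applied to a continuous outcome judged with a t-test. This is a toy example only, not the method from the PNAS paper; all numbers, and the choice to pull outcomes toward the control mean, are assumptions made for illustration.

    # Toy illustration only: a fragility-style count for a continuous outcome,
    # judged with a two-sample t-test. Requires numpy and scipy.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    treatment = rng.normal(loc=1.5, scale=2.0, size=40)   # hypothetical outcomes
    control = rng.normal(loc=0.0, scale=2.0, size=40)

    def fragility_continuous(treat, ctrl, alpha=0.05):
        """Count how many of the most favorable treatment outcomes must be
        replaced by the control mean before the t-test stops being significant."""
        treat = np.array(treat, dtype=float)
        order = np.argsort(treat)[::-1]          # largest outcomes first
        for flips, idx in enumerate(order):
            _, p = ttest_ind(treat, ctrl)
            if p >= alpha:
                return flips
            treat[idx] = ctrl.mean()             # nudge one outcome toward "no effect"
        return None                              # significance never lost

    print(fragility_continuous(treatment, control))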

Despite the uphill battle that the fragility index has fought thus far, Baer says he still believes it’s a useful metric for medical statisticians and hopes that improvements made in their recent work will help convince others of that, too.

“Talking to the victim’s family after a surgery fails is a very different [experience] than statisticians sitting at their desks doing math,” Baer says.
