Researchers propose p-value change from 0.05 to 0.005

U.Va. scientists among those seeking to enhance reproducibility in scientific investigations

In a forthcoming research paper in the journal “Nature Human Behaviour,” a group of scientists, including University Psychology Prof. Brian Nosek, proposes lowering the p-value threshold for statistical significance from 0.05 to 0.005 in order to improve the reproducibility of scientific findings. 

According to an article written by UCLA Biostatistics Prof. Frederick Dorey and published in the journal “Clinical Orthopaedics and Related Research,” a p-value is a calculated probability that tests a null hypothesis, a statement that expresses the opposite of the hypothesis being investigated in a scientific experiment. 

Research papers that compare quantitative data between two or more experimental groups are often required to report this value before publication, Chemistry Asst. Prof. Rebecca Pompano said.

A p-value allows scientists to assess the statistical significance of their results, the notion that an experimental outcome is likely attributable to a specific cause rather than mere chance. A smaller p-value indicates stronger evidence against the null hypothesis, and results that clear a stricter cutoff are more likely to hold up when an experiment is repeated, lending credibility to its findings. 

Presently, the accepted threshold for statistical significance rests at 0.05, meaning a p-value less than 0.05 represents statistical significance. This cutoff was set largely arbitrarily by British statistician and geneticist Sir Ronald Fisher in the early 1900s. 
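To make the idea concrete, a p-value can be estimated directly by simulation. The sketch below is a hypothetical illustration, not code from the researchers' paper, and the measurements are invented for the example. It uses a permutation test: it repeatedly shuffles the pooled data between two groups and asks how often chance alone produces a difference in means at least as large as the one actually observed.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_perm=10_000, seed=0):
    """Estimate a two-sided p-value: the probability of seeing a mean
    difference at least this large if the null hypothesis (no real
    difference between groups) were true."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = group_a + group_b
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # relabel the data at random
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    return hits / n_perm

# Made-up measurements for two well-separated groups.
control = [4.8, 5.1, 5.0, 4.9, 5.2, 4.7, 5.0, 4.9]
treated = [5.6, 5.9, 5.7, 6.0, 5.8, 5.5, 5.9, 5.7]
p = permutation_p_value(control, treated)
print(p)
```

Because these invented groups are far apart, almost no random shuffle matches the observed gap, so the estimated p-value lands well below both the 0.05 and the proposed 0.005 cutoffs.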

“Sir Ronald Fisher proposed it in one of his articles or books,” Statistics Prof. and Chair of Statistics Karen Kafadar said in an email to The Cavalier Daily. “As I recall, he tossed it off as ‘If the probability of observing our data under our hypothesis is less than 0.05, we might consider that to be statistically significant.’ And that 0.05 seems to have stayed with us ever since.” 

A recent paper by a group of researchers from numerous academic institutions, including the University of Southern California, Duke University, the University of Amsterdam, the University of Pennsylvania, Harvard University, Stanford University and the University of Virginia, challenges that longstanding 0.05 cutoff. 

“The lack of reproducibility of scientific studies has caused growing concern over the credibility of claims of new discoveries based on ‘statistically significant’ findings,” the paper, released as a preprint article on PsyArXiv last month, said. “For fields where the threshold for defining significance for new discoveries is P < 0.05, we propose a change to P < 0.005. This simple step would immediately improve the reproducibility of scientific research in many fields.”

This proposal seeks to encourage strength of evidence by calling probability values less than 0.005 “significant” and those between 0.05 and 0.005 “suggestive,” Nosek said in an email to The Cavalier Daily.

Current scientific literature varies in reliability between fields and between research journals, the primary outlets for study publications. Lower-quality journals commonly publish untrustworthy papers, but so do some elite journals, where investigators may cherry-pick data to present a case as more scientifically elegant than reality. These problems can stem from a scientist's lack of knowledge and proficiency in their field, or from the pull of career advancement and economic incentives, which often reward larger numbers of publications, Biology Prof. Paul Adler said. 

According to Pompano, the benefits of a stricter significance cutoff could include fewer false findings entering the scientific literature. A lowered threshold could also reduce “p-hacking,” Asst. Biology Prof. Alan Bergland said. 

“In p-hacking, people can use websites or programs to find correlations between variables in their experiments, and this allows them to contort their results to fit their desired narrative,” Bergland said. “You can plot different variables against each other and come across correlations that are completely nonsense, but related. P-hacking would still be possible even if the threshold was lowered to 0.005, but certainly harder.” 
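The chance findings a stricter threshold targets can be demonstrated with a short simulation; this is a hypothetical sketch, not code from any study mentioned here. It runs many experiments in which the null hypothesis is true by construction, then counts how often each cutoff is crossed purely by luck.

```python
import math
import random

def z_test_p(a, b):
    """Two-sided p-value for a difference in means, using a normal
    approximation (a real analysis would use a t-test; this keeps
    the sketch dependency-free)."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = sum(a) / n_a, sum(b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n_b - 1)
    z = (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)
    return math.erfc(abs(z) / math.sqrt(2))

rng = random.Random(42)
trials = 1000
hits_05 = hits_005 = 0
for _ in range(trials):
    # Both groups come from the SAME distribution, so the null
    # hypothesis is true and any "significant" result is a false positive.
    a = [rng.gauss(0.0, 1.0) for _ in range(20)]
    b = [rng.gauss(0.0, 1.0) for _ in range(20)]
    p = z_test_p(a, b)
    if p < 0.05:
        hits_05 += 1
    if p < 0.005:
        hits_005 += 1

print(hits_05 / trials, hits_005 / trials)
```

Roughly one experiment in 20 clears the 0.05 bar by chance alone, which is exactly what a p-hacker exploits by testing many variables; far fewer clear 0.005.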

While the change in p-value could, to some extent, increase the reproducibility of data, researchers worry it could also inhibit scientific progress. A p-value of 0.005 is difficult to obtain with small sample sizes, which are common in pilot studies, human clinical trials and, for ethical reasons, experiments on live mammalian specimens, Pompano said. Ultimately, according to Adler, lowering the threshold would increase expenses, lengthen experiments and produce more false negatives, or results that incorrectly indicate the absence of a real effect, within data. 
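The trade-off Pompano and Adler describe can be seen in a similar hypothetical simulation, again not drawn from any of the studies mentioned: when a genuine but modest effect is studied with small samples, a 0.005 cutoff misses it far more often than 0.05 does.

```python
import math
import random

def z_test_p(a, b):
    """Two-sided p-value for a difference in means, using a normal
    approximation (a t-test would be standard; this stays stdlib-only)."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = sum(a) / n_a, sum(b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n_b - 1)
    z = (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)
    return math.erfc(abs(z) / math.sqrt(2))

rng = random.Random(7)
trials = 1000
found_05 = found_005 = 0
for _ in range(trials):
    # A real effect exists (means differ by half a standard deviation),
    # but the samples are small, as in a pilot study.
    control = [rng.gauss(0.0, 1.0) for _ in range(15)]
    treated = [rng.gauss(0.5, 1.0) for _ in range(15)]
    p = z_test_p(control, treated)
    if p < 0.05:
        found_05 += 1
    if p < 0.005:
        found_005 += 1

print(found_05 / trials, found_005 / trials)
```

The stricter cutoff detects the real effect in a much smaller fraction of these underpowered experiments, which is the increase in false negatives Adler warns about; recovering the lost detections requires larger, costlier samples.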

Additionally, although a p-value can determine statistical significance, it is unable to predict the applicability of experimental data to human life. 

“It cannot tell you if the ‘model’ for your data is right, or if your sample is representative of the population, or the probability that your hypothesis is true,” Kafadar said. “It can only tell you how consistent are your data with your hypothesis, assuming both that the sample is representative of the population and the model you are using is correct. If neither of those assumptions is true, the p-value may be misleading.” 

Because of such limits of the p-value, Adler and Pompano believe errors in experimental design, the setup of a procedure undertaken to test a hypothesis, are a more immediate source of defects in scientific validity. Both professors said a p-value change is unnecessary. 

“Essentially, you can’t just look at a p-value and decide if the results are reproducible. You have to look at the question being asked and if the experimental design that was being performed actually allows you to answer that question at all,” Pompano said. “And then, does the data support the answer that the author has concluded? I think the p-value alone is one small piece of assessing the conclusion of the experiment.” 

In fields whose hypotheses are not simple yes-or-no comparisons, such as experimental physics, p-values are rarely used and are therefore unrelated to reproducibility errors. Instead, systematic uncertainties, such as differences in machinery and ill-defined experimental design, play the larger role in failures to reproduce results.

According to Physics Prof. Blaine Norum, reproducibility errors often encountered in physics are due to differing equipment types and apparatus setup from lab to lab. 

“The question is not a statistical question, but a question of systematic uncertainties — that is, machinery or experimental design — which are not addressed by a p-value,” Norum said. “How equipment is set up, how one configures it to get measurements varies between people, leading to reproducibility errors from lab to lab. A p-value is a statistically derived quantity, and it doesn’t address those issues.” 

Some researchers say inconsistencies in published scientific data stem from flaws in the career structure of science, namely an unstable job market and the immensely difficult nature of discovery, rather than from statistical analysis. 

“In the structure of science, at least American science, a lot of the research is done by graduate students and post-doctoral fellows, so the only way for a faculty member to be successful and keep getting papers and grants is to have lots of people working for them – there’s a selective advantage to that,” Adler said. “But that only fuels the oversupply of scientists, meaning you have too many people chasing too few grant awards and people publishing less reliable data just for the sake of publishing a paper. And these problems are much more serious than the p-value.”
