Amid a replication crisis in social science research, six-year study validates open science methods

After a series of high-profile research findings failed to hold up to scrutiny, a replication crisis rocked the social-behavioral sciences and triggered a movement to make research methods more rigorous. A six-year effort to test these emerging methods, led by labs at UC Santa Barbara, UC Berkeley, Stanford, and the University of Virginia, has shown they can produce new and highly replicable findings. The paper, published in the journal Nature Human Behaviour and co-authored by Berkeley Haas Professor Leif Nelson, is the strongest evidence to date that the methodological reform movement known as open science can lead to more reliable research results.

Roughly two decades ago, a community-wide reckoning emerged concerning the credibility of published literature in the social-behavioral sciences, especially psychology. Several large-scale studies attempted to reproduce previously published findings and either failed or found effects of much smaller magnitude, calling the credibility of the findings, and of future studies in the social-behavioral sciences, into question.

To confront this crisis, a handful of top experts in the field set out to test whether emerging research practices can produce more reliable results. Over six years, researchers at labs from UC Santa Barbara, UC Berkeley, Stanford University, and the University of Virginia discovered and replicated 16 novel findings with ostensibly gold-standard best practices, including pre-registration, large sample sizes, and replication fidelity. 

Their findings, published in the journal Nature Human Behaviour, suggest that with high rigor, high replicability is achievable. 

“The major finding is that when you follow current best practices in conducting and replicating online social-behavioral studies, you can accomplish high and generally stable replication rates,” said UC Santa Barbara Distinguished Professor Jonathan Schooler, director of UCSB’s META Lab and the Center for Mindfulness and Human Potential, and senior author of the paper.

Their study’s replication findings were 97% the size of the original findings on average. By comparison, prior replication projects observed replication findings that were roughly 50% the size of the originals.

The paper’s principal investigators were John Protzko of UCSB’s META Lab and Central Connecticut State University, Jon Krosnick of Stanford’s Political Psychology Research Group, Leif Nelson at the Haas School of Business, UC Berkeley, and Brian Nosek, who is affiliated with the University of Virginia and is the Executive Director of the Center for Open Science.

“There have been a lot of concerns over the past few years about the replicability of many sciences, but psychology was among the first fields to start systematically investigating the issue,” said lead author Protzko, who is a research associate in Schooler’s lab, where he was a postdoctoral scholar during the study. He is now an assistant professor of psychological science at Central Connecticut State University.

The question, Protzko said, was whether past replication failures and declining effect sizes are inherently built into the scientific domains that have observed them. “For example, some have speculated that it is an inherent aspect of the scientific enterprise that newly discovered findings can become less replicable or smaller over time,” he said. 

The researchers decided to perform new studies using emerging best practices in open science—and then to replicate them with an innovative design in which they committed to replicating the initial studies regardless of outcome.

“It is important to test the replicability of all outcomes,” said Nelson, the Ewald T. Grether Professor in Business Administration & Marketing at the Haas School of Business, UC Berkeley. “Scientists and scientific journals will always prioritize emphasizing newly confirmed hypotheses, but consumers of science care just as much about the hypotheses that were not confirmed. We should care about the replicability of both outcomes.”

Replicating 16 new discoveries

Over the course of six years, research teams at each lab developed studies that were then replicated by all of the other labs. In total, the coalition discovered 16 new phenomena and replicated each of them four times, in studies involving 120,000 participants.

Across the board, the project revealed extremely high replicability rates of their social-behavioral findings, and no statistically significant evidence of decline over repeated replications. Given the sample sizes and effect sizes, the observed replicability rate of 86%, based on statistical significance, could not have been any higher, the researchers pointed out.  

They also ran several follow-up surveys to test the novelty of their discoveries. “It would not be particularly interesting to discover that it is easy to replicate completely obvious findings,” Schooler said. “But our studies were comparable in their surprise factor to studies that have been difficult to replicate in the past. Untrained judges who were given summaries of the two conditions in each of our studies and a comparable set of two-condition studies from a prior replication effort found it similarly difficult to predict the direction of our findings relative to the earlier ones.” 

Indeed, many of the newly discovered findings have already been independently published in high-quality journals.

Because each research lab developed its own studies, they came from a variety of social, behavioral, and psychological fields such as marketing, political psychology, prejudice, and decision making. They all involved human subjects and adhered to certain constraints, such as not using deception. “We really built into the process that the individual labs would act independently,” Protzko said. “They would go about their sort of normal topics they were interested in and how they would run their studies.”  

Collectively, their meta-scientific investigation provides evidence that low replicability and declining effects are not inevitable, and that rigor-enhancing practices can lead to very high replication rates. Even so, identifying which practices work best will take further study. This study’s “kitchen sink” approach—using multiple rigor-enhancing practices at once—didn’t isolate any individual practice’s effect. 

The additional investigators on the study are Jordan Axt (Department of Psychology, McGill University, Montreal, Canada); Matt Berent (Matt Berent Consulting); Nicholas Buttrick (Department of Psychology, University of Wisconsin-Madison); Matthew DeBell (Institute for Research in Social Sciences, Stanford University); Charles R. Ebersole (Department of Psychology, University of Virginia); Sebastian Lundmark (The SOM Institute, University of Gothenburg, Sweden); Bo MacInnis (Department of Communication, Stanford University); Michael O’Donnell (McDonough School of Business, Georgetown University); Hannah Perfecto (Olin School of Business, Washington University in St. Louis); James E. Pustejovsky (Educational Psychology Department, University of Wisconsin-Madison); Scott S. Roeder (Darla Moore School of Business, University of South Carolina); and Jan Walleczek (Phenoscience Laboratories, Berlin, Germany).