Every day, hospitals and pharmaceutical regulators make critical decisions about which drugs work best, which treatments are safest, and which therapies deserve insurance coverage. Increasingly, these decisions rely on data from actual patients in routine clinical practice rather than from controlled clinical trials. But a sweeping investigation that attempted to reproduce 150 published studies reveals a troubling truth: many of these real-world findings are difficult to reproduce, and the problem often starts with how the methods are described rather than with the science itself.
The findings expose not a failure of scientific integrity, but rather a breakdown in something far more basic: scientists aren't explaining their work clearly enough for others to replicate it.
The Stakes of Real-World Evidence
Real-world evidence has become indispensable in modern medicine. Unlike randomized controlled trials, which are expensive and time-consuming, real-world studies extract insights from the vast digital footprints of millions of patients—their prescriptions, diagnoses, hospital visits, and outcomes all recorded in massive healthcare databases. This approach proved especially valuable during the COVID-19 pandemic, when researchers needed rapid answers about which treatments worked.
The power of this approach is undeniable. But that power comes with a price. When thousands of lines of computer code must translate raw data into meaningful medical conclusions, the room for hidden assumptions and misunderstandings multiplies.
"These studies can produce evidence on the effectiveness of medical interventions in clinical practice," researchers explained in their analysis. "But reproducibility of findings is essential to have confidence in decision-making."
A Systematic Test of Reproducibility
A team of investigators set out to answer a straightforward but rarely tested question: Can you reproduce the results of published real-world evidence studies if you have access to the same data and follow the same methods?
The answer turned out to be nuanced. Overall, the study showed strong agreement between original findings and reproduction attempts. When researchers compared effect sizes from original studies to their reproductions, the Pearson correlation coefficient came in at 0.85, a strong relationship suggesting most studies do hold up reasonably well.
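For readers who want to see what such a comparison involves, here is a minimal sketch of a correlation between original and reproduced effect sizes. The hazard ratios are hypothetical and are not taken from the paper. Comparing them on the log scale is an assumption of this sketch, and the helper name `pearson` is ours.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical hazard ratios for five studies (illustration only).
original_hr = [0.80, 1.20, 1.50, 0.95, 2.10]
reproduced_hr = [0.85, 1.10, 1.70, 1.00, 1.60]

# Assumption: work on the log scale so that a halving (0.5) and a
# doubling (2.0) of risk sit at equal distances from "no effect" (1.0).
log_original = [math.log(hr) for hr in original_hr]
log_reproduced = [math.log(hr) for hr in reproduced_hr]

r = pearson(log_original, log_reproduced)

# Relative magnitude per study: original estimate divided by reproduction.
relative_magnitude = [o / p for o, p in zip(original_hr, reproduced_hr)]

print(f"Pearson r = {r:.2f}")
print("Relative magnitudes:", [round(m, 2) for m in relative_magnitude])
```

A value near 1 means the reproductions track the originals closely across the whole set. It says nothing about how far any single study drifted, which is why the per-study ratios are worth looking at as well.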
Yet the details revealed cracks in the foundation.
For about one-fifth of the studies, the reproduced sample sizes were less than half or more than double the original. Some studies were intended to measure certain outcomes, but subtle ambiguities in how those outcomes were defined led to dramatically different results. In 11 percent of cases, the reproduced outcome risk differed by more than 10 percentage points from what the original authors reported.
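The two thresholds described above are simple enough to write down. The sketch below is one hedged reading of them: the function names and the example numbers are hypothetical, and the paper may have applied the rules differently at the boundaries.

```python
def sample_size_diverges(original_n, reproduced_n):
    """True if the reproduced cohort is less than half or more than
    double the size of the original cohort."""
    if original_n <= 0:
        raise ValueError("original_n must be positive")
    ratio = reproduced_n / original_n
    return ratio < 0.5 or ratio > 2.0


def outcome_risk_diverges(original_risk_pct, reproduced_risk_pct,
                          threshold_points=10.0):
    """True if the outcome risks (in percent) differ by more than
    `threshold_points` percentage points."""
    return abs(reproduced_risk_pct - original_risk_pct) > threshold_points


# Hypothetical study: 1,000 patients originally, 450 on reproduction;
# outcome risk reported as 5.0 percent, reproduced as 17.5 percent.
print(sample_size_diverges(1000, 450))
print(outcome_risk_diverges(5.0, 17.5))
```

Note that the second check uses percentage points, an absolute difference, not a relative change: a move from 5 percent to 17.5 percent is 12.5 points.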
Perhaps most troubling, the reproduction team discovered that no single reporting problem stood out as the primary culprit. Instead, reproducibility failures were consistently multifactorial—the result of accumulated ambiguities piling up.
The Clarity Crisis
When the investigators examined the published studies for reporting clarity, the gaps became stark. This review covered 250 studies, a larger set than the 150 the team reproduced. Only 54 percent of them included a simple flow diagram or attrition table showing how many patients were included or excluded at each step. Only 8 percent provided a design diagram to clarify the overall structure of the research.
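An attrition table is not hard to produce if the counts are recorded as each criterion is applied. The following is a minimal sketch with made-up patients and made-up inclusion criteria; none of the field names or cutoffs come from the studies in the analysis.

```python
def attrition_table(patients, criteria):
    """Apply inclusion criteria in order and record, for each step,
    how many patients remain and how many were excluded."""
    rows = [("Starting population", len(patients), 0)]
    remaining = list(patients)
    for label, keep in criteria:
        kept = [p for p in remaining if keep(p)]
        rows.append((label, len(kept), len(remaining) - len(kept)))
        remaining = kept
    return rows


# Hypothetical patient records (illustration only).
patients = [
    {"age": 72, "prior_use": False, "enrolled_days": 400},
    {"age": 45, "prior_use": False, "enrolled_days": 200},
    {"age": 68, "prior_use": True, "enrolled_days": 500},
    {"age": 80, "prior_use": False, "enrolled_days": 90},
    {"age": 66, "prior_use": False, "enrolled_days": 365},
]

# Hypothetical criteria; the order matters and should be reported too.
criteria = [
    ("Age 65 or older", lambda p: p["age"] >= 65),
    ("No prior use of study drug", lambda p: not p["prior_use"]),
    ("At least 180 days of enrollment", lambda p: p["enrolled_days"] >= 180),
]

table = attrition_table(patients, criteria)
for label, remaining_n, excluded_n in table:
    print(f"{label:<35} remaining={remaining_n} excluded={excluded_n}")
```

With counts like these published, a reproduction team can see at which step its own cohort starts to differ from the original, instead of only noticing a mismatch in the final number.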
Even basic information frequently went unreported. When studies measured how long patients used a medication, the specific algorithms used to calculate that duration appeared in only 55 percent of papers. When researchers needed to define which patients qualified for inclusion, the exact criteria often remained ambiguous.
The median study in the analysis required the reproduction team to make assumptions about four different major categories of methodology to proceed. In other words, for half of the studies, reproducers had to guess about at least four categories of fundamental study design choices.
"Even studies that were closely reproduced often required considerable discussion within the team," the researchers noted, "sometimes with many assumptions about the original implementation decisions due to ambiguity in the methods description."
When Numbers Don't Match
Four detailed case studies illustrate how easily real-world evidence can diverge.
One study on chronic obstructive pulmonary disease left unclear whether certain disease-confirming tests needed to be recorded before or after the diagnosis date that defined a patient's entry into the study. When the reproduction team made slightly different assumptions about this timing, the final sample size changed by 26 percent.
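To see how a timing ambiguity of this kind changes who counts as a study participant, consider a small sketch. The patients, dates, and the 365-day windows are all hypothetical; the original study's actual windows are exactly what was unclear.

```python
from datetime import date, timedelta


def has_confirming_test(index_date, test_dates, days_before, days_after):
    """True if any test falls within the window around the index date
    (the diagnosis date that marks study entry), inclusive of both ends."""
    start = index_date - timedelta(days=days_before)
    end = index_date + timedelta(days=days_after)
    return any(start <= d <= end for d in test_dates)


# Hypothetical patients: (index date, dates of confirming tests).
patients = {
    "A": (date(2015, 3, 1), [date(2015, 2, 10)]),   # test before diagnosis
    "B": (date(2015, 6, 15), [date(2015, 7, 1)]),   # test after diagnosis
    "C": (date(2016, 1, 20), [date(2016, 1, 20)]),  # test on the same day
    "D": (date(2016, 5, 5), []),                    # no test on record
}


def cohort(days_before, days_after):
    """IDs of patients with a confirming test inside the given window."""
    return sorted(
        pid
        for pid, (index_date, tests) in patients.items()
        if has_confirming_test(index_date, tests, days_before, days_after)
    )


# Reading 1: the test must be on or before the diagnosis date.
before_only = cohort(days_before=365, days_after=0)
# Reading 2: the test may fall on either side of the diagnosis date.
either_side = cohort(days_before=365, days_after=365)

print("Before only:", before_only)
print("Either side:", either_side)
```

Both readings are defensible interpretations of a sentence such as "patients with a confirming test," yet they admit different patients. Stating the window explicitly in the methods removes the guesswork.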
In another case involving breast cancer patients, the authors mentioned using a modified version of a standard comorbidity score but never explained what modifications they made. The original study reported that 97 percent of patients had a certain score, but the reproduction team found only 12 percent had that same score. Only when they hypothesized that the authors had removed tumor-related components from the score did the numbers nearly align.
A third study on benzodiazepines and death showed that the reproduced cohort was older and sicker than the original, with death rates 13 to 16 per 100 person-years higher. Investigation revealed the culprit: the data provider had retroactively updated historical years of data, changing the underlying population the study measured.
In a fourth case, an atrial fibrillation study left ambiguous whether certain diagnostic codes should come from inpatient settings only or include all settings. The two approaches produced effect estimates that differed by a factor of 2.3.
Why It Matters for Healthcare Decisions
Real-world evidence now influences decisions that shape healthcare for millions. Regulatory agencies, insurance companies, and hospital systems use these studies to decide which drugs to approve, which treatments to cover, and which therapies to recommend.
"Decision-makers seeking to synthesize real-world evidence to inform their regulatory, policy, or coverage decisions must devote substantial effort to parsing and evaluating the validity of the science behind the results," the researchers wrote. The problem is that most decision-makers lack the time or expertise to detect where studies went wrong.
When different interpretations of the same data can produce significantly different results, confidence in those findings erodes. This becomes particularly acute when published studies conflict with each other, or when a dramatic finding later fails to replicate.
The Path Forward
The good news is that reproducibility failures are not primarily scientific mistakes. Rather, they stem from incomplete communication. The solution is clearer, more detailed reporting.
International efforts are already underway. Multiple organizations have developed frameworks and templates designed to improve transparency in real-world evidence studies. Some proposed requirements include detailed descriptions of data sources and versions, exact specifications of study design choices, complete measurement algorithms for outcomes and covariates, and full documentation of statistical methods.
Just as randomized controlled trials now require detailed protocols and statistical analysis plans filed before research begins, real-world studies could benefit from similar transparency requirements. Some professional organizations have pushed for routine registration of hypothesis-testing real-world studies, similar to the public registration requirement for clinical trials.
The researchers found that when they contacted original authors to discuss their reproduction attempts, about half provided helpful clarification. This suggests that many authors simply assumed their descriptions were clear, rather than intentionally hiding methodological details.
A Calibration Point
This investigation, the largest systematic evaluation of real-world evidence reproducibility ever conducted, provides what the researchers call "an important calibration metric." It doesn't suggest that real-world evidence is unreliable or that these studies should be dismissed. Rather, it maps the terrain of where problems occur and why.
The correlation of 0.85 between original and reproduced effect sizes suggests that real-world evidence can produce robust findings. But the correlation falls short of a perfect 1.0, and the subset of studies that didn't reproduce closely represents a meaningful risk for healthcare decision-makers who don't dig beneath the surface.
What these researchers have exposed is not a crisis of dishonesty, but a transparency problem. When scientists explain their work poorly, even well-intentioned reproducers with access to the same data cannot reconstruct the original analysis.
"Greater methodological transparency aligned with new guidance may further improve reproducibility and validity assessment, thus facilitating evidence-based decision-making," the researchers concluded.
For a healthcare system increasingly reliant on real-world evidence, that transparency is no longer optional.
Credit & Disclaimer: This article is a popular science summary written to make peer-reviewed research accessible to a broad audience. All scientific facts, findings, and conclusions presented here are drawn directly and accurately from the original research paper. Readers are strongly encouraged to consult the full research article for complete data, methodologies, and scientific detail. The article can be accessed through https://doi.org/10.1038/s41467-022-32310-3






