Count the samples. That's the fundamental constraint in omics research.
Proteomics, metabolomics, transcriptomics—these technologies can measure thousands of biological molecules from a single blood draw. The instruments work beautifully. The chemistry is exquisite. The data pour out in high-dimensional torrents.
But each sample costs money. Each patient requires consent, coordination, collection. Clinical studies end up with 50, 100, maybe 200 participants if they're well-funded. The analysis tools—built for big data—choke on these small cohorts. Models overfit. Discoveries don't replicate. Patterns that look significant turn out to be noise.
A new framework flips this constraint on its head by recognizing that those 50 or 100 patients with molecular data also have something else: medical records. And so do thousands of other patients at the same hospital who never donated blood for research.
The system is called COMET—Clinical and Omics Multimodal Analysis Enhanced with Transfer Learning. It uses electronic health records from tens of thousands of patients to pretrain a neural network, then applies that knowledge to analyze molecular data from much smaller study cohorts.
In tests on two independent datasets—one predicting when pregnant patients would go into labor, another predicting cancer survival—COMET substantially outperformed traditional analysis methods. More importantly, it identified biologically meaningful proteins that aligned with known medical science, while baseline methods latched onto spurious correlations.
The Training Data Imbalance
Molecular biology moves fast. Electronic health records accumulate slowly, one patient at a time, one diagnosis, one prescription, one lab test.
But they accumulate for everyone who walks through the door. A major hospital system might have EHR data for hundreds of thousands of patients. Research budgets might fund proteomics assays for a few dozen.
This creates a lopsided data landscape. Clinical codes—the standardized labels for diagnoses, procedures, medications—exist in abundance. Molecular measurements exist for a privileged subset.
Traditional multimodal machine learning methods struggle with this imbalance. Early fusion approaches combine features before analysis but require complete data across all modalities. Late fusion methods analyze each data type separately then combine predictions, which works with missing data but can't learn interactions between modalities.
COMET's insight: use the abundant EHR data first, learn what you can, then transfer that knowledge to the problem where both data types exist.
Transfer Learning Architecture
The framework has three components: embedding longitudinal EHR data, pretraining an EHR-only model, then transferring weights to a multimodal architecture.
Electronic health records aren't designed for machine learning. They're lists of events: a diagnosis on Tuesday, a prescription on Friday, a lab test the following week. Different events happen at different times. Some patients have dense records with daily entries. Others have sparse data from occasional visits.
COMET treats these temporal sequences like sentences in natural language. Each day becomes a "sentence" where different medical events are "words." The system uses word2vec—a technique from computational linguistics—to learn embeddings for each medical code based on which other codes tend to occur around the same time.
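As a minimal stdlib-only illustration of that preprocessing step (the event stream, dates, and codes below are invented placeholders), daily events can be grouped into code "sentences" ready for an embedding trainer such as gensim's Word2Vec:

```python
from collections import defaultdict

# Hypothetical event stream for one patient: (day, code) pairs.
# The codes are invented ICD/LOINC/RxNorm-style placeholders.
events = [
    (0, "ICD10:O24.4"), (0, "LOINC:4548-4"),
    (3, "RXNORM:860975"), (3, "ICD10:O24.4"),
    (7, "LOINC:14771-0"),
]

# Group events by day: each day becomes one "sentence" whose "words"
# are the medical codes recorded that day.
by_day = defaultdict(list)
for day, code in events:
    by_day[day].append(code)

sentences = [by_day[d] for d in sorted(by_day)]
print(sentences)
# [['ICD10:O24.4', 'LOINC:4548-4'], ['RXNORM:860975', 'ICD10:O24.4'], ['LOINC:14771-0']]
```

Training a word2vec model on these sentences yields one vector per code, placing codes that tend to co-occur on the same days near each other in embedding space.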
These embeddings get fed into a recurrent neural network that processes the temporal sequence. The RNN learns patterns: which combinations of diagnoses, medications, and test results tend to precede which outcomes.
This happens during pretraining, using only patients who have EHR data. For the pregnancy study, that meant 30,843 patients. For the cancer study, 36,342 patients.
Once pretrained, the RNN weights get transferred into a larger multimodal network. This network has three branches: one processing EHR data through the pretrained RNN, one processing molecular measurements through a simple feed-forward network, and one combining both data types.
The key architectural choice: freeze the transferred RNN weights. Don't update them during the multimodal training phase. This forces the molecular and joint branches to adapt to what the EHR branch already knows, rather than allowing everything to shift.
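A minimal PyTorch sketch of this three-branch design with the transferred EHR branch frozen (the GRU choice, layer sizes, and names here are illustrative assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class CometStyleNet(nn.Module):
    """Sketch of a three-branch multimodal network (assumed shapes)."""
    def __init__(self, code_dim=64, n_proteins=1317, hidden=32):
        super().__init__()
        # EHR branch: RNN over embedded daily code vectors; pretrained
        # weights would be loaded into this module, then frozen.
        self.ehr_rnn = nn.GRU(code_dim, hidden, batch_first=True)
        # Omics branch: simple feed-forward network over protein values.
        self.omics = nn.Sequential(nn.Linear(n_proteins, hidden), nn.ReLU())
        # Joint branch: combines both representations into one prediction.
        self.joint = nn.Linear(2 * hidden, 1)

    def forward(self, ehr_seq, proteins):
        _, h = self.ehr_rnn(ehr_seq)           # final hidden state summarizes EHR
        z = torch.cat([h[-1], self.omics(proteins)], dim=-1)
        return self.joint(z)

net = CometStyleNet()
# net.ehr_rnn.load_state_dict(pretrained_state)  # transfer step (omitted)
for p in net.ehr_rnn.parameters():               # freeze the transferred branch
    p.requires_grad = False
```

During the multimodal phase, only the omics and joint branches receive gradient updates; the frozen branch still shapes them, because its fixed representation feeds every forward pass.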
Predicting Labor
The first test case was straightforward clinically but technically challenging: predict how many days until a pregnant patient goes into labor.
The research team collected blood samples from 61 patients throughout the last 100 days of pregnancy. Each sample was analyzed for 1,317 proteins using an aptamer-based platform. The EHR data came from Stanford Health Care's OMOP database, covering diagnoses, procedures, medications, lab results, and vital signs from the beginning of pregnancy up to the sampling time.
For the 30,843 patients without proteomics data, the system randomly sampled a timepoint in the last 100 days, used EHR data up to that point, and predicted days from that timepoint until delivery. This created a synthetic task that matched the structure of the real problem.
After pretraining on those 30,843 patients, the system was fine-tuned on the 61 patients with both EHR and proteomics data.
The Pearson correlation between predicted and actual days to labor was 0.868—strong enough for clinical utility. Root mean square error was 16 days.
Baseline comparisons showed where the improvement came from. Using only EHR data: r = 0.768. Using only proteomics: r = 0.796. Using both data types without pretraining (the "joint baseline"): r = 0.815.
The numbers tell part of the story. The biology tells the rest.
Finding Real Signal
When the team examined which proteins COMET considered important, they found molecules with established roles in pregnancy biology: interleukin-1 receptor-like 1, cystatin C, plexin-B2. These proteins are involved in immune regulation, placental development, and the inflammatory cascade that triggers labor.
The joint baseline—using the same architecture but without EHR pretraining—gave high importance to different proteins: soluble intercellular adhesion molecule 1, leucine-rich repeat transmembrane neuronal protein 1, angiopoietin-4. The first two have no known connection to pregnancy timing. Only angiopoietin-4 has documented relevance.
To validate this objectively, the researchers checked an external dataset of pregnant patients with proteomics measurements. The proteins COMET flagged as important showed stronger correlations with days to delivery (average |r| = 0.22) compared to proteins the baseline emphasized (average |r| = 0.12).
The pattern held when examining how EHR and proteomics data aligned. COMET's RNN representation of the EHR data showed 5,364 significant correlations with individual proteins. The baseline showed 3,201. More alignment means the model learned to represent clinical state in ways that connect to underlying molecular biology.
Specific examples revealed the mechanism. The protein interleukin-1 receptor-like 1 correlated with 76% of the RNN's latent dimensions in COMET but showed scattered correlations in the baseline. This protein is a known marker of preterm birth risk and labor onset timing. COMET's representation captured that biology; the baseline's didn't.
Cancer Prognosis
The second validation used an entirely different clinical problem: predicting which cancer patients would die within three years.
Data came from UK Biobank—36,901 patients with any cancer diagnosis, 559 of whom had blood samples assayed at enrollment (a median of 2,894 proteins measured per patient). Mortality was 5.5% in the omics cohort.
COMET achieved an AUROC of 0.842 and AUPRC of 0.504. The joint baseline reached 0.786 and 0.365 respectively. EHR-only: 0.749 and 0.205. Proteomics-only: 0.737 and 0.325.
Again, the biological interpretability distinguished COMET from baselines. Important proteins in COMET models included established cancer biomarkers: CEACAM5 (carcinoembryonic antigen-related cell adhesion molecule 5), KRT19 (cytokeratin 19), SDC1 (syndecan-1). These proteins are used clinically for cancer detection and monitoring.
The baseline models emphasized different proteins without clear prognostic relevance.
External validation on breast cancer proteomics data confirmed the pattern: 9 of 18 matching proteins from COMET's important set showed statistically significant associations with mortality, versus 8 of 18 from the baseline set. The median p-value for COMET's proteins was also lower than for the baseline's.
The cancer cohort showed less overlap between EHR and proteomics modalities than the pregnancy cohort—65.9% of proteins had no significant correlations with any EHR features. This makes biological sense: cancer is far more molecularly heterogeneous than pregnancy, which unfolds along a largely shared temporal program, and the proteomics captures tumor-specific information that doesn't appear in routine clinical records.
Yet COMET still improved performance, suggesting the framework works across varying degrees of modality complementarity.
Regularization Through Initialization
How does pretraining on EHR data improve the analysis of molecular data?
The research team traced the effect through several analyses. First, they checked whether improvements occurred only in the EHR branch of the network or spread to other components. Predictions from intermediate nodes in the network—before final combination—revealed that both the proteomics branch and the joint branch performed better with COMET, not just the EHR branch.
This means the pretrained weights influence the entire network through backpropagation. When gradients flow backward during training, they pass through the frozen EHR representation, constraining how the proteomics and joint branches can adapt.
Second, they examined the relationship between training loss and test loss across epochs. In typical overfitting, training loss decreases while test loss increases or plateaus. COMET showed lower test loss for any given training loss compared to baselines—the signature of regularization.
Third, they visualized the parameter space. For each model at each training epoch, they collected all network outputs (including intermediate nodes) for all data points and reduced this to two dimensions using t-SNE. This creates a functional representation—networks with similar outputs cluster together regardless of actual parameter values.
COMET models and baseline models occupied different regions of this space. The trajectories through training also differed. Baselines wandered. COMET converged toward a specific region associated with better generalization and biological accuracy.
The pretrained weights act as an anchor. They initialize the network in a part of parameter space that already captures meaningful clinical patterns. The molecular data fine-tunes from there rather than searching from scratch.
Beyond Binary Labels
Traditional omics studies divide patients into cases and controls. Disease versus healthy. Responders versus non-responders. Survivors versus deceased.
This binary reduction discards information. Two patients might both be "cases" but differ substantially in disease severity, comorbidities, treatment history, trajectory. The molecular biology differs too.
EHR data captures this heterogeneity. Diabetes with complications looks different from well-controlled diabetes. Early-stage cancer differs from metastatic disease. Recent infection differs from chronic inflammation.
COMET learns these distinctions during pretraining, then applies them when analyzing molecular data. A protein elevated in one patient might mean something different than the same elevation in another patient with a different clinical context.
The cancer analysis illustrated this. Some patients had complex multi-system disease. Others had isolated tumors. Some received aggressive treatment. Others chose palliative care. These clinical differences influenced the molecular signatures and their prognostic meaning.
By incorporating this context, COMET moves beyond simple correlation ("this protein is high in people who died") toward conditional interpretation ("this protein matters most in patients with these clinical features").
Generalizability Tests
The researchers tested COMET against several alternative approaches to confirm the benefits came from the specific architectural choices rather than just having more data.
They compared against ridge regression with and without incorporating prior knowledge from pretraining. Ridge regression with priors showed some improvement (Pearson correlation from 0.572 to 0.799 in pregnancy prediction) but COMET still outperformed.
They tested a transformer-based architecture instead of the RNN. The transformer variant performed nearly as well (r = 0.848 vs 0.868 for pregnancy, AUROC = 0.842 vs 0.842 for cancer), confirming the benefit comes from transfer learning generally, not RNN-specific properties.
They tried training COMET-style models with metabolomics data instead of proteomics for the pregnancy cohort. Again, COMET exceeded the metabolomics-only baseline (r = 0.839 vs 0.758).
Across architectures, data modalities, and prediction tasks, the pattern held: pretraining on EHR data improves omics analysis.
Limitations and Extensions
The framework requires labeled outcomes in the pretraining dataset. For pregnancy prediction, this meant knowing delivery dates. For cancer, knowing mortality. Future work will explore self-supervised pretraining tasks that don't require specific labels—for example, predicting future EHR events from past events.
The current implementation uses OMOP-formatted EHR data, which requires manual mapping and may contain errors. Direct integration with raw EHR systems could reduce processing overhead and potential mapping mistakes.
The multimodal architecture assumes relatively simple processing of molecular data—essentially a feed-forward network. Some omics modalities (spatial transcriptomics, imaging mass spectrometry, longitudinal metabolomics) have complex structure that might benefit from more sophisticated architectures. Whether COMET's regularization benefits persist with more complex omics processing remains to be tested.
The cancer analysis showed modest overlap between EHR and proteomics signals. About two-thirds of proteins had no significant correlations with clinical features. This suggests opportunities for methods that can learn from each modality independently while still leveraging cross-modal interactions where they exist.
Sample size requirements aren't fully characterized. The pretraining cohorts were large (30,000+), but how much pretraining data is needed for meaningful benefit? The omics cohorts were small (61 and 559 patients), but where do the gains saturate as omics cohorts grow?
Implications
Most omics studies proceed from sample collection → assay → analysis. The cohort is what it is. Statistical power follows from sample size.
COMET suggests a different approach: identify the study population → collect molecular data from a subset → leverage clinical data from everyone.
This doesn't eliminate the value of larger cohorts. More molecular measurements always help. But it means the effective sample size for some analyses exceeds the number of samples that went through expensive assays.
For clinical translation, this matters. A hospital system planning a proteomics study of 100 patients can now effectively incorporate information from tens of thousands of patients in that same system who have EHR data. The molecular study becomes more statistically powerful without requiring more samples.
For biological discovery, the regularization effect is crucial. Models trained on small omics cohorts tend to overfit, learning patterns that reflect the specific patients in the study rather than generalizable biology. COMET's constraint—that the model must make sense in the context of what's known from clinical data—pushes toward more robust biological patterns.
The framework is hypothesis-agnostic. It doesn't require knowing which clinical features matter or which molecules matter. It learns those relationships from data. But it's not a black box either. The feature importance analyses and correlation patterns provide biological interpretability.
Perhaps most importantly, COMET changes how we think about multimodal data integration. The problem isn't just "how do we combine different data types?" It's "how do we use abundant data to make the most of scarce data?"
Electronic health records accumulate as a byproduct of clinical care. They're imperfect, noisy, incomplete. But they're available at scale. Treating them as a resource for omics analysis—not just as metadata or covariates, but as a pretraining corpus—opens new possibilities for extracting biological insight from limited molecular measurements.
The next proteomics study, the next metabolomics study, the next transcriptomics study—they don't need to start from scratch. They can start from what thousands of previous patients already taught the system through their medical records.
Credit & Disclaimer: This article is a popular science summary written to make peer-reviewed research accessible to a broad audience. All scientific facts, findings, and conclusions presented here are drawn directly and accurately from the original research paper. Readers are strongly encouraged to consult the full research article for complete data, methodologies, and scientific detail. The article can be accessed through https://doi.org/10.1038/s42256-024-00974-9