AI Is Finally Revealing What the Genome’s ‘Dark Matter’ Does

AlphaGenome brings unprecedented clarity to genetic variants that don't code for proteins—and could reshape how doctors interpret disease risk.

Imagine trying to understand how a tiny change in a vast book might ripple through an entire library system. That's roughly the challenge geneticists face when trying to interpret what billions of genetic variants actually do in the human body. Most of the genome doesn't code for proteins at all, and roughly 98 percent of human genetic variants fall in these non-coding regions. When mutations occur there, they can cause disease in ways that remain frustratingly opaque.

Now, researchers have built an AI system powerful enough to cut through that opacity. Called AlphaGenome, the model represents a significant leap in predicting how genetic variants reshape the regulatory landscape of our cells. It reads a megabase of DNA sequence (one million letters, or roughly one three-thousandth of the human genome) and simultaneously predicts thousands of molecular changes that might result from a genetic variant: whether gene expression increases or decreases, how splicing patterns shift, how tightly DNA packs in chromatin, where transcription factors bind, and more.

The results are striking. When tested against the strongest existing specialized tools, AlphaGenome matched or exceeded performance on 25 out of 26 benchmarks for predicting variant effects. More importantly, it does something no prior model could do: it assesses a variant's impact across all these dimensions in a single computational pass, offering a unified mechanistic view of how a genetic change cascades through the regulatory code.

For rare disease diagnosis, precision medicine, and understanding complex genetic diseases, this represents a meaningful step forward.

The Non-Coding Challenge

The puzzle begins with a fundamental asymmetry in the genome. Humans inherit roughly 4 to 5 million single-letter variants compared with a reference genome. Of those, fewer than 2 percent alter protein sequences. The rest sit in the "dark matter" of the genome: regions that control when and where genes turn on, how RNA gets processed, how DNA wraps around proteins, and countless other regulatory functions.

These non-coding variants are genuinely difficult to interpret. A single mutation might loosen chromatin structure in a way that allows a transcription factor to access its binding site, leading to overexpression of a nearby oncogene. Or it might disrupt a delicate motif that controls how RNA gets spliced, producing a truncated, non-functional protein. Or it might alter the strength of an enhancer, subtly changing expression levels in a tissue-specific way.

Existing computational models tried to make sense of this terrain. Some specialized tools excel at predicting whether a variant disrupts splicing. Others focus on gene expression. Still others predict chromatin accessibility. The problem: they all face fundamental tradeoffs. Models that work at base-pair resolution—necessary for detecting fine features like splice sites—struggle to process long DNA sequences because of computational limits. Models that can ingest longer sequences must sacrifice resolution, blurring critical regulatory details.
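A rough back-of-the-envelope calculation shows why this tradeoff bites. The sketch below assumes a standard self-attention layer, whose work grows with the square of the number of positions it compares. The function name, the megabase length of 2**20 letters, and the 128-base-pair bin are illustrative choices for this example, not details confirmed by the article.

```python
def attention_pairs(sequence_length_bp, bin_size_bp):
    """Count the position pairs a full self-attention layer would compare."""
    positions = sequence_length_bp // bin_size_bp
    return positions * positions


# Assumption: treat one megabase as 2**20 letters so it divides evenly.
ONE_MEGABASE = 2 ** 20

fine = attention_pairs(ONE_MEGABASE, 1)      # every single letter is a position
coarse = attention_pairs(ONE_MEGABASE, 128)  # one position per 128-letter bin

print("pairs at 1 bp resolution:  ", fine)
print("pairs at 128 bp resolution:", coarse)
print("cost ratio:", fine // coarse)
```

Under these assumptions, binning the sequence cuts the comparison count by a factor of 128 squared, which is why long-context models have tended to give up single-letter detail.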

There was also a second, more insidious tradeoff: generality versus performance. Specialized models often outperformed generalist ones at their specific task, but users needed to deploy multiple models in sequence, patching together incomplete pictures of variant effects.

Building a Unified Model

AlphaGenome was designed to break both tradeoffs simultaneously. The system ingests one megabase of DNA sequence, a span large enough that 99 percent of validated enhancer-gene pairs fall within it, and therefore long enough to capture important distal regulatory context. It outputs predictions at base-pair resolution for some modalities (gene expression, chromatin accessibility, splicing) and at slightly coarser resolution (128 base pairs) for others where resolution is naturally limited by the experimental data, like transcription factor binding.
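To make the idea of scoring a variant concrete, here is a minimal sketch of a common pattern for sequence models: predict a track for the reference sequence, predict it again with the variant applied, and take the difference. The `toy_track` function is a hypothetical stand-in that simply flags a TATA motif; it is not AlphaGenome, and none of these names come from its API.

```python
def toy_track(sequence, motif="TATA"):
    """Hypothetical stand-in for a model: 1 where the motif starts, else 0."""
    return [1 if sequence.startswith(motif, i) else 0 for i in range(len(sequence))]


def apply_snv(sequence, position, alt_base):
    """Return a copy of the sequence with one letter swapped (0-based position)."""
    return sequence[:position] + alt_base + sequence[position + 1:]


def variant_effect(sequence, position, alt_base):
    """Per-position difference between alternate and reference predictions."""
    ref_track = toy_track(sequence)
    alt_track = toy_track(apply_snv(sequence, position, alt_base))
    return [alt - ref for ref, alt in zip(ref_track, alt_track)]


reference = "GGTATAGG"
effect = variant_effect(reference, position=3, alt_base="C")  # A -> C breaks the motif
print(effect)
```

A real model would return many such tracks at once, one per assay and cell type, but the comparison between reference and alternate predictions works the same way.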

The architecture itself is elegant: a U-Net-inspired deep learning model that progressively downsamples the DNA sequence while extracting features, processes information through transformer blocks that can attend to long-range patterns, and then upsamples back to high resolution. Splitting computation across eight tensor processing units allowed the team to train on the full megabase without running out of memory.
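The downsample-then-upsample shape of such a design can be sketched in a few lines. This toy version uses plain averaging and value repetition in place of learned layers, leaves out the transformer step entirely, and adds skip connections as U-Net-style models typically do. Every function name here is hypothetical, and nothing below reflects the model's actual layers or sizes.

```python
def downsample(track, factor=2):
    """Average neighbouring values, shrinking the track by the given factor."""
    return [sum(track[i:i + factor]) / factor for i in range(0, len(track), factor)]


def upsample(track, factor=2):
    """Repeat each value to restore the finer resolution."""
    return [value for value in track for _ in range(factor)]


def unet_like_pass(signal, depth=3):
    """Run a toy encoder and decoder; length must be divisible by 2**depth."""
    skips = []
    x = list(signal)
    for _ in range(depth):            # encoder: shorter and coarser each step
        skips.append(x)
        x = downsample(x)
    coarse_length = len(x)            # a real model would apply transformer blocks here
    for skip in reversed(skips):      # decoder: restore resolution, add skip details
        x = [up + fine for up, fine in zip(upsample(x), skip)]
    return x, coarse_length


output, coarse_length = unet_like_pass([float(i) for i in range(16)], depth=3)
print(len(output), coarse_length)
```

The point of the shape is that the expensive long-range step runs on the short, coarse sequence, while the skip connections carry fine detail back to the full-resolution output.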

Training employed a two-stage strategy. In the pretraining phase, fold-specific models learned to predict genome tracks from experimental data, partitioning the genome to create true held-out test regions. Then came distillation: a single "student" model learned to reproduce the predictions of an ensemble of "teacher" models, incorporating random mutations and sequence transformations during training. This final model proved remarkably efficient—it can score a variant for all modalities and cell types in under a second on a standard GPU.
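The distillation step can be illustrated with a deliberately tiny example. Here three hypothetical "teachers" are straight lines with slightly different slopes, and a one-parameter "student" is trained by gradient descent to match their average output on randomly drawn inputs, which stand in for the mutated and transformed sequences. The learning rate, step count, and slopes are invented for the sketch.

```python
import random


def make_teacher(slope):
    """Build a hypothetical teacher: a line through the origin."""
    return lambda x: slope * x


def ensemble_mean(teachers, x):
    """Average the teachers' predictions for one input."""
    return sum(teacher(x) for teacher in teachers) / len(teachers)


def distill(teachers, steps=2000, learning_rate=0.05, seed=0):
    """Fit a single student weight to the ensemble's average prediction."""
    rng = random.Random(seed)
    weight = 0.0
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)              # stand-in for an augmented input
        target = ensemble_mean(teachers, x)     # soft label from the teachers
        error = weight * x - target
        weight -= learning_rate * 2 * error * x  # gradient of the squared error
    return weight


teachers = [make_teacher(slope) for slope in (1.8, 2.0, 2.2)]
student_weight = distill(teachers)
print(round(student_weight, 3))
```

The student ends up near the average of the teachers' slopes, so one small model reproduces what the whole ensemble would have predicted.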

Superior Performance Across the Board

The benchmarks paint a consistent picture. On 22 of 24 genome track prediction tasks, AlphaGenome achieved state-of-the-art performance when evaluated on previously unseen DNA sequences. On variant effect prediction—arguably more clinically relevant—it matched or exceeded the strongest competing model in 25 of 26 evaluations.

For gene expression, AlphaGenome improved upon the prior best model (Borzoi) by 14.7 percent at predicting cell-type-specific expression changes. For splicing, it developed a novel approach to predicting splice junctions alongside splice sites and splice site usage, providing a more complete picture of how variants disrupt RNA processing. The model even achieved state-of-the-art performance at predicting transcription factor binding and chromatin contact maps—despite not being explicitly specialized for those tasks.

One particularly telling result: when predicting the direction of effect for expression quantitative trait loci (eQTL) using a conservative score threshold, AlphaGenome recovered twice as many variants as Borzoi while maintaining 90 percent accuracy. This matters for clinical interpretation. Many disease-associated variants in genome-wide association studies (GWAS) can now be assigned a likely direction of effect—whether a variant increases or decreases expression of a candidate gene—which is crucial information for generating biological hypotheses.
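The logic of a conservative score threshold is simple to sketch. In the example below, each variant has a predicted score and an observed direction of effect; only variants whose score clears the threshold receive a call, and accuracy is measured on those calls alone. The data are made up for illustration, and the function name is hypothetical.

```python
def sign_predictions(scored_variants, threshold):
    """Return (number of confident calls, accuracy among those calls)."""
    confident = [(score, observed) for score, observed in scored_variants
                 if abs(score) >= threshold]
    if not confident:
        return 0, None
    correct = sum(1 for score, observed in confident
                  if (score > 0) == (observed > 0))
    return len(confident), correct / len(confident)


# Invented toy data: (predicted score, observed direction as +1 or -1).
toy_variants = [(0.9, 1), (-0.8, -1), (0.05, -1), (-0.6, -1), (0.02, 1), (0.4, -1)]

print(sign_predictions(toy_variants, threshold=0.5))
print(sign_predictions(toy_variants, threshold=0.0))
```

Raising the threshold trades coverage for accuracy, so a model that recovers more variants at the same accuracy is separating confident calls from noise more effectively.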

Multimodal Interpretation of Disease

The real power of AlphaGenome emerges when its unified approach is applied to interpret complex variants. The researchers demonstrated this by analyzing three separate groups of oncogenic mutations affecting the TAL1 gene in T cell acute lymphoblastic leukemia. These mutations arise in different genomic contexts—some upstream of the gene, some intronic, some downstream—yet all converge on the same mechanism: driving overexpression of the TAL1 oncogene.

When AlphaGenome examined these variants, it predicted that one particular insertion created a binding motif for the MYB transcription factor, increased the active chromatin mark H3K27ac around that site, decreased repressive chromatin marks near the gene promoter, and ultimately drove elevated TAL1 expression in the relevant cell type (CD34-positive common myeloid progenitors, the closest available match to the cell of origin in leukemia).

All these predictions aligned with prior experimental work published separately. But critically, AlphaGenome generated them without any special training on these specific cancer variants or cell types. It learned the underlying regulatory grammar from the breadth of its training data.

Practical Implications

For rare disease diagnostics, this capability could prove transformative. Thousands of patients carry genetic variants of uncertain significance: changes that don't fit neatly into existing classification schemes. AlphaGenome could provide functional evidence that a variant genuinely disrupts a regulatory element, complementing conservation-based deleteriousness scores and other existing tools.

The model also opens doors for therapeutic design. Its predictions of how variants affect splicing, expression and accessibility could accelerate development of antisense oligonucleotides that correct splicing defects, or guide the design of tissue-specific enhancers for gene therapy.

Beyond diagnostics, AlphaGenome serves as a powerful engine for hypothesis generation. Researchers studying a disease-associated locus can now generate detailed predictions about which specific regulatory elements are likely affected and which genes might be impacted—directing precious experimental resources toward the most promising candidates.

Limitations and Horizons

Like all models trained on current genomic data, AlphaGenome has limits worth acknowledging. It struggles with very distal regulatory elements more than 100 kilobases away, though its megabase-scale input lets it handle them better than previous models did. Predicting tissue-specific effects remains imperfect. Its training emphasized protein-coding genes, leaving room for improvement on non-coding RNAs. And because it predicts molecular consequences rather than phenotypes, it cannot directly forecast whether a variant causes disease; that requires additional biological knowledge about gene function, development, and pathway context.

The researchers are transparent about these boundaries. They note that the model is trained primarily on human data, with more limited mouse coverage, and hasn't been benchmarked on personal genome interpretation—a known weak point for sequence-based prediction models across the field.

Yet the advances are undeniable. By unifying multimodal prediction, long-range sequence context, and base-pair resolution into a single framework, AlphaGenome clears a major hurdle in genomics: mechanistically interpreting the non-coding variants that make up the vast majority of human genetic variation. For clinicians trying to explain a patient's genetic results, for researchers dissecting disease biology, and for developers designing therapeutic interventions, access to this kind of model represents a tangible shift in capability.

The team has made AlphaGenome available through an online API and released the code and model weights, democratizing access to what might otherwise remain locked inside a technology company. That openness, combined with the model's demonstrated accuracy and speed, suggests it will quickly become a foundational tool in how genetic variants get interpreted—and ultimately, how we understand the regulatory code written into our DNA.

Credit & Disclaimer: This article is a popular science summary written to make peer-reviewed research accessible to a broad audience. All scientific facts, findings, and conclusions presented here are drawn directly and accurately from the original research paper. Readers are strongly encouraged to consult the full research article for complete data, methodologies, and scientific detail. The article can be accessed through https://doi.org/10.1038/s41586-025-10014-0

Medical Disclaimer: This article is for informational and educational purposes only and does not constitute medical advice, diagnosis, or treatment. Always seek the advice of your physician or another qualified health provider with any questions you may have regarding a medical condition. Never disregard professional medical advice or delay in seeking it because of something you have read in this publication.
