New AI System Reads Protein Sequences Without Databases

Protein identification sounds straightforward: zap a sample with a mass spectrometer, measure the fragments, match them against a database, and find your protein. It has worked for decades. But the system has a fatal flaw. It can only find proteins already in the database. Unknown organisms, newly evolved proteins, engineered molecules, and proteins bearing unexpected chemical modifications all slip through undetected.

Researchers have now developed a machine learning model that sidesteps this problem entirely. Called InstaNovo, the system reads peptide sequences directly from mass spectrometry data without consulting any database at all. In tests across eight different biological applications, it identified far more peptides than traditional methods and discovered novel proteins, unreported organisms, and previously missed cleavage sites in living cells.

The work represents a fundamental shift in how proteins can be identified at scale, opening avenues in drug development, microbiome studies, and the hunt for hidden proteins in the human body.

The Database Trap

Modern proteomics relies on what sounds like a simple idea: bottom-up protein analysis. Researchers dissolve a protein sample, chop the proteins into peptide fragments using enzymes, then run the peptides through a mass spectrometer. The instrument measures the weight of each fragment and how it breaks apart, creating a unique fingerprint for each peptide sequence.

To identify which peptide produced which fingerprint, scientists compare the data against a protein database. The database contains theoretical fingerprints for all known proteins in a given organism. If the spectrum matches something in the database, the peptide is identified.
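The matching step described above can be sketched in a few lines of Python. This is a minimal illustration, not how any production search engine scores spectra: the residue masses are standard monoisotopic values, but the peptide names, the y-ion-only fingerprint, the 0.02 Da tolerance, and the 80 percent match threshold are all assumptions chosen to keep the example small.

```python
# Monoisotopic residue masses in daltons (standard values, subset only).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "K": 128.09496, "E": 129.04259, "R": 156.10111,
}
WATER = 18.01056
PROTON = 1.00728


def y_ions(peptide):
    """Singly charged y-ion m/z values: each suffix mass plus water plus a proton."""
    ions = []
    total = WATER + PROTON
    for residue in reversed(peptide):
        total += RESIDUE_MASS[residue]
        ions.append(total)
    return ions


def match_fraction(observed, theoretical, tol=0.02):
    """Fraction of theoretical peaks that have an observed peak within tol daltons."""
    hits = sum(any(abs(o - t) <= tol for o in observed) for t in theoretical)
    return hits / len(theoretical)


def database_search(observed, database, min_fraction=0.8):
    """Return the best-matching database peptide, or None if nothing matches well."""
    best, best_score = None, 0.0
    for peptide in database:
        score = match_fraction(observed, y_ions(peptide))
        if score > best_score:
            best, best_score = peptide, score
    return best if best_score >= min_fraction else None


# Hypothetical two-entry database.
database = ["SAVLEK", "TNDPGR"]
known = database_search(y_ions("SAVLEK"), database)    # peptide is in the database
variant = database_search(y_ions("SAVDEK"), database)  # one substitution, not in the database
```

In this toy setup the known peptide is recovered, while the single-substitution variant comes back as None because most of its fragment masses no longer line up with anything in the database.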

This approach works well when you know what you are looking for. But it fails dramatically when you don't. A poor choice of database means missing entire proteins. Unknown organisms in a mixed sample remain invisible. Proteins with unexpected chemical modifications cannot be matched. Even small changes, like a single amino acid substitution or a cleavage event at an unusual site, fall outside the search space.

Worse, adding more possibilities to the search makes the computational cost skyrocket. Allowing for even a few post-translational modifications multiplies the number of candidate peptides combinatorially, and the computation grows with it. For researchers studying things like wound infections, snake venom, or engineered antibodies, database search becomes impractical.
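A rough way to see the growth is to count how many modified forms a single peptide can take when each eligible residue may or may not carry a modification. The sketch below is an illustration under simplifying assumptions: the peptide and the modification lists are hypothetical, and real search engines usually cap the number of modifications per peptide, which limits but does not remove the blow-up.

```python
from math import prod


def modified_forms(peptide, variable_mods):
    """Count the distinct modification states of one peptide.

    variable_mods maps a residue letter to the number of optional
    modifications allowed on it. Each matching residue can be left
    unmodified or carry any one of those modifications.
    """
    return prod(1 + variable_mods.get(residue, 0) for residue in peptide)


peptide = "SAMTYSK"  # hypothetical peptide
no_mods = modified_forms(peptide, {})
oxidation_only = modified_forms(peptide, {"M": 1})
with_phospho = modified_forms(peptide, {"M": 1, "S": 1, "T": 1, "Y": 1})
```

For this seven-residue peptide the count goes from 1 to 2 to 32 as oxidation and then phosphorylation are allowed, and every one of those forms has to be scored against every spectrum.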

An alternative method exists: de novo peptide sequencing, which reads the sequence directly from the spectrum without a database. Imagine looking at a mass spectrometry fingerprint and reconstructing the amino acid sequence on the spot. The approach is more flexible, but traditional algorithms have suffered from poor accuracy and high false discovery rates, making them unreliable for large-scale experiments.

Teaching Machines to Read Spectra

The team turned to transformer neural networks, the same architecture that powers large language models. The insight was elegant: mass spectra and language have something in common. Both are sequences of information that need to be understood and translated.

The model, InstaNovo, takes a mass spectrum as input and outputs a peptide sequence. The architecture works in layers. First, it encodes the spectrum peaks with their intensities using multi-scale sinusoidal embeddings, a technique designed to represent data at different resolutions simultaneously. Think of it as looking at a landscape from multiple zoom levels at once.
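The idea behind a multi-scale sinusoidal embedding can be sketched as follows. This is a generic illustration of the technique rather than InstaNovo's actual code: the function name, the embedding size, and the wavelength range are assumptions, and the sketch encodes a single m/z value without its intensity.

```python
import math


def sinusoidal_embedding(value, dim=8, min_wavelength=0.001, max_wavelength=10000.0):
    """Encode a scalar as sine/cosine pairs at geometrically spaced wavelengths.

    Short wavelengths react to tiny differences in the value; long
    wavelengths capture its coarse position. Together they describe
    the same number at several resolutions.
    """
    half = dim // 2
    ratio = max_wavelength / min_wavelength
    features = []
    for i in range(half):
        # Wavelengths run from min_wavelength up to max_wavelength.
        wavelength = min_wavelength * ratio ** (i / max(half - 1, 1))
        angle = 2 * math.pi * value / wavelength
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features


# Two hypothetical peaks only 0.01 m/z apart still receive different vectors.
peak_a = sinusoidal_embedding(500.25)
peak_b = sinusoidal_embedding(500.26)
```

The slow-varying components of the two vectors are nearly identical, which tells a model the peaks are neighbors, while the fast-varying components tell them apart.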

The encoded spectrum then flows through nine transformer encoder layers. Each layer uses self-attention to weigh every peak in the spectrum against the others, capturing how they relate to one another. A decoder then generates the peptide sequence amino acid by amino acid, from right to left, in the direction peptides typically fragment most strongly.

To ensure the model always outputs chemically valid sequences, the researchers implemented a constraint called knapsack beam search. This prevents the model from proposing sequences that don't match the measured molecular weight of the original peptide. It is a guardrail against nonsense.
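The mass guardrail can be illustrated with a small knapsack-style table. The sketch below is a simplified stand-in for the idea, not the published implementation: it uses a subset of residues, an assumed 0.01 Da resolution, and an assumed tolerance, and the function names are hypothetical. The table records which total masses can be built from whole amino acids, and a partial sequence is discarded if the mass still left to explain cannot be built.

```python
# Monoisotopic residue masses in daltons (standard values, subset only).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "K": 128.09496, "E": 129.04259, "R": 156.10111,
}


def build_reachable(max_mass_da, scale=100):
    """reachable[m] is True if some combination of residues sums to m / scale daltons."""
    limit = int(round(max_mass_da * scale))
    units = sorted({int(round(m * scale)) for m in RESIDUE_MASS.values()})
    reachable = [False] * (limit + 1)
    reachable[0] = True  # nothing left to explain: the sequence is complete
    for m in range(1, limit + 1):
        reachable[m] = any(u <= m and reachable[m - u] for u in units)
    return reachable


def can_complete(remaining_da, reachable, scale=100, tol_units=2):
    """True if the remaining mass can be filled by whole residues, within tolerance."""
    center = int(round(remaining_da * scale))
    lo = max(center - tol_units, 0)
    hi = min(center + tol_units, len(reachable) - 1)
    return any(reachable[m] for m in range(lo, hi + 1))


def prune_beam(partials, total_residue_mass, reachable):
    """Keep only partial sequences whose remaining mass is still achievable."""
    kept = []
    for partial in partials:
        used = sum(RESIDUE_MASS[r] for r in partial)
        if can_complete(total_residue_mass - used, reachable):
            kept.append(partial)
    return kept


reachable = build_reachable(700.0)
total = sum(RESIDUE_MASS[r] for r in "SAVLEK")  # hypothetical target peptide
survivors = prune_beam(["EK", "RRRK"], total, reachable)
```

Here the partial sequence "EK" survives because the leftover mass can be filled by real residues, while "RRRK" is dropped because the roughly 31 Da that remain are lighter than any amino acid.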

The model was trained on ProteomeTools, the largest available peptide mass spectrometry dataset, containing synthetic peptides from human proteins analyzed on state-of-the-art instruments. With 2.6 million high-confidence spectra, this training set gave the model exposure to diverse peptides and fragmentation patterns.

Iteration Improves Precision

But autoregressive sequence prediction of this kind has a limitation: it generates tokens one at a time, in a fixed direction. The model makes its best guess at each position without reconsidering earlier decisions.

Researchers reasoned that humans approach de novo sequencing differently. They start with an initial, fuzzy prediction based on the most obvious, intense peaks in the spectrum, then refine it step by step, revisiting and correcting their interpretation as they look more carefully.

This insight led to InstaNovo+, a diffusion-based refinement model. Rather than predicting the sequence once, InstaNovo+ takes an initial sequence, corrupts it slightly, and then learns to denoise it step by step over 20 iterations. With each iteration, the model uses the spectrum data to guide corrections, much like slowly bringing a blurry image into focus.
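The corruption half of that process is easy to picture in code. The sketch below shows only a generic forward noising step for a discrete sequence, in which each residue is occasionally swapped for a random one. It is not InstaNovo+ itself: the noise rate, the uniform replacement rule, and the function names are assumptions, and the learned, spectrum-guided denoiser that reverses these steps is deliberately left out because it requires a trained network.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def corrupt(sequence, noise_rate, rng):
    """Replace each residue with a uniformly random one with probability noise_rate."""
    out = []
    for residue in sequence:
        if rng.random() < noise_rate:
            out.append(rng.choice(ALPHABET))
        else:
            out.append(residue)
    return "".join(out)


def noising_trajectory(sequence, steps=20, noise_rate=0.1, seed=0):
    """Apply the corruption step repeatedly and return every intermediate state."""
    rng = random.Random(seed)
    states = [sequence]
    for _ in range(steps):
        states.append(corrupt(states[-1], noise_rate, rng))
    return states


# Hypothetical starting peptide; later states drift further from it.
trajectory = noising_trajectory("SAVLEK")
```

A diffusion model is trained to undo trajectories like this one step at a time, so that at prediction time it can start from a rough sequence and walk it back toward one that fits the spectrum.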

When the team fed InstaNovo predictions into InstaNovo+, performance jumped. The diffusion model caught errors the transformer missed. Importantly, InstaNovo+ also identified peptides that InstaNovo missed entirely. The two models complemented rather than merely refined each other. Together, they identified 41.78 percent more peptides than the previous best competitor, Casanovo.

Discovery in Action

The real test came in eight biological applications spanning simple cell lysates, engineered antibodies, microbiome samples, and the hidden proteome.

In HeLa human cancer cells, InstaNovo alone identified more than 8,700 correct peptides and found 1,338 additional ones that database searching had missed entirely. That represents a 7.5 percent increase in coverage. For nanobodies, engineered immune proteins used in diagnostics and therapy, the model achieved 8-fold higher peptide detection when searching the complete spectrum space compared to traditional database restriction. Protein coverage for Herceptin, a cancer drug, reached 92.87 percent for heavy chains and 100 percent for light chains.
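Protein coverage figures like these are usually computed as the share of a protein's residues that fall inside at least one identified peptide. The sketch below shows that arithmetic under simple assumptions: exact string matching, a hypothetical toy protein, and made-up peptides. It is not the pipeline used in the study.

```python
def sequence_coverage(protein, peptides):
    """Percentage of protein residues covered by at least one peptide."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for peptide in peptides:
        # Mark every occurrence of the peptide, including overlapping ones.
        start = protein.find(peptide)
        while start != -1:
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = protein.find(peptide, start + 1)
    return 100.0 * sum(covered) / len(protein)


# Hypothetical 12-residue protein with two overlapping peptides.
coverage = sequence_coverage("SAVLEKTNDPGR", ["SAVLEK", "EKTND"])
```

The two peptides overlap, so together they cover 9 of the 12 residues, giving 75 percent coverage.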

In wound fluid samples from patients with leg ulcers, the model mapped albumin to over 1,200 spectra and detected proteins from five different bacterial species, including Pseudomonas aeruginosa and Escherichia coli. Researchers confirmed these organisms were actually present using PCR, validating the de novo predictions.

A metaproteomics study of a marine bacterial co-culture revealed five additional bacterial species that were not in the original reference database, including Phototrophicales bacterium and Candidatus Scalindua arabica. The model discovered these organisms without any prior knowledge of their presence.

In immunopeptidomics, where researchers study the peptides presented on immune cell surfaces, InstaNovo identified 3,495 novel peptides beyond those found by database search alone, increasing the detection rate by 41.53 percent. The identified peptides showed sequence patterns consistent with how the immune system naturally processes proteins, giving confidence in the predictions.

For degradome studies, which investigate how protease enzymes cut proteins, the model predicted 4,635 new peptide sequences in HeLa cells incubated with the protease GluC. Importantly, 1,222 of these matched the known specificity profile of the enzyme, showing the model had learned real biology rather than generating noise.

Why This Matters

The implications extend well beyond academic curiosity. Drug developers can now sequence antibody therapeutics more thoroughly, improving quality control. Microbiome researchers can identify organisms in complex samples without pre-existing genomic data. Studies of single-cell proteomics, where detecting every possible peptide from minuscule protein amounts is crucial, could benefit enormously from improved detection rates.

The "dark proteome," the vast landscape of proteins and proteoforms currently invisible to standard methods, becomes more accessible. Alternative splicing variants, unexpected post-translational modifications, coding mutations, and proteins from cryptic organisms all become detectable in one sweep.

Scale matters too. Deep learning approaches have computational costs that increase linearly with the number of spectra analyzed. Traditional database searching, by contrast, can scale exponentially as modifications and sequence variants are added to the search space. For researchers analyzing millions of spectra, the practical advantage is enormous.

The models are already publicly available. Researchers can upload their data to a web interface and receive predictions without local installation. Code, model checkpoints, and training datasets are open source.

The work does not eliminate database searching entirely. Uncertainty remains about how well the models generalize to different mass spectrometers, sample types, and fragmentation methods. Fine-tuning on specific datasets will likely improve performance further. Post-processing and multivariate filtering to refine predictions could boost both sensitivity and specificity.

But for the first time, a general-purpose neural network has proven it can read peptide sequences from mass spectra with high accuracy across diverse biological contexts. It is a capability that feels like it belongs to a different era of protein science, one where the instrument tells the story directly, and the database serves only as a reference for validation rather than as the gatekeeper of discovery.

Credit & Disclaimer: This article is a popular science summary written to make peer-reviewed research accessible to a broad audience. All scientific facts, findings, and conclusions presented here are drawn directly and accurately from the original research paper. Readers are strongly encouraged to consult the full research article for complete data, methodologies, and scientific detail. The article can be accessed through https://doi.org/10.1038/s42256-025-01019-5
