Imagine you need to understand the relationships between a million people in a city. Who knows whom, how well, and in what context. To do this perfectly, you would need to examine every possible pair, which amounts to half a trillion interactions. Even with the fastest computers on Earth, that task would grind most research to a halt.
This is essentially the problem that scientists face every day in fields ranging from drug discovery to climate modelling. And for decades, the brute force approach — crunch everything, examine every relationship — was simply accepted as the cost of doing business.
But what if you didn't need to examine every pair? What if a clever mathematical strategy could give you results that are nearly as accurate, using only a tiny fraction of the effort?
That is precisely what a new study published in one of mathematics' most prestigious journals has demonstrated. The research introduces a rigorous theoretical foundation for an algorithm called Randomly Pivoted Cholesky — or RPCholesky — and shows that it is not just fast, but provably excellent. The implications stretch from artificial intelligence to chemistry, physics, and beyond.
The Hidden Bottleneck in Modern AI
To understand why this matters, we need to talk about something called kernel methods. These are powerful mathematical tools used in machine learning to find patterns in data. Think of them as a way to measure similarity: how alike are two molecules? Two images? Two climate patterns?
The problem is that measuring similarity between every pair of data points gets expensive very quickly. If you have 10,000 data points, you need to fill in a table with 100 million entries. If you have 100,000 points, that table balloons to 10 billion entries. Doing anything useful with a table that enormous — solving equations, finding patterns, making predictions — requires enormous computational resources and time.
Mathematically, this table of pairwise similarities is called a kernel matrix, and working with the full matrix directly can require so many calculations that it becomes practically impossible at scale. Researchers have long known that these matrices often have a hidden simplicity buried inside them: most of their information is concentrated in a relatively small number of directions. In mathematical terms, the matrix is nearly low rank.
This insight has driven decades of research into approximation methods. The idea is simple in spirit: instead of filling in the entire table, find the most informative columns of that table, and reconstruct the rest from those. Get the right columns, and your approximation is almost as good as the real thing, at a fraction of the cost.
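To make the column idea concrete, here is a minimal Python sketch (not code from the paper) of how a handful of selected columns can reconstruct an approximation of the whole similarity table. The Gaussian kernel, the toy data, and the function name column_nystrom are illustrative assumptions, and the selected indices S are simply taken as given.

```python
import numpy as np

def column_nystrom(A, S):
    """Approximate a symmetric PSD matrix A from the columns indexed by S.

    Returns A_hat = A[:, S] @ pinv(A[S, S]) @ A[S, :], the standard
    column-based (Nystrom) reconstruction built only from those columns.
    """
    C = A[:, S]                        # the selected columns
    W = A[np.ix_(S, S)]                # the small core linking them
    return C @ np.linalg.pinv(W) @ C.T

# Toy check on a 1,000-point Gaussian kernel matrix (formed in full here
# only so the approximation error can be measured).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
A = np.exp(-sq_dists)

S = rng.choice(1000, size=50, replace=False)   # 50 columns, chosen uniformly for now
A_hat = column_nystrom(A, S)
print(np.linalg.norm(A - A_hat) / np.linalg.norm(A))   # relative error
```

With well-chosen columns, the relative error printed above can be tiny even though only 50 of the 1,000 columns were used.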
The hard part is knowing which columns to pick.
The Goldilocks Problem of Column Selection
Over the years, researchers developed several strategies for picking columns, each with its own strengths and weaknesses.
One approach, called greedy pivoting, always picks the column that looks most important at the current moment. It is like always grabbing the shiniest object in the room. This works well when the data is tidy and uniform, but it fails badly in the presence of outliers — rare but important data points that the greedy method becomes fixated on, at the expense of everything else.
Another approach is uniform random sampling, which just picks columns at random, ignoring all available information. This avoids the outlier problem but introduces a different flaw: it can completely miss small but crucial clusters within the data. In one of the paper's illustrative experiments, uniform sampling had a 99.6% chance of completely missing two small clusters of data points shaped like eyes within a larger dataset shaped like a smiling face.
More sophisticated methods, such as determinantal point process (DPP) sampling, come with stronger theoretical guarantees but require an enormous amount of computation to implement — sometimes between 80 and 200 times more work than simpler methods, making them impractical for large datasets.
Enter RPCholesky, which the authors describe as the Goldilocks solution. It neither always chases the biggest value nor ignores all information. Instead, at each step it randomly selects the next column with a probability proportional to how much unexplained variation that column still contains. Columns that carry more residual information are more likely to be picked, but columns with less information still have a chance. This balance between exploitation and exploration is, it turns out, exactly what is needed.
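In code, the whole procedure fits in a short loop. The sketch below follows the description above and is meant as an illustration, not the authors' reference implementation; the callback name kernel_column and the random-number handling are assumptions.

```python
import numpy as np

def rpcholesky(kernel_column, diag, k, rng=None):
    """Randomly Pivoted Cholesky: return an N x k factor F with A ≈ F @ F.T.

    kernel_column(i) returns column i of the kernel matrix (length N);
    diag is its diagonal.  Only k columns plus the diagonal are ever read,
    i.e. (k + 1) * N entries in total.
    """
    rng = np.random.default_rng(rng)
    N = len(diag)
    F = np.zeros((N, k))
    d = np.array(diag, dtype=float)            # residual diagonal

    for t in range(k):
        # Sample the next pivot in proportion to the residual diagonal:
        # columns with more unexplained variation are more likely,
        # but every column with residual mass still has a chance.
        s = rng.choice(N, p=d / d.sum())

        # Read one column of A and remove what is already explained.
        g = kernel_column(s) - F[:, :t] @ F[s, :t]

        # Standard Cholesky step with the sampled pivot.
        F[:, t] = g / np.sqrt(g[s])

        # Update the residual diagonal; clip to guard against rounding noise.
        d = np.maximum(d - F[:, t] ** 2, 0.0)

    return F
```

The factor F can then stand in for the full matrix wherever an approximation of it is needed.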
The authors connect all three strategies within a single mathematical framework involving temperature. Greedy pivoting is the "zero temperature" limit, obsessively locking onto the best option. Uniform sampling is the "infinite temperature" limit, ignoring all structure. RPCholesky sits at an intermediate temperature — not too hot, not too cold.
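One way to picture this family in symbols, as an illustration rather than the paper's exact parametrization: if d_i is the unexplained variation still carried by column i, sample it with probability proportional to a power of d_i controlled by an inverse temperature β.

```latex
\Pr(\text{pick column } i) \;\propto\; d_i^{\beta},
\qquad
\beta \to \infty \;\text{(greedy)}, \quad
\beta = 1 \;\text{(RPCholesky)}, \quad
\beta = 0 \;\text{(uniform)}.
```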
What the Math Guarantees
The theoretical heart of the paper is a proof that RPCholesky produces approximations that are nearly as good as mathematically possible, given the number of columns it examines.
To understand the significance of this, consider what "best possible" means here. There is a theoretical limit: to match the quality of the best rank-k approximation to within an error tolerance ε, any method that builds its approximation from selected columns must, in the worst case, use on the order of k divided by ε columns. The authors prove that RPCholesky achieves this near optimal performance, needing only a modest logarithmic factor more columns than the theoretical minimum.
In practical terms, RPCholesky examines only (k + 1) × N individual entries of the matrix to produce a rank-k approximation of an N × N matrix, and performs only on the order of k² × N additional arithmetic operations. Compare this to the brute force approach, which requires N³ operations — and you can see why, when N is large, the difference is not just convenient but transformative.
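To put rough numbers on that, take the counts quoted above at face value and consider a matrix with N = 100,000 points and a rank k = 1,000 approximation (illustrative figures, not taken from the paper's experiments):

```latex
\underbrace{(k+1)\,N \approx 10^{8}}_{\text{entries read}}
\quad\text{and}\quad
\underbrace{k^{2} N \approx 10^{11}}_{\text{extra arithmetic}}
\qquad\text{versus}\qquad
\underbrace{N^{3} = 10^{15}}_{\text{brute force}},
```

a reduction of roughly four orders of magnitude in the dominant cost.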
The researchers also proved that RPCholesky performs significantly better than greedy methods and uniform sampling in worst case scenarios. Greedy methods may require examining a number of columns proportional to the total size of the matrix N, while uniform sampling's performance can degrade badly based on how unevenly the important information is distributed. RPCholesky avoids both failure modes.
Putting It to the Test: Molecules and Drug Discovery
Theory is essential, but what does RPCholesky actually do for real science? The paper presents two compelling real world applications that demonstrate its power.
The first involves predicting the energy of a key part of organic molecules — specifically, the energy of the outermost electrons, a property called the highest occupied molecular orbital (HOMO) energy. This value determines how readily a molecule donates electrons, which is critical for understanding chemical reactivity and designing new drugs or materials.
Calculating HOMO energy from first principles using quantum mechanics is extremely computationally expensive. Researchers have been exploring whether machine learning, specifically a technique called kernel ridge regression, can learn to predict it accurately from molecular structure data alone.
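To see where a column-selection method slots into kernel ridge regression, here is a generic sketch of the "restricted" variant, which only ever forms the kernel block between all data points and a small set of selected landmark points. The Gaussian kernel, the regularization parameter, and the function names are illustrative assumptions; the paper's molecular kernel and pipeline details differ.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Y."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def restricted_krr_fit(X, y, S, lam=1e-3, gamma=1.0):
    """Kernel ridge regression restricted to the landmark points X[S].

    Solves  min_beta ||K[:, S] @ beta - y||^2 + lam * beta^T K[S, S] beta,
    so only the N x |S| kernel block is ever formed.
    """
    K_ns = rbf_kernel(X, X[S], gamma)        # N x |S| block of the kernel matrix
    K_ss = rbf_kernel(X[S], X[S], gamma)     # small |S| x |S| core
    beta = np.linalg.solve(K_ns.T @ K_ns + lam * K_ss, K_ns.T @ y)
    return beta

def restricted_krr_predict(X_new, X, S, beta, gamma=1.0):
    """Predict at new points using only kernel values against the landmarks."""
    return rbf_kernel(X_new, X[S], gamma) @ beta
```

Fitting then costs on the order of N times the square of the number of landmarks, rather than anything cubic in N, which is the kind of saving that makes training on very large datasets feasible.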
The dataset used in this test, called QM9, contains information on roughly 134,000 organic molecules. Previous work had been limited to training on at most 64,000 molecules because processing the full dataset was too costly. By plugging RPCholesky into the computational pipeline, the researchers were able to use all 100,000 training points — at a cost of just five minutes on a standard laptop computer.
For the nine largest and most complex molecules in the test set — those with 29 atoms — RPCholesky delivered prediction errors that were 10 to 30 percent smaller than competing methods, and a remarkable 17 times smaller than simple random sampling. This matters enormously in drug discovery, where accurately characterising unusual or complex molecules is often precisely the point.
Decoding How Molecules Move: A Biophysics Success Story
The second application takes us into the world of molecular dynamics, the study of how molecules move and change shape over time.
Proteins and other biomolecules are not static objects. They flex, fold, and transition between different shapes called conformations, and these shape changes often determine their biological function. Understanding which conformations a molecule spends most of its time in — the so called metastable states — is a central challenge in computational biochemistry.
The researchers tested RPCholesky on a dataset of a small model molecule called alanine dipeptide, tracking its movements over 250 nanoseconds with position measurements taken every picosecond. This yielded 250,000 data points, each representing a configuration of the molecule in three dimensional space.
Directly applying a standard kernel-based clustering algorithm to identify metastable states would require enormous computation. The standard workaround in the field has been to subsample the data heavily and then run the analysis on high performance supercomputers, sometimes for days.
With RPCholesky accelerated clustering, the researchers processed all 250,000 data points in just 20 seconds on a laptop. The algorithm correctly identified the four metastable states of the molecule.
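A generic way to make such a clustering step cheap with a low-rank factor is sketched below, under the assumption that one clusters on leading spectral coordinates; the paper's exact normalization and clustering details may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def lowrank_spectral_clusters(F, n_clusters, seed=0):
    """Cluster N points using a rank-k factor F with A ≈ F @ F.T.

    The leading eigenvectors of the (never formed) N x N kernel matrix are
    recovered from the thin SVD of F, then clustered with k-means, at a cost
    of roughly O(N * k^2) instead of anything cubic in N.
    """
    U, _, _ = np.linalg.svd(F, full_matrices=False)   # N x k orthonormal columns
    spectral_coords = U[:, :n_clusters]               # leading spectral coordinates
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(spectral_coords)
```

Combined with a factor like the one from the RPCholesky sketch earlier, this pipeline touches only a few hundred of the 250,000 columns of the full similarity table.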
More impressively, with only 150 selected representative columns from the full dataset, RPCholesky produced a misclassification rate that was 9 to 14 times smaller than competing methods using the same number of columns. At that column count, the competing methods frequently produced completely wrong clusterings, while RPCholesky reliably found the correct structure.
A Comparison That Puts It All in Perspective
Across a testbed of 20 different real world datasets, the researchers compared RPCholesky against four alternative methods: uniform sampling, ridge leverage score sampling, greedy pivoting, and a blocked version of RPCholesky itself.
RPCholesky achieved the lowest approximation error on every single problem in the testbed. For one dataset, it achieved an error that was 80 million times smaller than uniform sampling and 5,000 times smaller than ridge leverage score sampling. Even against the greedy method, which also benefits from diagonal information, RPCholesky showed consistent advantages.
Just as notably, RPCholesky achieves this with a minimal number of individual data lookups. Unlike DPP sampling, which can require between 80 and 200 times as many data accesses as simpler methods, RPCholesky needs only one additional pass through the data beyond what uniform sampling requires.
Why This Matters Beyond Science
It is easy to see RPCholesky as a niche mathematical tool, but the broader picture is striking.
Kernel methods are used across a wide range of scientific and applied fields. Drug discovery relies on them to predict molecular properties. Climate scientists use them to model complex atmospheric patterns. Engineers apply them in signal processing and image recognition. Financial analysts have explored them for risk modelling. Anywhere that data contains rich structure and pairwise relationships matter, kernel methods — and therefore tools like RPCholesky — are potentially relevant.
The practical effect of RPCholesky is to make computations that were previously restricted to supercomputers feasible on a standard laptop. That democratises access to powerful machine learning for researchers and institutions that do not have access to massive computing infrastructure. It could accelerate scientific discovery in fields where data is abundant but computational resources are limited.
There is also a sustainability angle. Running large scale computations consumes significant energy. Algorithms that achieve the same or better results with a fraction of the computational work directly reduce the energy footprint of scientific computing.
A Simple Idea, Surprisingly Late to Arrive
Perhaps the most remarkable aspect of this story is how recently RPCholesky was recognised as a serious contender.
The core idea — selecting pivot columns with probability proportional to their remaining importance — had been briefly noted in a 2017 paper and appeared in a different context in 2020. But it had never been seriously analysed for the low rank approximation task that makes it so valuable. Earlier work on related algorithms dismissed them as computationally impractical, based on a misunderstanding of how they should be implemented.
The contribution of this research is not just to clean up the theory, but to bring a powerful, elegant algorithm into the light it deserves. The authors provide the first rigorous proof that RPCholesky achieves near optimal approximation guarantees, the first comprehensive numerical comparisons across real scientific datasets, and practical algorithms that can be implemented in just a few lines of code.
The code is publicly available, meaning any researcher can start using these methods immediately.
Looking Forward
The researchers are already exploring extensions of their work. A follow up paper has addressed how to speed up the algorithm further using a blocking strategy — processing multiple columns at once — while carefully preserving the accuracy that makes RPCholesky valuable.
The broader vision is a future where kernel based machine learning, with its strong theoretical foundations and interpretability advantages, can compete with deep neural networks on large scale problems. RPCholesky is a significant step in that direction.
Science has always advanced through the interplay of clever ideas and careful proof. Sometimes the most powerful tools are elegant in their simplicity. RPCholesky — sample randomly, but not blindly; exploit information, but not greedily — is a reminder that the right balance can make an enormous difference.
Publication Details:
Journal: Communications on Pure and Applied Mathematics
Publisher: Wiley Periodicals LLC
Year: 2024 (accepted online); 2025 (published in print)
DOI: https://doi.org/10.1002/cpa.22234
Credit & Disclaimer: This article is based on the peer-reviewed research paper. All scientific facts, findings, and conclusions presented in this article are drawn directly from that original work. Readers are strongly encouraged to consult the full research article for complete data, mathematical proofs, and detailed findings.






