The Discovery No One Expected
For twenty years, mathematicians stared at the same stubborn number. The cap set problem—a deceptively simple puzzle about arranging points in space—had refused to budge beyond a particular threshold. Then a machine did what humans couldn't.
Not by thinking harder. By thinking differently.
The breakthrough came from FunSearch, a system that marries the creativity of large language models with the relentless optimization of evolutionary algorithms. Unlike tools that hallucinate plausible-sounding nonsense, this one produces verifiable mathematical truth. It discovered new constructions in extremal combinatorics and better algorithms for practical scheduling problems. The results can be checked, proven, and used.
When Three Points Make a Line
Picture a multidimensional grid where each coordinate takes one of three values: 0, 1, or 2. Now try to place marks on this grid under one iron rule: no three distinct marks may ever sum to zero modulo 3, coordinate by coordinate. (Equivalently, no three marks may lie on a common line.) This is the cap set problem.
Simple to state. Viciously hard to solve.
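The rule is at least easy to check by machine. A minimal sketch of such a verifier (toy code, not from the paper, and brute force, so practical only for small sets):

```python
from itertools import combinations

def is_cap_set(vectors, n):
    """True if no three distinct vectors sum to zero modulo 3 in every
    coordinate. Brute force: fine for small sets only."""
    for a, b, c in combinations(vectors, 3):
        if all((a[i] + b[i] + c[i]) % 3 == 0 for i in range(n)):
            return False
    return True

# In two dimensions these four points are a valid cap set...
print(is_cap_set([(0, 0), (0, 1), (1, 0), (1, 1)], 2))  # True
# ...while 0, 1, 2 on one line are not: 0 + 1 + 2 = 3 = 0 (mod 3).
print(is_cap_set([(0,), (1,), (2,)], 1))  # False
```

Checking a candidate set is cheap; finding a large one is the hard part.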
The problem matters beyond its elegant simplicity. It connects to classical questions about prime numbers and arithmetic progressions. Any progress here ripples outward to other areas—when researchers found tighter upper bounds for cap sets in 2017, the techniques immediately unlocked solutions to seemingly unrelated problems, including the notorious sunflower conjecture.
Cap sets belong to extremal combinatorics, a field obsessed with finding maximum or minimum sizes of mathematical objects under constraints. For cap sets in eight dimensions, the best-known construction contained 496 vectors. That record stood unbroken.
The Evolution of Ideas
FunSearch doesn't think like a mathematician or a traditional computer program. It thinks like an ecosystem.
The system starts with a skeleton—a basic program structure containing boilerplate code. But the crucial logic, the part that actually solves the problem, remains blank. This is what FunSearch must discover.
A large language model generates candidate solutions. These programs get executed and scored. The best performers survive. They're fed back into the model as examples, which generates new variations. These compete. The winners survive again. Evolution in action, but for algorithms instead of organisms.
Here's the clever bit: FunSearch maintains separate populations, called islands, that evolve independently. Every four hours, the weakest islands get wiped out and reseeded with the strongest programs from successful islands. This prevents the whole system from getting stuck in dead ends—a problem that plagues traditional evolutionary methods.
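The loop and the island resets can be sketched in miniature. Everything below is a stand-in: numbers play the role of programs, a random nudge plays the role of the language model's mutations, and the population sizes and schedules are invented for illustration:

```python
import random

def evolve(seeds, mutate, score, islands=4, rounds=20, reset_every=5):
    """Toy island evolution: each island keeps its best few candidates;
    periodically the weakest islands are wiped and reseeded from the best."""
    pops = [list(seeds) for _ in range(islands)]
    for r in range(1, rounds + 1):
        for pop in pops:
            parent = max(pop, key=score)      # favor high scorers
            pop.append(mutate(parent))        # the LLM plays this role in FunSearch
            pop.sort(key=score, reverse=True)
            del pop[8:]                       # keep only the best few
        if r % reset_every == 0:              # the "every four hours" reset
            pops.sort(key=lambda p: score(p[0]), reverse=True)
            half = len(pops) // 2
            for i in range(half, len(pops)):
                pops[i] = list(pops[i - half])  # reseed weak islands from strong ones
    return max((p[0] for p in pops), key=score)

# Toy stand-ins: "programs" are numbers, mutation is a random nudge,
# and the score rewards closeness to a target of 42.
random.seed(0)
best = evolve([0.0], lambda x: x + random.uniform(-3, 3), lambda x: -abs(x - 42))
```

The reseeding step is what keeps any single dead end from capturing the whole search.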
The system generated roughly a million candidate programs to reach the reported results. Not thinking deeply for years, but trying vast numbers of variations quickly.
Breaking Through
For the eight-dimensional cap set problem, FunSearch found something new: a construction with 512 vectors.
Sixteen more than anyone had managed in two decades. The improvement might seem modest. It isn't. Finding larger cap sets gets exponentially harder as dimensions increase. The search space for eight dimensions contains roughly 3^1600 possibilities, a number so large that brute force remains hopeless even with every computer on Earth.
Previous methods imposed restrictions on the search space, potentially missing optimal solutions. FunSearch explored the full space using a greedy algorithm skeleton. It iteratively added vectors with the highest priority scores, as determined by an evolved function, while ensuring no three vectors summed to zero.
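That greedy skeleton is easy to sketch. The code below is a toy rendition, not the paper's implementation, and the priority function passed in is a trivial placeholder for the function FunSearch actually evolves:

```python
from itertools import product

def greedy_cap_set(n, priority):
    """Greedy skeleton: visit all 3^n vectors in priority order, adding
    each one that keeps the cap-set property (no three sum to 0 mod 3)."""
    candidates = sorted(product(range(3), repeat=n), key=priority, reverse=True)
    cap = []
    for v in candidates:
        # v is blocked if some pair a, b already in the set completes a line
        blocked = any(
            all((a[i] + b[i] + v[i]) % 3 == 0 for i in range(n))
            for j, a in enumerate(cap) for b in cap[j + 1:]
        )
        if not blocked:
            cap.append(v)
    return cap

# Even a trivial priority finds a maximum cap set (size 4) in two dimensions.
print(len(greedy_cap_set(2, lambda v: sum(v))))  # prints 4
```

The skeleton guarantees every output is a valid cap set; only the priority function, the part evolution rewrites, determines how large that set gets.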
The system didn't just discover the set of 512 vectors. It discovered a program that generates them. By inspecting this code, researchers extracted the underlying construction pattern—including a concept they call "reflections" that determines which vectors belong in the set based on symmetries between positions.
From Pure Math to Practical Packing
FunSearch proved its versatility on an entirely different problem: online bin packing.
Imagine items arriving one by one, each needing to fit in fixed-size bins. You must assign each item immediately, without knowing what comes next. How do you minimize wasted space? This isn't academic. Data centers assign computing jobs to servers. Factories cut materials. Logistics companies load trucks. These all reduce to bin packing.
Two classic heuristics dominate: first fit (use the first bin with enough space) and best fit (use the bin where the item leaves the least remaining space). Both date back decades. Both work reasonably well.
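Both classics are a few lines each. A minimal sketch, tracking each bin by its remaining space (illustrative code with made-up example items):

```python
def first_fit(items, capacity):
    """Place each item in the first bin with room; open a new bin otherwise."""
    bins = []  # remaining space in each open bin
    for item in items:
        for i, space in enumerate(bins):
            if item <= space:
                bins[i] -= item
                break
        else:
            bins.append(capacity - item)
    return len(bins)

def best_fit(items, capacity):
    """Place each item where it leaves the least space; new bin otherwise."""
    bins = []
    for item in items:
        fits = [i for i, space in enumerate(bins) if item <= space]
        if fits:
            tightest = min(fits, key=lambda i: bins[i] - item)
            bins[tightest] -= item
        else:
            bins.append(capacity - item)
    return len(bins)

items = [4, 8, 1, 4, 2, 1]  # invented example, bin capacity 10
print(first_fit(items, 10), best_fit(items, 10))  # both need 2 bins here
```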
FunSearch discovered better ones.
On standard benchmark datasets, the new heuristics outperformed both classics across all problem sizes. On one test suite following real-world distributions, FunSearch's solution used only 0.03% excess bins for large instances—vastly better than first fit's 4.00% or best fit's 3.79%.
The discovered heuristics follow a counterintuitive strategy. Instead of always choosing the tightest fit, they pack items into tight spaces only when the fit is very tight. Otherwise, they prefer bins with more room. This avoids creating small gaps unlikely to be filled later—a subtle insight that emerges naturally from evolution.
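That qualitative strategy can be imitated with a hand-written priority function. This is a caricature of the discovered heuristics, not the evolved code, and the thresholds are invented:

```python
def tight_or_roomy(item, space, capacity):
    """Caricature of the discovered strategy: prize near-perfect fits,
    otherwise prefer roomy bins so small awkward gaps never form.
    Thresholds and scores are invented for illustration."""
    leftover = space - item
    if leftover == 0:
        return 1000.0              # perfect fit: always take it
    if leftover <= 0.05 * capacity:
        return 100.0               # very tight fit: still attractive
    return float(space)            # otherwise, the roomier the better

def pack(items, capacity, priority):
    """Online packing driven by a priority function over feasible bins."""
    bins = []  # remaining space per bin
    for item in items:
        fits = [i for i, s in enumerate(bins) if item <= s]
        if fits:
            chosen = max(fits, key=lambda i: priority(item, bins[i], capacity))
            bins[chosen] -= item
        else:
            bins.append(capacity - item)
    return len(bins)

print(pack([3, 7, 3, 7], 10, tight_or_roomy))  # prints 2
```

Note that swapping in a different priority function changes the whole packing behavior without touching the driver, which is exactly the separation FunSearch's skeleton exploits.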
The Asymptotic Frontier
Beyond concrete constructions, FunSearch pushed theoretical boundaries.
Mathematicians care not just about specific cap set sizes but about capacity—how fast the largest possible cap sets grow as dimensions increase. The theoretical upper bound stands at 2.756. Before this work, the best lower bound was 2.2180, established through painstaking construction of objects called admissible sets using specialized solvers.
FunSearch discovered an admissible set containing over 237,000 vectors that implies a new lower bound: 2.2202.
The system achieved this by first finding a particular symmetric admissible set. Researchers noticed this symmetry in the generated code and realized they could search the much smaller space of symmetric solutions directly. This human-machine feedback loop—where the AI suggests solutions, humans extract patterns, and those patterns guide further search—represents a new mode of mathematical discovery.
Why Programs Matter More Than Solutions
Most computational search methods output raw answers: lists of numbers, vectors, or assignments. FunSearch outputs programs.
This distinction transforms what's possible.
Programs scale. The admissible set with 237,984 vectors would be unwieldy as a mere list. The program generating it spans just a few lines. Programs generalize. The same bin packing heuristic works on instances far larger than those seen during training. Programs explain. Reading the code reveals why a solution works, enabling mathematicians to extract principles and push further.
The eight-dimensional cap set program explicitly checks for reflections—symmetric patterns where values at position i match those at position -i. This insight emerged from examining generated code, not from first principles. It connects to earlier mathematical constructions, like the Hill cap, but arrives via a completely different route.
Standard deployment matters too. A bin packing heuristic expressed as ordinary code can run anywhere. No specialized hardware. No neural networks that resist formal verification. Just clear logic that engineers can audit, modify, and trust.
The Creative Machine
FunSearch uses Codey, a large language model trained on vast amounts of code. Crucially, it receives no special training for these problems. No fine-tuning. No examples of cap set constructions or bin packing solutions. Just the generic ability to write syntactically correct programs.
The language model acts less as an oracle and more as a source of diverse, occasionally interesting proposals. Constrained by the program skeleton to focus only on the critical logic, it suggests marginal improvements over existing solutions in the population. Combined with evolution and systematic evaluation, these marginal improvements compound into breakthroughs.
The approach works precisely because most interesting problems have structure. Random solutions fail catastrophically. Good solutions often admit concise descriptions. FunSearch implicitly searches for programs with low Kolmogorov complexity—solutions that compress well, captured by short algorithms rather than long lists.
The distributed implementation matters practically. Fifteen samplers generate programs using the language model. One hundred fifty evaluators run them in parallel, scoring performance. This architecture keeps costs manageable while enabling the massive parallelism that broadens the search.
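The sampler/evaluator split can be mimicked with a thread pool. A toy sketch with invented, much smaller numbers (the real system used 15 samplers and 150 evaluators, with a language model doing the sampling):

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(sample, evaluate, num_samples=30, evaluator_workers=6):
    """Toy sampler/evaluator split: one loop proposes candidates, a pool
    of workers scores them in parallel, and the best pair is returned."""
    candidates = [sample(i) for i in range(num_samples)]
    with ThreadPoolExecutor(max_workers=evaluator_workers) as pool:
        scores = list(pool.map(evaluate, candidates))
    return max(zip(scores, candidates))

# Stand-ins: "sampling" just emits integers; "evaluation" rewards
# closeness to 17.
score, program = run_pipeline(lambda i: i, lambda c: -abs(c - 17))
print(score, program)  # prints 0 17
```

Because evaluation dominates the cost, parallelizing the scoring side is what makes a million samples affordable.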
For some problems, like the eight-dimensional cap set, only four of 140 experiments succeeded. Others, like admissible sets and bin packing, improved on baselines every single run. This variability is honest. Discovery isn't deterministic.
The Limits and the Horizon
FunSearch thrives on problems with three properties: an efficient evaluation function, rich scoring feedback beyond binary pass/fail, and a skeleton that isolates the hard part. Theorem proving fits poorly: checking a proof is cheap, but the feedback is essentially pass or fail. Satisfiability-style search problems, where partial progress can be scored, fit all three.
The system cannot yet write complete programs from scratch or reason deeply about mathematical structure. The language model supplies variation, not insight. Evolution supplies optimization, not understanding. Together, they explore regions of program space that neither humans nor traditional methods naturally reach.
Current large language models sometimes confabulate, generating plausible but incorrect statements. FunSearch guards against this by construction: every generated program runs on concrete inputs and is scored, and programs that fail are discarded. No hallucination survives evaluation.
What This Means
Scientific discovery has, until now, required scientists. Computational tools assisted, but creativity remained human. FunSearch demonstrates a different path: systems that generate verifiable knowledge about established open problems.
The cap set construction stands as mathematical fact, independent of how it was found. The bin packing heuristics improve real-world scheduling, regardless of their origin. These aren't lucky guesses or retrieval from training data. The improvements can be verified, the constructions proved correct.
This opens domains. Many problems in operations research, materials science, and algorithm design admit efficient evaluation but resist analytical solution. FunSearch's generality—the same system tackling pure mathematics and practical optimization without problem-specific tuning—suggests broad applicability.
The feedback loop between human and machine deserves emphasis. Researchers examined FunSearch's output, noticed symmetries, used those insights to refine the search, and obtained better results. This iterative collaboration, neither fully automated nor purely manual, might characterize future mathematical work.
As language models grow more capable and inference costs drop, systems like FunSearch will run faster, explore more thoroughly, and tackle harder problems. Automatically tailored algorithms could become standard practice. Mathematical discovery might shift from isolated breakthroughs to systematic exploration.
We built tools that assist us. Then tools that augment us. Now, perhaps, tools that discover alongside us.
The question isn't whether machines can replace mathematical intuition. It's whether we can build systems that expand what problems we can solve, what patterns we can find, what territories of knowledge become accessible. FunSearch suggests we can.
The twenty-year barrier fell. Others will follow.
Credit & Disclaimer: This article is a popular science summary written to make peer-reviewed research accessible to a broad audience. All scientific facts, findings, and conclusions presented here are drawn directly and accurately from the original research paper. Readers are strongly encouraged to consult the full research article for complete data, methodologies, and scientific detail. The article can be accessed through https://doi.org/10.1038/s41586-023-06924-6