Developing powerful AI language models has become an arms race of computational budgets. Training a state-of-the-art system can cost millions of dollars in electricity and hardware. But what if you could skip all that by simply combining existing models in clever ways?
That's precisely what a team in Tokyo demonstrated. Their method uses evolutionary algorithms — the same principles that drive biological evolution — to automatically discover how to merge different AI models into new ones with unexpected capabilities.
Think of it as breeding. But instead of crossing flowers to get new colors, they're crossing language models to get new abilities.
The approach challenges a fundamental assumption in AI development: that you need massive datasets and enormous computing power to create capable models. Instead, this work taps into the collective intelligence already embedded in open-source models scattered across the internet.
The Alchemy Problem
Model merging isn't new. Researchers have known for years that you can combine two AI models fine-tuned from the same base model and get something useful. Average their weights together — think of weights as the learned knowledge encoded in numbers — and sometimes the result performs better than either parent.
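To make the basic idea concrete, here is a minimal sketch of weight averaging. The dictionaries and the `average_merge` helper are illustrative stand-ins, not the method from the paper; real models have millions of parameters per layer, but the operation is the same elementwise interpolation.

```python
import numpy as np

# Toy "models": each is a dict mapping layer names to weight arrays.
# Both are assumed to be fine-tunes of the same base, so shapes match.
model_a = {"layer0": np.array([[1.0, 2.0], [3.0, 4.0]])}
model_b = {"layer0": np.array([[3.0, 2.0], [1.0, 0.0]])}

def average_merge(a, b, alpha=0.5):
    """Interpolate two weight dicts: alpha of a, (1 - alpha) of b."""
    return {name: alpha * a[name] + (1 - alpha) * b[name] for name in a}

merged = average_merge(model_a, model_b)
# merged["layer0"] == [[2.0, 2.0], [2.0, 2.0]]
```

The single `alpha` here is exactly the kind of knob that, in naive merging, a human picks by trial and error.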
But the community treated it like alchemy. Success depended on intuition, trial and error, domain expertise. Given the explosion of available models, human intuition hits a ceiling fast.
The research team saw an opening. Evolution excels at exploring vast search spaces without requiring deep understanding of the underlying mechanisms. Natural selection doesn't need to understand genetics to produce complex organisms. Why should AI development require full understanding of model internals?
Two Spaces, One Framework
The methodology operates across two distinct dimensions. First: parameter space. This involves mixing the actual numerical weights of different models layer by layer. Imagine blending paint — red and blue make purple, but the exact shade depends on proportions and technique.
Second: data flow space. This preserves the original model weights intact but optimizes the path information takes as it moves through the network. Picture a token entering layer five of Model A, then jumping to layer twelve of Model B, then back to layer eight of Model A. The evolutionary algorithm searches for the sequence that produces the best results.
The two approaches are orthogonal: combining them yields even stronger performance.
For parameter space merging, the team enhanced existing techniques called TIES-Merging and DARE. These methods intelligently handle conflicts when combining models and sparsify changes to focus on what matters. The evolutionary algorithm then optimizes mixing parameters for each layer — how much of each source model to include, how aggressively to filter out small changes.
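The following sketch shows the DARE-style piece of this: each source model contributes a "delta" from the shared base, a random fraction of that delta is dropped and the survivors rescaled, and a per-layer mixing weight controls how much each source contributes. The function names and the toy arrays are hypothetical; only `mix_weights` and `drop_rate` correspond to the kind of per-layer parameters the evolutionary search would tune.

```python
import numpy as np

rng = np.random.default_rng(0)

def dare_sparsify(delta, drop_rate, rng):
    """DARE-style sparsification: randomly drop a fraction of the
    delta's entries, rescale survivors by 1/(1 - drop_rate)."""
    mask = rng.random(delta.shape) >= drop_rate
    return delta * mask / (1.0 - drop_rate)

def merge_layer(base, deltas, mix_weights, drop_rate, rng):
    """Merge one layer: base weights plus a weighted sum of
    sparsified task deltas. mix_weights is what evolution tunes."""
    out = base.copy()
    for w, d in zip(mix_weights, deltas):
        out += w * dare_sparsify(d, drop_rate, rng)
    return out

base = np.zeros((4, 4))
deltas = [np.ones((4, 4)), 2 * np.ones((4, 4))]
merged = merge_layer(base, deltas, mix_weights=[0.6, 0.4],
                     drop_rate=0.5, rng=rng)
```

In the real method there is one set of mixing parameters per layer, so the search space is small enough for evolution but far too large for hand-tuning.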
Data flow space merging proved trickier. The search space explodes astronomically. With 64 total layers across two models and a path length of 60 steps, you're looking at 65 to the power of 60 possible configurations. Intractable, even for evolution.
The solution: constrain the search cleverly. Lay out all layers sequentially, repeat them a few times, then use an indicator array to include or exclude each position. Add scaling weights to handle distribution shifts when jumping between models. Suddenly the space becomes searchable.
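The indicator-array trick can be sketched in a few lines. Lay out the layers of both models in sequence, repeat that sequence, and let a binary genome decide which positions survive; the surviving positions, in order, are the inference path. The two "models" here are toy three-layer stand-ins, and this sketch omits the scaling weights applied when the path jumps between models.

```python
import numpy as np

def build_path(num_layers_a, num_layers_b, repeats, indicator):
    """Turn a binary indicator array into an inference path.
    Candidate slots: all layers of model A, then model B,
    repeated `repeats` times; indicator[i] == 1 keeps slot i."""
    candidates = [("A", i) for i in range(num_layers_a)] + \
                 [("B", i) for i in range(num_layers_b)]
    candidates = candidates * repeats
    assert len(indicator) == len(candidates)
    return [layer for bit, layer in zip(indicator, candidates) if bit]

# 3 layers per toy model, sequence repeated twice -> 12 candidate slots.
indicator = np.array([1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0])
path = build_path(3, 3, 2, indicator)
# path == [("A", 0), ("A", 1), ("B", 0), ("A", 1), ("A", 2), ("B", 1)]
```

The search now operates over fixed-length binary strings (plus the scaling weights) instead of arbitrary layer sequences, which is the kind of genome evolutionary algorithms handle well.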
A Japanese Math Model Nobody Asked For
To test their framework, the researchers set themselves an unusual challenge: create a Japanese language model capable of mathematical reasoning.
Why unusual? Because no training data exists for that specific combination at scale. Japanese language models exist. English math models exist. But merging across such different domains — language and subject matter — seemed like wishful thinking.
They selected three source models built on the same architecture: one Japanese language model and two English math specialists. The evolutionary algorithm optimized on translated math problems, searching for the configuration that maximized both correct numerical answers and proper Japanese explanations.
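The optimization loop itself is conceptually simple. The article does not spell out which evolutionary algorithm is used, and a serious implementation would reach for an off-the-shelf optimizer such as CMA-ES; the minimal (1+1) evolution strategy below just shows the shape of the loop. The quadratic `fitness` function is a stand-in for the real objective (accuracy and answer quality on translated math problems), and every name here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(mix_weights):
    """Stand-in for the real objective (benchmark accuracy).
    Here: a toy quadratic with its optimum at [0.8, 1.2]."""
    target = np.array([0.8, 1.2])
    return -np.sum((mix_weights - target) ** 2)

# Simple (1+1) evolution strategy: mutate the best-so-far genome,
# keep the child only if it scores at least as well.
genome = np.array([0.5, 0.5])
best = fitness(genome)
for _ in range(300):
    child = genome + rng.normal(0.0, 0.1, size=genome.shape)
    score = fitness(child)
    if score >= best:
        genome, best = child, score
```

In the actual method the genome encodes all the merge parameters at once (mixing weights, sparsification rates, path indicators, scales), and each fitness evaluation means building a merged model and scoring it on held-out problems.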
The results startled everyone involved.
The merged model achieved 52% accuracy on a Japanese math benchmark where the source Japanese model managed only 9.6% and the English math models scored 18.4% and 30%. Not just incremental improvement. Emergent capability.
More surprising: when evaluated on general Japanese language tasks, the merged model scored 70.5 — beating not only its source models but also previous state-of-the-art Japanese models with 70 billion parameters. Ten times larger. The 7-billion-parameter hybrid outperformed giants.
Analysis revealed why. The evolutionary search discovered that all three source models contributed essential information, but not through simple averaging. The optimized weights summed to nearly 2, suggesting amplification rather than interpolation. The algorithm learned to boost contributions rather than blend them.
The data flow experiments confirmed the Japanese language model's central role. When merging paths through the network, the algorithm consistently included most layers from the Japanese model early on, then alternated strategically with the math model. Removing the learned scaling parameters — forcing them to 1 — caused performance to drop over 20%.
Cross-Domain Fusion
Emboldened, the team tackled vision.
Vision-language models combine image understanding with text generation. They typically consist of three components: a vision encoder that processes images, a language model that generates descriptions, and a projection network that bridges them.
The researchers merged a Japanese language model with an English vision-language model, focusing only on the language component while keeping the vision encoder frozen.
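A toy sketch makes the division of labor clear: the vision encoder and projector pass through unchanged, and only the language-model weights are interpolated. The matrices are random stand-ins and the single `alpha` is illustrative; in the real method the mixing is evolved per layer rather than fixed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the three components of a vision-language model.
W_vision = rng.normal(size=(16, 8))   # frozen vision encoder
W_project = rng.normal(size=(8, 4))   # projection into LM space
W_lm_en = rng.normal(size=(4, 4))     # English VLM's language weights
W_lm_ja = rng.normal(size=(4, 4))     # Japanese LM's weights

# Merge only the language component; the vision side stays untouched.
alpha = 0.5
W_lm_merged = alpha * W_lm_ja + (1 - alpha) * W_lm_en

def forward(image_feats):
    """image -> frozen vision encoder -> projector -> merged LM."""
    v = image_feats @ W_vision
    p = v @ W_project
    return p @ W_lm_merged

out = forward(rng.normal(size=(1, 16)))
```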
They created two new Japanese benchmark datasets for evaluation. One tested general visual question answering in Japanese. The other, more demanding, assessed culturally specific content — images of Japanese scenes, objects, traditions.
Again: emergence. The merged vision-language models significantly outperformed baselines on both benchmarks. On culturally specific content, the improvement was dramatic. The evolved model generated detailed, accurate descriptions of Japanese cultural elements where the English baseline struggled.
Simple merging without evolution failed. Only the evolutionary approach discovered configurations that successfully integrated cross-domain capabilities.
Beyond Proof of Concept
After the research appeared online, the broader community ran with it.
Other teams applied the method to image generation models, successfully merging systems trained with completely different protocols. One notable case combined a specialized fast-generation model with standard models, despite their incompatible training procedures. It worked.
The technique has been implemented in widely-used open-source tools, making it accessible to anyone. Multiple derived models have emerged, each pushing boundaries in different directions.
This points toward a broader shift. Foundation model development currently follows an expensive paradigm: massive datasets, enormous compute budgets, months of training. Institutions and governments pour resources into building custom models from scratch.
But the evolutionary approach suggests an alternative. Leverage the rich ecosystem of existing open-source models. Combine them in novel ways discovered through automated search. Prototype quickly and cheaply before committing to full-scale development.
The implications extend beyond efficiency. As open-source models proliferate across languages, domains, and specialties, the space of possible combinations grows explosively. Human intuition cannot navigate that space. Evolution can.
What This Isn't
The merged models inherit their source models' limitations. The researchers noted instances of logically incoherent responses. Without additional instruction tuning or alignment, outputs can be factually wrong.
More fundamentally, evolutionary model merging doesn't replace training from scratch. It won't discover capabilities absent from any source model. It's about recombination, not creation ex nihilo.
Think of it as selective breeding versus genetic engineering. You can produce remarkable hybrid vigor by crossing existing varieties, but you cannot introduce genes that don't exist in your breeding stock.
The method also requires human decisions upfront: which source models to select, which tasks to optimize for, how to define success metrics. Those choices shape outcomes profoundly.
The Long Evolution Ahead
The research team views this as early days. They're exploring model swarms — populations of diverse models, each with specialized niches, continuously producing new internal world models through interaction.
They're investigating whether evolution can select source models automatically from vast pools of candidates rather than requiring human curation.
They're examining whether the approach scales to completely independent models never trained on the same base, resolving deeper incompatibilities.
The broader vision: foundation models that improve themselves through evolutionary dynamics rather than gradient descent. A fundamentally different development paradigm where progress emerges from population-level search rather than individual optimization.
Natural evolution produced human intelligence through selection operating on variation across billions of organisms over millions of years. Artificial evolution might produce machine intelligence through selection operating on models across digital repositories over hours.
Different substrate. Different timescale. Same fundamental principle: explore the space of possibilities, keep what works, iterate.
The surprising power of this approach — creating small models that outperform much larger ones, discovering cross-domain capabilities that seem to require explicit training — hints at something important. Maybe the path to more capable AI doesn't always run through bigger training runs and larger budgets.
Sometimes it runs through smarter recombination of what already exists.
Credit & Disclaimer: This article is a popular science summary written to make peer-reviewed research accessible to a broad audience. All scientific facts, findings, and conclusions presented here are drawn directly and accurately from the original research paper. Readers are strongly encouraged to consult the full research article for complete data, methodologies, and scientific detail. The article can be accessed through https://doi.org/10.1038/s42256-024-00975-8