The algorithm had a problem. Given six quintillion possible neural network architectures to evaluate, how could it find the best one without spending millennia training each candidate?
Neural architecture search aims to automate the design of deep learning systems. Feed it a dataset, let it run, and out comes a custom-tailored network optimized for that specific task. No human expertise required. Except there's a catch: the computational cost has historically been staggering. Early methods burned through 22,400 GPU-days—roughly sixty years of continuous computation on a single graphics processor—just to discover one architecture.
The research community responded by narrowing the search. Instead of designing entire networks, they focused on small building blocks called cells. Discover a good cell, stack it repeatedly, and you have a complete network. This modular approach slashed search times from thousands of GPU-days to just four.
But it created a new problem.
The Automation Paradox
Modular search leaves critical decisions unmade. Once you've discovered a cell, how many should you stack? How wide should each layer be? These questions—the network's macro architecture—still require manual trial and error. The very automation promised by neural architecture search remains incomplete.
There's another issue. Modular search spaces have become so constrained that even randomly sampled architectures perform well. When the worst network achieves 96.18% accuracy and the best reaches 97.56%, the search space lacks diversity. Good results come quickly, but the space cannot adapt to datasets of varying difficulty.
Researchers at the University of Cyprus decided to revisit the fundamentals. They designed a search space with extreme variance: worst network at 76.11% accuracy, best at 94.65%. Wide enough to offer tiny networks for simple tasks and complex architectures for challenging ones. Yet still navigable by an efficient algorithm.
Their framework addresses three interconnected challenges: designing the search space, evaluating candidates fairly, and navigating the space efficiently.
Trimming Without Sacrificing
The team started by analyzing which architectural variables matter most. Early networks followed a Conv-Pool-FC paradigm: convolutional layers extract features, pooling layers downsample, fully connected layers make predictions. Modern fully convolutional networks eliminate this rigid structure. Pooling layers? Replace them with strided convolutions. Fully connected layers? Use global pooling instead.
This simplification drops three search variables immediately.
They made another strategic choice regarding network width. Most methods search for the number of channels in every layer independently, creating combinatorial explosion. Instead, they search only for the initial layer width and apply a fixed doubling rule: whenever spatial dimensions halve, channels double. This pattern, borrowed from foundational architectures, maintains diversity while reducing complexity.
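Under this rule, the only free width variable is the stem width; every later stage is determined by it. A minimal sketch (the stage count and the exact halving points are illustrative assumptions, not taken from the paper):

```python
def width_schedule(initial_width, num_stages):
    """Channels double each time spatial resolution halves.

    Only `initial_width` is searched; the rest of the schedule is fixed,
    so the width search space collapses to a single variable.
    """
    return [initial_width * (2 ** stage) for stage in range(num_stages)]

# e.g. a hypothetical 4-stage network starting at 16 channels
print(width_schedule(16, 4))  # [16, 32, 64, 128]
```

One searched number thus fixes the entire width profile, which is what keeps the combinatorics manageable.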
The resulting search space includes depth, width, operation type (separable or standard convolution), and kernel size (3×3 or 5×5). Four variables instead of eight. With depth ranging from 5 to 100 layers and width from 16 to 128 channels, the space still contains approximately 6.16 × 10^18 possible architectures.
For comparison, the widely adopted DARTS search space also contains about 10^18 candidates but has that narrow 96-97% accuracy range. The new space trades manual constraint for genuine diversity.
Training Isn't Universal
Here's where things get interesting. To rank candidate architectures, you need to know their performance. But training each network fully is prohibitively expensive—exactly the problem neural architecture search tries to solve.
Previous solutions approximated performance through weight sharing, where subnetworks inherit parameters from a large over-parameterized supernet. Fast, but inaccurate. Another approach: train everything briefly, rank based on early performance. Also fast, but misleading.
The researchers discovered something unexpected. They trained 240 randomly sampled networks for 50 epochs and examined rank correlations. Rankings taken after just one epoch of training correlated at 0.65 with the final rankings after 50 epochs. Decent, but not great. Ranking networks by parameter count alone fared worse: a correlation of 0.49. Bigger networks weren't necessarily better.
But when they sorted 50 networks by parameter count and trained each for progressively more epochs—smallest for one epoch, next for two epochs, and so on—the correlation jumped to 0.91 with final performance.
The insight? Networks with more parameters benefit from longer training. A large network trained briefly might underperform a small network trained longer. Comparing them fairly requires architecture-aware training protocols.
They implemented dynamic learning rankings: when a candidate has more parameters than its predecessor, train it for an additional epoch. When a candidate has fewer parameters, train it even longer, so a superior architecture can prove itself despite reduced capacity. When parameter counts are similar, keep the training budgets identical.
This approach improved ranking correlation by 20% compared to static training protocols, while still avoiding full training for every candidate.
Divide and Conquer
The search strategy itself splits the problem. Macro search discovers depth and width. Micro search refines operation types and kernel sizes.
The algorithm initializes with minimum depth, maximum width, and all layers using separable 3×3 convolutions. It grows the network layer by layer, training each variant for progressively more epochs (per the dynamic ranking scheme). Layers are added until accuracy plateaus or drops repeatedly.
Then it prunes. The network shrinks in width until reaching the minimum or until accuracy degrades. Since pruning reduces parameters, the algorithm compensates with extended training for each candidate.
After macro search establishes the architectural skeleton, micro search fine-tunes it. For each layer, the algorithm tries replacing separable convolutions with standard ones and updating kernel sizes from 3×3 to 5×5. But there's a constraint: whenever a layer uses standard convolution or larger kernels (which increase parameters), channel counts are reduced to keep total parameters approximately constant. This ensures performance gains come from better architecture, not simply from more learnable parameters.
The total number of evaluated candidates? Just 2×D_f + D' + W', where D_f is the final depth and D' and W' are the numbers of candidates evaluated during the depth and width phases. Instead of billions, perhaps a few hundred.
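Plugging hypothetical values into that formula shows how small the budget stays (the numbers below are made up for illustration, and reading the 2×D_f term as micro search's two trials per layer is an interpretation, not a quote from the paper):

```python
def total_candidates(final_depth, depth_evals, width_evals):
    # 2 * D_f micro-search trials (assumed: two modifications per layer)
    # plus the macro-search evaluations for depth and width
    return 2 * final_depth + depth_evals + width_evals

# e.g. final depth 20, 25 depth-search candidates, 10 width-search candidates
print(total_candidates(20, 25, 10))  # 75 networks, versus ~6e18 in the space
```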
Results Across Domains
On CIFAR-10, the most widely used benchmark, the framework discovered networks achieving 4.09% error with just 0.46 million parameters in 0.24 GPU-days. The tiny model matched the efficiency of other compact networks while being discovered 18 times faster. The mobile-sized model achieved 3.17% error with 2.49 million parameters in 0.43 GPU-days, outperformed only by one global search method whose search took 11 times longer.
For CIFAR-100, both tiny and mobile models outperformed the most closely related global search method in accuracy, parameter efficiency, and search speed. The tiny model: 105 times smaller, 1.81 percentage points more accurate, 15 times faster to discover.
On EMNIST balanced, the tiny model beat the state-of-the-art WaveMix network despite being 6 times smaller. On KMNIST, it exceeded the previous best (a 300-million-parameter transformer) by 0.67% while using just 0.42 million parameters.
The framework achieves state-of-the-art results on these handwritten character datasets while remaining competitive on standard image classification benchmarks.
From Theory to Faces
To test transferability beyond toy datasets, the researchers adapted their framework for face recognition—a domain with distinct requirements and established architectures.
They modified the search space to use inverted ResNet structures, the backbone of leading face recognition systems. The fundamental variables remained unchanged: depth, width, operations, kernels. Only the architectural template changed.
Training on CASIA-WebFace (500,000 images at 112×112 pixels), they discovered two models. The small model: 58 layers with 256 final channels, totaling 14.6 million parameters. The medium model: 34 layers with 512 final channels, 37.1 million parameters.
On high-quality face verification datasets (LFW, CFP-FP, CPLFW, AgeDB, CALFW), the small model outperformed Adaface's ResNet-18 baseline while being half the size. Average accuracy: 93.62% versus 92.72%. The medium model beat Adaface's ResNet-50 on four of five datasets despite having 1.3 times fewer parameters.
For face identification on the challenging low-quality Tinyface dataset, both discovered networks achieved better rank-1, rank-5, and rank-20 retrieval than Adaface's corresponding baselines.
The search required 4 to 6 GPU-days due to the larger dataset and higher resolution—longer than image classification tasks but still practical. The team stopped at macro search only, suggesting micro refinement could yield further gains.
What Makes Networks Tick
The work reveals something subtle about network comparison. Training protocols matter as much as architecture. Comparing a large network trained briefly against a small network trained extensively produces misleading rankings. Yet most approximation methods apply uniform training across all candidates.
The dynamic ranking mechanism respects each network's learning capacity. Larger networks get more training to reach their potential. Smaller networks get even more to demonstrate their efficiency. Similar-sized networks train identically. This architecture-aware approach yields more accurate rankings with less total computation than training everything fully.
The macro-micro split offers another advantage: it discovers the right balance between network complexity and task difficulty before fine-tuning details. Depth and width establish the outer skeleton. Operation types and kernel sizes refine the interior structure. Each phase operates in a greatly reduced space.
The Efficiency Frontier
Neural architecture search has evolved through three phases. First generation: enormous search spaces, discrete optimization, full training for every candidate. Computationally ruinous but comprehensive. Second generation: constrained modular spaces, parameter sharing, gradient-based methods. Fast but incomplete, lacking macro architecture automation and architectural diversity. Third generation: diverse global spaces, architecture-aware evaluation, structured decomposition.
This framework occupies that third category. It automates end-to-end network design while remaining 2 to 4 times faster than previous global search methods. The discovered architectures compete with or exceed manually designed networks on multiple benchmarks.
The transferability to face recognition demonstrates that the design principles generalize beyond theory. The same search space philosophy—minimal variables, maximum diversity—adapts to different architectural templates and application domains.
Limitations remain. Large datasets and high-resolution images slow the search since candidates train on full data. Future work might explore data subsets for evaluation. The framework currently separates architecture search from training protocol search, but jointly optimizing both could yield better results.
The gap between automation promise and automation reality has narrowed. Network design for a given task no longer requires either prohibitive computation or manual intervention at critical junctures.
It just requires teaching the algorithm how to search smarter.
Credit & Disclaimer: This article is a popular science summary written to make peer-reviewed research accessible to a broad audience. All scientific facts, findings, and conclusions presented here are drawn directly and accurately from the original research paper. Readers are strongly encouraged to consult the full research article for complete data, methodologies, and scientific detail. The article can be accessed through https://doi.org/10.1007/s42979-025-03790-z