A doctor studies a mammogram, scanning through dozens of subtle patterns in the image. Some are harmless, others could signal trouble—but telling the difference isn’t always straightforward.
Machine learning can spot cancer with impressive accuracy. But here's the catch: these algorithms often function as black boxes, their reasoning opaque. In medicine, where lives hang in the balance, that opacity is unacceptable. Clinicians need to understand why a model flags certain images as suspicious.
A team from the University of Limerick and University of Galway has tackled this problem head-on, developing a method that not only identifies breast cancer more accurately but also reveals which features in mammograms truly matter.
The Triple Challenge
Breast cancer diagnosis through machine learning faces three formidable obstacles. First, datasets contain vastly more normal cases than cancerous ones, sometimes with positive cases representing just six percent of the total. This severe imbalance trains models to favor the majority class, potentially missing critical malignancies.
Second, high dimensionality. Medical images generate dozens of numerical features describing texture, shape, and intensity patterns. Many prove redundant or irrelevant. Sifting signal from noise becomes crucial.
Third, interpretability. Current explanation methods like SHAP and LIME carry significant drawbacks. SHAP calculations grow exponentially complex as features multiply, becoming computationally prohibitive. LIME suffers from sensitivity issues, producing inconsistent results depending on parameter choices and sample selection.
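To make the exponential blow-up concrete, here is a back-of-the-envelope count (an illustration, not a figure from the paper): exact Shapley values average each feature's marginal contribution over every coalition of the remaining features, so the number of model evaluations per feature doubles with every feature added.

```python
# Exact Shapley values for one feature average its marginal contribution
# over every subset of the other d-1 features: 2**(d-1) model evaluations.
# At 52 features (the mammography feature count), exact computation is hopeless.
evaluations = {d: 2 ** (d - 1) for d in (10, 30, 52)}
print(evaluations[52])  # 2251799813685248 coalitions per feature
```

In practice SHAP libraries use sampling or model-specific approximations, but the approximation quality and runtime still degrade as features multiply.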
The researchers addressed all three simultaneously.
Evolution Meets Medicine
Their solution employs Grammatical Evolution, an approach inspired by biological evolution. Unlike conventional feature selection methods that rely on predetermined mathematical relationships, Grammatical Evolution generates and tests solutions iteratively, keeping those that perform best and discarding weaker candidates. Think of it as natural selection for mathematical expressions.
The method works by creating symbolic expressions that combine features in various ways. Across thirty independent runs, the algorithm tracks which features appear most frequently in successful solutions. Features that consistently contribute to accurate cancer detection rise to the top.
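The frequency-tally idea can be sketched with a toy stand-in. The snippet below is not the paper's Grammatical Evolution system: it replaces grammar-guided expression evolution with a crude mutate-and-keep-if-better search over feature subsets, scored by AUC on synthetic data where only the first five features carry signal. The point it illustrates is the same, though: features that keep reappearing in the winners of many independent runs rise to the top of the tally.

```python
import numpy as np
from collections import Counter
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 600, 20
X = rng.normal(size=(n, d))
# Only features 0-4 actually carry signal; the rest are noise.
y = (X[:, :5].sum(axis=1) + rng.normal(size=n) > 0).astype(int)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

def fitness(subset):
    # Score a candidate feature subset by held-out AUC.
    clf = LogisticRegression(max_iter=1000).fit(Xtr[:, subset], ytr)
    return roc_auc_score(yte, clf.predict_proba(Xte[:, subset])[:, 1])

counts = Counter()
for run in range(30):                       # 30 independent runs, as in the paper
    run_rng = np.random.default_rng(run)
    subset = list(run_rng.choice(d, size=5, replace=False))
    best = fitness(subset)
    for _ in range(40):                     # crude evolutionary loop: mutate,
        cand = list(subset)                 # keep the mutant only if it improves
        cand[run_rng.integers(len(cand))] = int(run_rng.integers(d))
        cand = sorted(set(cand))
        score = fitness(cand)
        if score > best:
            subset, best = cand, score
    counts.update(subset)                   # tally features in this run's winner

top5 = [f for f, _ in counts.most_common(5)]
```

With real Grammatical Evolution the candidates are symbolic expressions generated from a grammar rather than bare index sets, but the aggregation step, counting feature occurrences across the best solutions of many runs, works as shown.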
This differs fundamentally from standard approaches. Logistic Regression assumes linear relationships between features and outcomes. Extreme Gradient Boosting builds decision trees that can become opaque as complexity increases. Grammatical Evolution, by contrast, evolves interpretable rules while capturing non-linear relationships.
To address class imbalance, the team developed STEM, combining three existing techniques. The Synthetic Minority Oversampling Technique creates artificial positive cases by interpolating between existing ones. Edited Nearest Neighbor removes noisy samples that confuse classification. Mixup blends pairs of samples from the same class to improve generalization. Together, these methods balance datasets without simply duplicating minority cases.
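A simplified sketch of the STEM pipeline follows, using plain NumPy stand-ins for each stage (real pipelines would typically reach for imbalanced-learn's SMOTE and EditedNearestNeighbours implementations; the data and parameters here are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def smote(X_min, n_new, k=5):
    # SMOTE: synthesize minority samples by interpolating between a
    # minority point and one of its k nearest minority neighbours.
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        j = rng.choice(np.argsort(d)[1:k + 1])
        out.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.array(out)

def enn(X, y, k=3):
    # Edited Nearest Neighbor: drop samples whose k nearest neighbours
    # mostly disagree with their label (likely noise near the boundary).
    keep = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]
        if (y[nbrs] == y[i]).sum() >= k / 2:
            keep.append(i)
    return X[keep], y[keep]

def mixup_same_class(X, y, n_new, alpha=0.2):
    # Mixup restricted to pairs from the same class, so labels stay hard.
    out_X, out_y = [], []
    for _ in range(n_new):
        c = rng.choice(np.unique(y))
        i, j = rng.choice(np.flatnonzero(y == c), size=2)
        lam = rng.beta(alpha, alpha)
        out_X.append(lam * X[i] + (1 - lam) * X[j])
        out_y.append(c)
    return np.array(out_X), np.array(out_y)

# Toy imbalanced data: 94 negatives, 6 positives (about 6% positive).
X = np.vstack([rng.normal(0, 1, (94, 4)), rng.normal(3, 1, (6, 4))])
y = np.array([0] * 94 + [1] * 6)

X_syn = smote(X[y == 1], n_new=88)                       # balance the classes
X_b = np.vstack([X, X_syn])
y_b = np.concatenate([y, np.ones(88, dtype=int)])
X_b, y_b = enn(X_b, y_b)                                 # prune noisy samples
X_mix, y_mix = mixup_same_class(X_b, y_b, n_new=20)      # blend for robustness
X_out, y_out = np.vstack([X_b, X_mix]), np.concatenate([y_b, y_mix])
```

The ordering matters: oversampling first, then cleaning, then blending, so that Mixup operates on an already balanced and denoised set.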
Testing on Two Fronts
The study analyzed two datasets. The Digital Database for Screening Mammography contained 876 normal and 152 malignant images captured from two viewing angles. Each image was divided into three overlapping segments, with thirteen textural features extracted from each segment using four orientations, yielding fifty-two features per image.
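The textural features described here are Haralick-style statistics computed from grey-level co-occurrence matrices (GLCMs), one matrix per orientation. The minimal sketch below computes just two of the thirteen descriptors from scratch for clarity; libraries such as scikit-image provide full, fast implementations.

```python
import numpy as np

def glcm(img, offset, levels=8):
    # Grey-level co-occurrence matrix: how often intensity i sits next to
    # intensity j at the given pixel offset (i.e., orientation).
    dr, dc = offset
    m = np.zeros((levels, levels))
    rows, cols = img.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                m[img[r, c], img[r2, c2]] += 1
    return m / m.sum()                      # normalize to joint probabilities

def haralick_subset(p):
    # Two of the thirteen Haralick descriptors used in the paper.
    i, j = np.indices(p.shape)
    contrast = ((i - j) ** 2 * p).sum()     # local intensity variation
    asm = (p ** 2).sum()                    # angular second moment (uniformity)
    return contrast, asm

rng = np.random.default_rng(0)
img = rng.integers(0, 8, size=(32, 32))            # one quantised image segment
offsets = [(0, 1), (-1, 1), (-1, 0), (-1, -1)]     # 0, 45, 90, 135 degrees
feats = [f for off in offsets for f in haralick_subset(glcm(img, off))]
# All thirteen descriptors x four orientations gives 52 features per image.
```

The 52-feature count in the study falls out of exactly this arithmetic: thirteen descriptors, each computed at four orientations.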
The Wisconsin Breast Cancer dataset provided a different testing ground: thirty numerical features derived from fine needle aspiration samples, with 357 benign and 212 malignant cases.
Class imbalance varied dramatically. Some mammography configurations showed just six percent positive cases. The Wisconsin dataset proved less severe at thirty-seven percent positive.
The researchers compared three feature selection methods. Logistic Regression and Extreme Gradient Boosting represented established approaches. Grammatical Evolution was the challenger.
For each method, they identified the top five, ten, and fifteen features, then trained an ensemble of eight machine learning classifiers. The top three performers were combined through majority voting to make final predictions. Nine different data augmentation techniques were applied, with results compared against using the full feature set.
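The rank-then-vote step might look roughly like the toy sketch below. It is not the paper's exact ensemble: the pool has five classifiers rather than eight, and for brevity the ranking reuses the same held-out split that produces the final score, which a real pipeline would avoid by ranking on a separate validation set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Mildly imbalanced synthetic stand-in for the mammography features.
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.8], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

# A pool of candidate classifiers (the paper's ensemble has eight).
pool = {
    "lr": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "nb": GaussianNB(),
    "knn": KNeighborsClassifier(),
    "rf": RandomForestClassifier(random_state=0),
}
scores = {name: roc_auc_score(yte, clf.fit(Xtr, ytr).predict_proba(Xte)[:, 1])
          for name, clf in pool.items()}

# Keep the three best performers and combine them by majority vote.
top3 = sorted(scores, key=scores.get, reverse=True)[:3]
vote = VotingClassifier([(n, pool[n]) for n in top3], voting="hard").fit(Xtr, ytr)
acc = (vote.predict(Xte) == yte).mean()
```

Majority voting over the three strongest models tends to smooth out the individual classifiers' idiosyncratic errors without the opacity of a learned meta-model.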
The Performance Gap
Results proved striking. When using all fifty-two features, Logistic Regression and Extreme Gradient Boosting achieved their highest Area Under the Curve scores. This metric balances sensitivity and specificity across different decision thresholds, providing a comprehensive performance measure.
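The AUC's threshold-free interpretation can be verified in a few lines: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one (ties counted as half). The labels and scores below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y = np.array([0, 0, 1, 0, 1, 1, 0, 1])                  # true labels
s = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.2, 0.7])  # model scores

# AUC = P(score of a random positive > score of a random negative).
pos, neg = s[y == 1], s[y == 0]
auc_manual = np.mean([(p > n) + 0.5 * (p == n) for p in pos for n in neg])
auc_sklearn = roc_auc_score(y, s)
```

Because it integrates over all decision thresholds, the AUC is insensitive to class imbalance in a way that raw accuracy is not, which is why it is the headline metric in this study.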
Grammatical Evolution, however, peaked with just ten or fifteen features, outperforming the other methods even when they utilized the complete feature set. For the mammography dataset, Grammatical Evolution achieved AUC scores between 0.90 and 0.94 across different experimental configurations with reduced feature sets. Logistic Regression topped out between 0.66 and 0.72. Extreme Gradient Boosting reached 0.61 to 0.83.
Why the difference? Logistic Regression and Extreme Gradient Boosting rank features by their immediate, marginal contribution to accuracy rather than by how well a subset performs as a whole. Many features add a small gain in combination but little predictive power in isolation, so these methods need the entire feature set to reach peak performance.
Grammatical Evolution's iterative evolution focuses on maximizing AUC directly. Only consistently informative features survive repeated evaluation across multiple runs. The approach optimizes the feature set for predictive performance rather than incremental gains.
For the Wisconsin dataset, results differed slightly. All three methods performed well, with AUC scores clustering between 0.96 and 0.99. The simpler, more structured nature of this dataset meant smaller feature subsets sufficed across all approaches.
Statistical analysis confirmed Grammatical Evolution's advantage. The DeLong test compared AUC scores between methods, revealing significant differences in thirteen of fifteen comparisons between Grammatical Evolution and Logistic Regression, and nine of fifteen comparisons between Grammatical Evolution and Extreme Gradient Boosting.
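The DeLong test itself has no off-the-shelf SciPy implementation; as a rough stand-in that captures the same idea, two models scored on the same cases can be compared with a paired bootstrap of the AUC difference. This sketch is not the paper's procedure, and the two "models" here are just noisy synthetic scorers of different quality.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, n)
# Two hypothetical models scoring the same cases: one sharp, one noisy.
s_a = y + rng.normal(scale=0.8, size=n)
s_b = y + rng.normal(scale=2.0, size=n)

diffs = []
for _ in range(1000):
    idx = rng.integers(0, n, n)          # resample cases, keeping pairs aligned
    if len(set(y[idx])) < 2:             # skip degenerate single-class resamples
        continue
    diffs.append(roc_auc_score(y[idx], s_a[idx]) - roc_auc_score(y[idx], s_b[idx]))
diffs = np.array(diffs)

ci = np.percentile(diffs, [2.5, 97.5])   # 95% interval for the AUC difference
significant = not (ci[0] <= 0 <= ci[1])  # zero outside the interval?
```

The DeLong test reaches the same kind of conclusion analytically, from the covariance structure of the two AUC estimates, without resampling.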
Among data augmentation techniques, STEM consistently outperformed alternatives, securing top rankings in four of five experimental setups.
Which Features Matter?
Feature importance analysis revealed another crucial distinction. When examining the top fifteen features selected by each method, Logistic Regression and Extreme Gradient Boosting typically identified a single dominant feature—usually contrast—with scores above 0.6 on a normalized scale. Other features registered minimal importance.
Grammatical Evolution consistently identified multiple features with scores exceeding 0.6. For mammography segments, it highlighted inverse difference moment, difference entropy, difference variance, angular second moment, and correlation. The Wisconsin dataset analysis identified radius, fractal dimension, and concavity measurements.
This diversity matters. Cancer detection rarely hinges on a single characteristic. Tissue texture exhibits complex, multifaceted patterns. A method that identifies multiple relevant features provides richer diagnostic information and more robust predictions.
The Interpretability Advantage
Beyond performance metrics, Grammatical Evolution offers something equally valuable: transparency. The evolved solutions are explicitly defined mathematical expressions, not nested decision trees or complex neural networks. Medical professionals can examine these expressions, understanding precisely how features combine to produce predictions.
Extreme Gradient Boosting's ensemble of decision trees grows increasingly complex as data complexity increases. Feature importance scores indicate which variables matter but reveal little about underlying relationships.
Logistic Regression assumes linear relationships between features and outcomes. Its coefficients rank feature contributions but become unstable and misleading when data exhibits multicollinearity. It cannot capture the non-linear relationships that characterize real biological systems.
Grammatical Evolution evolves complex, non-linear relationships without sacrificing interpretability. The evolved rules remain human-readable, allowing validation of the selection process and verification that selected features align with medical knowledge.
This transparency builds trust. Clinicians are more likely to adopt diagnostic tools when they understand the reasoning behind recommendations.
Looking Forward
The implications extend beyond breast cancer. Medical imaging generates high-dimensional data across numerous applications—lung nodule detection, diabetic retinopathy screening, skin lesion classification. Each faces similar challenges: class imbalance, high dimensionality, and the critical need for interpretability.
Current work focuses on mammography and fine needle aspiration data. Future research might explore wavelet transforms and additional feature types, potentially improving results further. Testing across imaging modalities would establish whether the approach generalizes broadly.
Another promising direction involves deep learning integration. Convolutional neural networks excel at automated feature extraction from images. Combining their pattern recognition capabilities with Grammatical Evolution's interpretable feature selection might yield systems that both perform exceptionally and explain their reasoning clearly.
The medical field has long grappled with a fundamental tension: the most accurate models are often the least interpretable, while interpretable models sometimes sacrifice performance. This research suggests that trade-off may be false. With the right approach, we can have both accuracy and transparency—and in medicine, both are essential.
Credit & Disclaimer: This article is a popular science summary written to make peer-reviewed research accessible to a broad audience. All scientific facts, findings, and conclusions presented here are drawn directly and accurately from the original research paper. Readers are strongly encouraged to consult the full research article for complete data, methodologies, and scientific detail. The article can be accessed through https://doi.org/10.1007/s42979-025-03840-6
Medical Disclaimer: This article is for informational and educational purposes only and does not constitute medical advice, diagnosis, or treatment. Always seek the advice of your physician or another qualified health provider with any questions you may have regarding a medical condition. Never disregard professional medical advice or delay in seeking it because of something you have read in this publication.