A feature-stable and explainable machine learning framework for trustworthy decision-making under incomplete clinical data
- URL: http://arxiv.org/abs/2602.17364v1
- Date: Thu, 19 Feb 2026 13:48:10 GMT
- Title: A feature-stable and explainable machine learning framework for trustworthy decision-making under incomplete clinical data
- Authors: Justyna Andrys-Olek, Paulina Tworek, Luca Gherardini, Mark W. Ruddock, Mary Jo Kurt, Peter Fitzgerald, Jose Sousa,
- Abstract summary: We present CACTUS (Comprehensive Abstraction and Classification Tool for Uncovering Structures), an explainable machine learning framework.<n>CACTUS integrates feature abstraction, interpretable classification, and systematic feature stability analysis to quantify how consistently informative features are preserved as data quality degrades.<n>We demonstrate that CACTUS achieves competitive or superior predictive performance while maintaining markedly higher stability of top-ranked features as missingness increases.
- Score: 0.14658400971135646
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning models are increasingly applied to biomedical data, yet their adoption in high stakes domains remains limited by poor robustness, limited interpretability, and instability of learned features under realistic data perturbations, such as missingness. In particular, models that achieve high predictive performance may still fail to inspire trust if their key features fluctuate when data completeness changes, undermining reproducibility and downstream decision-making. Here, we present CACTUS (Comprehensive Abstraction and Classification Tool for Uncovering Structures), an explainable machine learning framework explicitly designed to address these challenges in small, heterogeneous, and incomplete clinical datasets. CACTUS integrates feature abstraction, interpretable classification, and systematic feature stability analysis to quantify how consistently informative features are preserved as data quality degrades. Using a real-world haematuria cohort comprising 568 patients evaluated for bladder cancer, we benchmark CACTUS against widely used machine learning approaches, including random forests and gradient boosting methods, under controlled levels of randomly introduced missing data. We demonstrate that CACTUS achieves competitive or superior predictive performance while maintaining markedly higher stability of top-ranked features as missingness increases, including in sex-stratified analyses. Our results show that feature stability provides information complementary to conventional performance metrics and is essential for assessing the trustworthiness of machine learning models applied to biomedical data. By explicitly quantifying robustness to missing data and prioritising interpretable, stable features, CACTUS offers a generalizable framework for trustworthy data-driven decision support.
Related papers
- Uncertainty-Aware Data-Efficient AI: An Information-Theoretic Perspective [48.073471560778984]
In context-specific applications such as robotics, telecommunications, and healthcare, artificial intelligence systems often face the challenge of limited training data.<n>This review paper examines formal methodologies that address data-limited regimes through two complementary approaches.
arXiv Detail & Related papers (2025-12-04T21:44:22Z) - LUME-DBN: Full Bayesian Learning of DBNs from Incomplete data in Intensive Care [0.41292255339309664]
We propose a novel Gibbs sampling-based method for learning Dynamic Bayesian networks from incomplete data.<n>We evaluate our method on both simulated datasets and real-world intensive care data from critically ill patients.
arXiv Detail & Related papers (2025-11-06T13:13:39Z) - Robust Molecular Property Prediction via Densifying Scarce Labeled Data [53.24886143129006]
In drug discovery, compounds most critical for advancing research often lie beyond the training set.<n>We propose a novel bilevel optimization approach that leverages unlabeled data to interpolate between in-distribution (ID) and out-of-distribution (OOD) data.
arXiv Detail & Related papers (2025-06-13T15:27:40Z) - Improving Omics-Based Classification: The Role of Feature Selection and Synthetic Data Generation [0.18846515534317262]
This study presents a machine learning based classification framework that integrates feature selection with data augmentation techniques.<n>We show that the proposed pipeline yields cross validated perfomance on small dataset.
arXiv Detail & Related papers (2025-05-06T10:09:50Z) - Conformal uncertainty quantification to evaluate predictive fairness of foundation AI model for skin lesion classes across patient demographics [8.692647930497936]
We use conformal analysis to quantify the predictive uncertainty of a vision transformer based foundation model.<n>We show how this can be used as a fairness metric to evaluate the robustness of the feature embeddings of the foundation model.
arXiv Detail & Related papers (2025-03-31T08:06:00Z) - Explainability of Point Cloud Neural Networks Using SMILE: Statistical Model-Agnostic Interpretability with Local Explanations [0.0]
This study explores the implementation of SMILE, a novel explainability method originally designed for deep neural networks, on point cloud-based models.
The approach demonstrates superior performance in terms of fidelity loss, R2 scores, and robustness across various kernel widths, perturbation numbers, and clustering configurations.
The study further identifies dataset biases in the classification of the 'person' category, emphasizing the necessity for more comprehensive datasets in safety-critical applications.
arXiv Detail & Related papers (2024-10-20T12:13:59Z) - Machine Learning for ALSFRS-R Score Prediction: Making Sense of the Sensor Data [44.99833362998488]
Amyotrophic Lateral Sclerosis (ALS) is a rapidly progressive neurodegenerative disease that presents individuals with limited treatment options.
The present investigation, spearheaded by the iDPP@CLEF 2024 challenge, focuses on utilizing sensor-derived data obtained through an app.
arXiv Detail & Related papers (2024-07-10T19:17:23Z) - Towards Reliable Medical Image Segmentation by Modeling Evidential Calibrated Uncertainty [57.023423137202485]
Concerns regarding the reliability of medical image segmentation persist among clinicians.<n>We introduce DEviS, an easily implementable foundational model that seamlessly integrates into various medical image segmentation networks.<n>By leveraging subjective logic theory, we explicitly model probability and uncertainty for medical image segmentation.
arXiv Detail & Related papers (2023-01-01T05:02:46Z) - Accurate and Robust Feature Importance Estimation under Distribution
Shifts [49.58991359544005]
PRoFILE is a novel feature importance estimation method.
We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
arXiv Detail & Related papers (2020-09-30T05:29:01Z) - Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z) - Self-Training with Improved Regularization for Sample-Efficient Chest
X-Ray Classification [80.00316465793702]
We present a deep learning framework that enables robust modeling in challenging scenarios.
Our results show that using 85% lesser labeled data, we can build predictive models that match the performance of classifiers trained in a large-scale data setting.
arXiv Detail & Related papers (2020-05-03T02:36:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.