A Copula Based Supervised Filter for Feature Selection in Diabetes Risk Prediction Using Machine Learning
- URL: http://arxiv.org/abs/2505.22554v4
- Date: Wed, 08 Oct 2025 04:03:38 GMT
- Title: A Copula Based Supervised Filter for Feature Selection in Diabetes Risk Prediction Using Machine Learning
- Authors: Agnideep Aich, Md Monzur Murshed, Sameera Hewage, Amanda Mayeaux,
- Abstract summary: We propose a computationally efficient supervised filter that ranks features using the Gumbel copula upper tail dependence coefficient ($\lambda_U$). We benchmarked against Mutual Information, mRMR, ReliefF, and $L_1$ Elastic Net across four classifiers on two diabetes datasets. We conclude that copula based feature selection via upper tail dependence is a powerful, efficient, and interpretable approach for building risk models in public health and clinical medicine.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effective feature selection is vital for robust and interpretable medical prediction, especially for identifying risk factors concentrated in extreme patient strata. Standard methods emphasize average associations and may miss predictors whose importance lies in the tails of the distribution. We propose a computationally efficient supervised filter that ranks features using the Gumbel copula upper tail dependence coefficient ($\lambda_U$), prioritizing variables that are simultaneously extreme with the positive class. We benchmarked against Mutual Information, mRMR, ReliefF, and $L_1$ Elastic Net across four classifiers on two diabetes datasets: a large public health survey (CDC, N=253,680) and a clinical benchmark (PIMA, N=768). Evaluation included paired statistical tests, permutation importance, and robustness checks with label flips, feature noise, and missingness. On CDC, our method was the fastest selector and reduced the feature space by about 52% while retaining strong discrimination. Although using all 21 features yielded the highest AUC, our filter significantly outperformed Mutual Information and mRMR and was statistically indistinguishable from ReliefF. On PIMA, with only eight predictors, our ranking produced the numerically highest ROC AUC, and no significant differences were found versus strong baselines. Across both datasets, the upper tail criterion consistently identified clinically coherent, impactful predictors. We conclude that copula based feature selection via upper tail dependence is a powerful, efficient, and interpretable approach for building risk models in public health and clinical medicine.
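The ranking criterion described in the abstract can be illustrated with a minimal sketch. For a Gumbel copula with parameter $\theta \ge 1$, the upper tail dependence coefficient is $\lambda_U = 2 - 2^{1/\theta}$. The sketch assumes that $\theta$ is estimated by inverting Kendall's tau ($\theta = 1/(1-\tau)$), which is one standard estimator and not necessarily the one the authors use. The function names `gumbel_upper_tail` and `rank_features` are hypothetical and do not come from the paper.

```python
import numpy as np
from scipy.stats import kendalltau


def gumbel_upper_tail(x, y):
    """Estimate the Gumbel copula upper tail dependence lambda_U for (x, y).

    Assumption: theta is obtained by Kendall's tau inversion,
    theta = 1 / (1 - tau), so lambda_U = 2 - 2 ** (1 / theta).
    """
    tau, _ = kendalltau(x, y)
    # The Gumbel copula only models non-negative dependence; treat
    # undefined or non-positive tau as no upper tail dependence.
    if not np.isfinite(tau) or tau <= 0:
        return 0.0
    tau = min(tau, 1.0 - 1e-12)  # avoid division by zero when tau == 1
    theta = 1.0 / (1.0 - tau)
    return 2.0 - 2.0 ** (1.0 / theta)


def rank_features(X, y, k):
    """Score each column of X against the label y and return the top-k indices."""
    scores = np.array([gumbel_upper_tail(X[:, j], y) for j in range(X.shape[1])])
    order = np.argsort(-scores)  # descending by lambda_U
    return order[:k], scores
```

Because the filter scores each feature independently against the label, its cost grows linearly with the number of features, which is consistent with the abstract's description of it as computationally efficient.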
Related papers
- Retrieving Patient-Specific Radiomic Feature Sets for Transparent Knee MRI Assessment [42.97456036889799]
Classical radiomic features are designed to quantify image appearance and intensity patterns. Recent work on adaptive radiomics uses DL to predict feature weights over a radiomic pool, then thresholds these weights to retain the top-k features from a large radiomic pool F. We propose a patient-specific feature-set selection framework that predicts a single compact feature set per subject.
arXiv Detail & Related papers (2026-03-02T20:12:41Z) - ROOFS: RObust biOmarker Feature Selection [0.4065263202661619]
Roofs is a Python package designed to help researchers choose FS methods adapted to their problem. We demonstrate the utility of roofs on data from the PIONeeR clinical trial, aimed at identifying predictors of resistance to anti-PD-(L)1 immunotherapy in lung cancer.
arXiv Detail & Related papers (2026-01-08T17:41:07Z) - Assessing the Feasibility of Early Cancer Detection Using Routine Laboratory Data: An Evaluation of Machine Learning Approaches on an Imbalanced Dataset [0.02030567625639093]
The development of accessible screening tools for early cancer detection in dogs represents a significant challenge in veterinary medicine. This study assesses the feasibility of cancer risk classification using the Golden Retriever Lifetime Study cohort under real-world constraints. It is concluded that while a statistically detectable cancer signal exists in routine lab data, it is too weak and confounded for clinically reliable discrimination from normal aging or other inflammatory conditions.
arXiv Detail & Related papers (2025-10-23T04:52:42Z) - Cross-Representation Benchmarking in Time-Series Electronic Health Records for Clinical Outcome Prediction [44.23284500920266]
This benchmark standardises data curation and evaluation across two distinct clinical settings. Experiments reveal that event stream models consistently deliver the strongest performance. We find that feature selection strategies must be adapted to the clinical setting.
arXiv Detail & Related papers (2025-10-10T09:03:47Z) - CopulaSMOTE: A Copula-Based Oversampling Approach for Imbalanced Classification in Diabetes Prediction [0.0]
This study considered copula-based data augmentation, which preserves the dependency structure when generating data for the minority class. XGBoost combined with A2 copula oversampling achieved the best performance, improving accuracy by 4.6%, precision by 15.6%, recall by 20.4%, F1-score by 18.2%, and AUC by 25.5%. This research represents the first known use of A2 copulas for data augmentation and serves as an alternative to the SMOTE technique.
arXiv Detail & Related papers (2025-06-18T22:21:40Z) - Predicting Diabetes Using Machine Learning: A Comparative Study of Classifiers [0.0]
Diabetes remains a significant health challenge globally, contributing to severe complications like kidney disease, vision loss, and heart issues. Our study introduces an innovative diabetes prediction framework, leveraging both traditional ML techniques and advanced ensemble methods. Central to our approach is the development of a novel model, DNet, a hybrid architecture combining Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) layers.
arXiv Detail & Related papers (2025-05-11T16:14:31Z) - From Glucose Patterns to Health Outcomes: A Generalizable Foundation Model for Continuous Glucose Monitor Data Analysis [47.23780364438969]
We present GluFormer, a generative foundation model for CGM data that learns nuanced glycemic patterns and translates them into predictive representations of metabolic health. GluFormer generalizes to 19 external cohorts spanning different ethnicities and ages, 5 countries, 8 CGM devices, and diverse pathophysiological states. In a longitudinal study of 580 adults with CGM data and 12-year follow-up, GluFormer identifies individuals at elevated risk of developing diabetes more effectively than blood HbA1C%.
arXiv Detail & Related papers (2024-08-20T13:19:06Z) - A data balancing approach towards design of an expert system for Heart Disease Prediction [0.9895793818721335]
Heart disease is a serious global health issue that claims millions of lives every year.
We employed five machine learning methods in this paper: Decision Tree (DT), Random Forest (RF), Linear Discriminant Analysis, Extra TreeBoost, and AdaBoost.
The accuracy of the Random Forest and Decision Tree model was 99.83%.
arXiv Detail & Related papers (2024-07-26T08:56:13Z) - Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank [69.90493129893112]
Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of non-European descent individuals.
Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data.
arXiv Detail & Related papers (2024-04-26T16:39:50Z) - A comparative study on feature selection for a risk prediction model for colorectal cancer [0.0]
This work is focused on colorectal cancer, assessing several feature ranking algorithms in terms of performance for a set of risk prediction models.
A visual approach proposed in this work shows that the Neural Network-based wrapper ranking is the most unstable, while the Random Forest is the most stable.
arXiv Detail & Related papers (2024-02-07T22:14:14Z) - The Conditional Prediction Function: A Novel Technique to Control False Discovery Rate for Complex Models [0.0]
We introduce a knockoff statistic based on the conditional prediction function (CPF), which can pair with state-of-the-art machine learning predictive models.
CPF statistics can capture the nonlinear relationships between predictors and outcomes while also accounting for correlation between features.
arXiv Detail & Related papers (2023-10-07T21:16:09Z) - Learning to diagnose cirrhosis from radiological and histological labels with joint self and weakly-supervised pretraining strategies [62.840338941861134]
We propose to leverage transfer learning from large datasets annotated by radiologists, to predict the histological score available on a small annex dataset.
We compare different pretraining methods, namely weakly-supervised and self-supervised ones, to improve the prediction of cirrhosis.
This method outperforms the baseline classification of the METAVIR score, reaching an AUC of 0.84 and a balanced accuracy of 0.75.
arXiv Detail & Related papers (2023-02-16T17:06:23Z) - Secure and Privacy-Preserving Automated Machine Learning Operations into End-to-End Integrated IoT-Edge-Artificial Intelligence-Blockchain Monitoring System for Diabetes Mellitus Prediction [0.5825410941577593]
This paper proposes an IoT-edge-Artificial Intelligence (AI)-blockchain system for diabetes prediction based on risk factors.
The proposed system is underpinned by the blockchain to obtain a cohesive view of the risk factors data from patients across different hospitals.
Numerical experiments and comparative analysis were carried out between our proposed system, using the most accurate random forest (RF) model.
arXiv Detail & Related papers (2022-11-13T13:57:14Z) - Building Brains: Subvolume Recombination for Data Augmentation in Large Vessel Occlusion Detection [56.67577446132946]
A large training data set is required for a standard deep learning-based model to learn this strategy from data.
We propose an augmentation method that generates artificial training samples by recombining vessel tree segmentations of the hemispheres from different patients.
In line with the augmentation scheme, we use a 3D-DenseNet fed with task-specific input, fostering a side-by-side comparison between the hemispheres.
arXiv Detail & Related papers (2022-05-05T10:31:57Z) - Development of a dynamic type 2 diabetes risk prediction tool: a UK Biobank study [0.8620335948752806]
We developed a predictive 10-year type 2 diabetes risk score using 301 features from the UK Biobank dataset.
A Cox proportional hazards model slightly outperformed a DeepSurv model trained using the same features.
This tool can be used for clinical screening of individuals at risk of developing type 2 diabetes and to foster patient empowerment.
arXiv Detail & Related papers (2021-04-20T16:37:26Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z) - Estimating and Improving Fairness with Adversarial Learning [65.99330614802388]
We propose an adversarial multi-task training strategy to simultaneously mitigate and detect bias in the deep learning-based medical image analysis system.
Specifically, we propose to add a discrimination module against bias and a critical module that predicts unfairness within the base classification model.
We evaluate our framework on a large-scale publicly available skin lesion dataset.
arXiv Detail & Related papers (2021-03-07T03:10:32Z) - Federated Deep AUC Maximization for Heterogeneous Data with a Constant Communication Complexity [77.78624443410216]
We propose improved FDAM algorithms for detecting heterogeneous chest data.
A result of this paper is that the communication of the proposed algorithm is strongly independent of the number of machines and also independent of the accuracy level.
Experiments have demonstrated the effectiveness of our FDAM algorithm on benchmark datasets and on medical chest Xray images from different organizations.
arXiv Detail & Related papers (2021-02-09T04:05:19Z) - UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced Data [81.00385374948125]
We present the UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD).
UNITE achieves up to 0.841 in F1 score for AD detection, up to 0.609 in PR-AUC for NASH detection, and outperforms various state-of-the-art baselines by up to 19% over the best baseline.
arXiv Detail & Related papers (2020-10-22T02:28:11Z) - Short Term Blood Glucose Prediction based on Continuous Glucose Monitoring Data [53.01543207478818]
This study explores the use of Continuous Glucose Monitoring (CGM) data as input for digital decision support tools.
We investigate how Recurrent Neural Networks (RNNs) can be used for Short Term Blood Glucose (STBG) prediction.
arXiv Detail & Related papers (2020-02-06T16:39:44Z)