Can Copulas Be Used for Feature Selection? A Machine Learning Study on Diabetes Risk Prediction
- URL: http://arxiv.org/abs/2505.22554v1
- Date: Wed, 28 May 2025 16:34:58 GMT
- Title: Can Copulas Be Used for Feature Selection? A Machine Learning Study on Diabetes Risk Prediction
- Authors: Agnideep Aich, Md Monzur Murshed, Amanda Mayeaux, Sameera Hewage,
- Abstract summary: We introduce a feature-selection framework using the upper-tail dependence coefficient ($\lambda_U$) of the novel A2 copula. Our method prioritizes five predictors based on upper-tail dependencies. These features match or outperform MI- and GA-selected subsets across four classifiers.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurate diabetes risk prediction relies on identifying key features from complex health datasets, but conventional methods like mutual information (MI) filters and genetic algorithms (GAs) often overlook extreme dependencies critical for high-risk subpopulations. In this study, we introduce a feature-selection framework using the upper-tail dependence coefficient ($\lambda_U$) of the novel A2 copula, which quantifies how often extremely high values of a predictor co-occur with diabetes diagnoses (the target variable). Applied to the CDC Diabetes Health Indicators dataset (n=253,680), our method prioritizes five predictors (self-reported general health, high blood pressure, body mass index, mobility limitations, and high cholesterol levels) based on upper-tail dependencies. These features match or outperform MI- and GA-selected subsets across four classifiers (Random Forest, XGBoost, Logistic Regression, Gradient Boosting), achieving accuracy up to 86.5% (XGBoost) and AUC up to 0.806 (Gradient Boosting), rivaling the full 21-feature model. Permutation importance confirms clinical relevance, with BMI and general health driving accuracy. To our knowledge, this is the first work to apply a copula's upper-tail dependence for supervised feature selection, bridging extreme-value theory and machine learning to deliver a practical toolkit for diabetes prevention.
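The ranking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not give the A2 copula's closed form, so the code below swaps in a generic nonparametric estimate of upper-tail dependence, the fraction of observations where both rank-transformed variables exceed a threshold `q`, divided by `1 - q`. The function names, the threshold `q = 0.9`, and the average-rank handling of ties are all assumptions made for illustration. A binary target such as a diagnosis flag produces heavy ties and would need more care than shown here.

```python
from typing import Dict, List, Sequence, Tuple


def pseudo_observations(values: Sequence[float]) -> List[float]:
    """Map values into (0, 1) using average ranks divided by n + 1."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values so they share one average rank.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return [r / (n + 1) for r in ranks]


def empirical_upper_tail(
    x: Sequence[float], y: Sequence[float], q: float = 0.9
) -> float:
    """Estimate P(U > q, V > q) / (1 - q) from paired samples.

    This is a finite-threshold, nonparametric stand-in for lambda_U
    (an assumption of this sketch, not the A2 copula formula).
    """
    u = pseudo_observations(x)
    v = pseudo_observations(y)
    joint = sum(1 for a, b in zip(u, v) if a > q and b > q)
    return joint / (len(u) * (1.0 - q))


def rank_features(
    features: Dict[str, Sequence[float]],
    target: Sequence[float],
    q: float = 0.9,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the top_k features ordered by estimated upper-tail dependence."""
    scores = {
        name: empirical_upper_tail(column, target, q)
        for name, column in features.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


if __name__ == "__main__":
    # Toy data (hypothetical): one feature moves with the target, one against it.
    target = list(range(100))
    toy = {"comonotone": list(range(100)), "reversed": list(range(99, -1, -1))}
    print(rank_features(toy, target, top_k=2))
```

A feature that is large exactly when the target is large scores close to 1, while a feature that moves against the target scores 0, so sorting by this score puts the strongest joint-extreme predictors first.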
Related papers
- CopulaSMOTE: A Copula-Based Oversampling Approach for Imbalanced Classification in Diabetes Prediction [0.0]
This study considered copula-based data augmentation, which preserves the dependency structure when generating data for the minority class. XGBoost combined with A2 copula oversampling achieved the best performance, improving accuracy by 4.6%, precision by 15.6%, recall by 20.4%, F1-score by 18.2% and AUC by 25.5%. This research represents the first known use of A2 copulas for data augmentation and serves as an alternative to the SMOTE technique.
arXiv Detail & Related papers (2025-06-18T22:21:40Z)
- Predicting Diabetes Using Machine Learning: A Comparative Study of Classifiers [0.0]
Diabetes remains a significant health challenge globally, contributing to severe complications like kidney disease, vision loss, and heart issues. Our study introduces an innovative diabetes prediction framework, leveraging both traditional ML techniques and advanced ensemble methods. Central to our approach is the development of a novel model, DNet, a hybrid architecture combining Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) layers.
arXiv Detail & Related papers (2025-05-11T16:14:31Z)
- From Glucose Patterns to Health Outcomes: A Generalizable Foundation Model for Continuous Glucose Monitor Data Analysis [47.23780364438969]
We present GluFormer, a generative foundation model for CGM data that learns nuanced glycemic patterns and translates them into predictive representations of metabolic health. GluFormer generalizes to 19 external cohorts spanning different ethnicities and ages, 5 countries, 8 CGM devices, and diverse pathophysiological states. In a longitudinal study of 580 adults with CGM data and 12-year follow-up, GluFormer identifies individuals at elevated risk of developing diabetes more effectively than blood HbA1C%.
arXiv Detail & Related papers (2024-08-20T13:19:06Z)
- A data balancing approach towards design of an expert system for Heart Disease Prediction [0.9895793818721335]
Heart disease is a serious global health issue that claims millions of lives every year.
We employed five machine learning methods in this paper: Decision Tree (DT), Random Forest (RF), Linear Discriminant Analysis, Extra TreeBoost, and AdaBoost.
The accuracy of the Random Forest and Decision Tree model was 99.83%.
arXiv Detail & Related papers (2024-07-26T08:56:13Z)
- Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank [69.90493129893112]
Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of non-European descent individuals.
Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data.
arXiv Detail & Related papers (2024-04-26T16:39:50Z)
- Learning to diagnose cirrhosis from radiological and histological labels with joint self and weakly-supervised pretraining strategies [62.840338941861134]
We propose to leverage transfer learning from large datasets annotated by radiologists, to predict the histological score available on a small annex dataset.
We compare different pretraining methods, namely weakly-supervised and self-supervised ones, to improve the prediction of the cirrhosis.
This method outperforms the baseline classification of the METAVIR score, reaching an AUC of 0.84 and a balanced accuracy of 0.75.
arXiv Detail & Related papers (2023-02-16T17:06:23Z)
- Secure and Privacy-Preserving Automated Machine Learning Operations into End-to-End Integrated IoT-Edge-Artificial Intelligence-Blockchain Monitoring System for Diabetes Mellitus Prediction [0.5825410941577593]
This paper proposes an IoT-edge-Artificial Intelligence (AI)-blockchain system for diabetes prediction based on risk factors.
The proposed system is underpinned by the blockchain to obtain a cohesive view of the risk factors data from patients across different hospitals.
Numerical experiments and comparative analysis were carried out between our proposed system, using the most accurate random forest (RF) model.
arXiv Detail & Related papers (2022-11-13T13:57:14Z)
- Building Brains: Subvolume Recombination for Data Augmentation in Large Vessel Occlusion Detection [56.67577446132946]
A large training data set is required for a standard deep learning-based model to learn this strategy from data.
We propose an augmentation method that generates artificial training samples by recombining vessel tree segmentations of the hemispheres from different patients.
In line with the augmentation scheme, we use a 3D-DenseNet fed with task-specific input, fostering a side-by-side comparison between the hemispheres.
arXiv Detail & Related papers (2022-05-05T10:31:57Z)
- Development of a dynamic type 2 diabetes risk prediction tool: a UK Biobank study [0.8620335948752806]
We developed a predictive 10-year type 2 diabetes risk score using 301 features from the UK Biobank dataset.
A Cox proportional hazards model slightly outperformed a DeepSurv model trained using the same features.
This tool can be used for clinical screening of individuals at risk of developing type 2 diabetes and to foster patient empowerment.
arXiv Detail & Related papers (2021-04-20T16:37:26Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced Data [81.00385374948125]
We present the UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD).
UNITE achieves up to 0.841 in F1 score for AD detection and up to 0.609 in PR-AUC for NASH detection, and outperforms various state-of-the-art baselines, with a margin of up to 19% over the best baseline.
arXiv Detail & Related papers (2020-10-22T02:28:11Z)
- Short Term Blood Glucose Prediction based on Continuous Glucose Monitoring Data [53.01543207478818]
This study explores the use of Continuous Glucose Monitoring (CGM) data as input for digital decision support tools.
We investigate how Recurrent Neural Networks (RNNs) can be used for Short Term Blood Glucose (STBG) prediction.
arXiv Detail & Related papers (2020-02-06T16:39:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.