Related papers: CopulaSMOTE: A Copula-Based Oversampling Approach for Imbalanced Classification in Diabetes Prediction

URL: http://arxiv.org/abs/2506.17326v1
Date: Wed, 18 Jun 2025 22:21:40 GMT
Title: CopulaSMOTE: A Copula-Based Oversampling Approach for Imbalanced Classification in Diabetes Prediction
Authors: Agnideep Aich, Md Monzur Murshed, Sameera Hewage, Amanda Mayeaux,
Abstract summary: This study considered copula-based data augmentation, which preserves the dependency structure when generating data for the minority class. XGBoost combined with A2 copula oversampling achieved the best performance, improving accuracy by 4.6%, precision by 15.6%, recall by 20.4%, F1-score by 18.2% and AUC by 25.5% compared to the standard SMOTE method. This research represents the first known use of A2 copulas for data augmentation and serves as an alternative to the SMOTE technique.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diabetes mellitus poses a significant health risk, as nearly 1 in 9 people are affected by it. Early detection can significantly lower this risk. Despite significant advancements in machine learning for identifying diabetic cases, results can still be influenced by the imbalanced nature of the data. To address this challenge, our study considered copula-based data augmentation, which preserves the dependency structure when generating data for the minority class and integrates it with machine learning (ML) techniques. We selected the Pima Indian dataset and generated data using A2 copula, then applied four machine learning algorithms: logistic regression, random forest, gradient boosting, and extreme gradient boosting. Our findings indicate that XGBoost combined with A2 copula oversampling achieved the best performance, improving accuracy by 4.6%, precision by 15.6%, recall by 20.4%, F1-score by 18.2% and AUC by 25.5% compared to the standard SMOTE method. Furthermore, we statistically validated our results using the McNemar test. This research represents the first known use of A2 copulas for data augmentation and serves as an alternative to the SMOTE technique, highlighting the efficacy of copulas as a statistical method in machine learning applications.

Related papers

Efficient Federated Learning with Heterogeneous Data and Adaptive Dropout [62.73150122809138]
Federated Learning (FL) is a promising distributed machine learning approach that enables collaborative training of a global model using multiple edge devices. We propose the FedDHAD FL framework, which comes with two novel methods: Dynamic Heterogeneous model aggregation (FedDH) and Adaptive Dropout (FedAD). The combination of these two methods makes FedDHAD significantly outperform state-of-the-art solutions in terms of accuracy (up to 6.7% higher), efficiency (up to 2.02 times faster), and cost (up to 15.0% smaller).
arXiv Detail & Related papers (2025-07-14T16:19:00Z)
Can Copulas Be Used for Feature Selection? A Machine Learning Study on Diabetes Risk Prediction [0.0]
We introduce a feature-selection framework using the upper-tail dependence coefficient (λ_U) of the novel A2 copula. Our method prioritizes five predictors based on upper tail dependencies. These features match or outperform MI and GA selected subsets across four classifiers.
arXiv Detail & Related papers (2025-05-28T16:34:58Z)
A Novel Double Pruning method for Imbalanced Data using Information Entropy and Roulette Wheel Selection for Breast Cancer Diagnosis [2.8661021832561757]
The SMOTEBoost method generates synthetic data to balance the dataset, but it may overlook crucial overlapping regions near the decision boundary. This paper proposes RE-SMOTEBoost, an enhanced version of SMOTEBoost, designed to overcome these limitations. It incorporates a filtering mechanism based on information entropy to reduce noise and borderline cases and to improve the quality of generated data.
arXiv Detail & Related papers (2025-03-15T19:34:15Z)
Evaluating the Impact of Data Augmentation on Predictive Model Performance [0.05624791703748109]
This paper systematically compares data augmentation techniques and their impact on prediction performance. Among 21 augmentation techniques, SMOTE-ENN sampling performed the best, improving the average AUC by 0.01. Some augmentation techniques significantly lowered predictive performance or increased performance fluctuation related to random chance.
arXiv Detail & Related papers (2024-12-03T03:03:04Z)
Comprehensive Methodology for Sample Augmentation in EEG Biomarker Studies for Alzheimers Risk Classification [0.0]
Alzheimer's disease (AD), the leading type of dementia, accounts for 70% of cases. EEG measures show promise in identifying AD risk, but obtaining large samples for reliable comparisons is challenging. This study integrates signal processing, harmonization, and statistical techniques to enhance sample size and improve AD risk classification reliability.
arXiv Detail & Related papers (2024-11-20T10:31:02Z)
Machine Learning for ALSFRS-R Score Prediction: Making Sense of the Sensor Data [44.99833362998488]
Amyotrophic Lateral Sclerosis (ALS) is a rapidly progressive neurodegenerative disease that presents individuals with limited treatment options. The present investigation, spearheaded by the iDPP@CLEF 2024 challenge, focuses on utilizing sensor-derived data obtained through an app.
arXiv Detail & Related papers (2024-07-10T19:17:23Z)
AXIAL: Attention-based eXplainability for Interpretable Alzheimer's Localized Diagnosis using 2D CNNs on 3D MRI brain scans [43.06293430764841]
This study presents an innovative method for Alzheimer's disease diagnosis using 3D MRI designed to enhance the explainability of model decisions. Our approach adopts a soft attention mechanism, enabling 2D CNNs to extract volumetric representations. With voxel-level precision, our method identifies the specific brain regions that receive the most attention.
arXiv Detail & Related papers (2024-07-02T16:44:00Z)
Estimating Heterogeneous Treatment Effects by Combining Weak Instruments and Observational Data [44.31792000298105]
Accurately predicting conditional average treatment effects (CATEs) is crucial in personalized medicine and digital platform analytics. We develop a novel approach to combine IV and observational data to enable reliable CATE estimation.
arXiv Detail & Related papers (2024-06-10T16:40:55Z)
Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation [56.13803674092712]
We propose an industrial-friendly, expert-aligned and diversity-preserved instruction data selection method: Clustering and Ranking (CaR). CaR employs a two-step process: first, it ranks instruction pairs using a high-accuracy (84.25%) scoring model aligned with expert preferences; second, it preserves dataset diversity through clustering. In our experiment, CaR efficiently selected a mere 1.96% of Alpaca's IT data, yet the resulting AlpaCaR model surpassed Alpaca's performance by an average of 32.1% in GPT-4 evaluations.
arXiv Detail & Related papers (2024-02-28T09:27:29Z)
AUC-mixup: Deep AUC Maximization with Mixup [47.99058341229214]
AUC is defined over positive and negative pairs, which makes it challenging to incorporate mixup data augmentation into DAM. We employ the AUC margin loss and soft labels into the formulation to effectively learn from data generated by mixup. Our experimental results demonstrate the effectiveness of the proposed AUC-mixup methods on imbalanced benchmark and medical image datasets.
arXiv Detail & Related papers (2023-10-18T03:43:11Z)
Drug Synergistic Combinations Predictions via Large-Scale Pre-Training and Graph Structure Learning [82.93806087715507]
Drug combination therapy is a well-established strategy for disease treatment with better effectiveness and less safety degradation. Deep learning models have emerged as an efficient way to discover synergistic combinations. Our framework achieves state-of-the-art results in comparison with other deep learning-based methods.
arXiv Detail & Related papers (2023-01-14T15:07:43Z)
SOUL: An Energy-Efficient Unsupervised Online Learning Seizure Detection Classifier [68.8204255655161]
Implantable devices that record neural activity and detect seizures have been adopted to issue warnings or trigger neurostimulation to suppress seizures. For an implantable seizure detection system, a low power, at-the-edge, online learning algorithm can be employed to dynamically adapt to neural signal drifts. SOUL was fabricated in TSMC's 28 nm process occupying 0.1 mm² and achieves 1.5 nJ/classification energy efficiency, which is at least 24x more efficient than state-of-the-art.
arXiv Detail & Related papers (2021-10-01T23:01:20Z)
Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model. We introduce two unique positive sampling strategies specifically tailored for EHR data. Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
Uncertainty-Aware Semi-supervised Method using Large Unlabelled and Limited Labeled COVID-19 Data [14.530328267425638]
We propose a Semi-supervised Classification using Limited Labelled Data (SCLLD) method for automated COVID-19 detection. The proposed system is trained using 10,000 CT scans collected from Omid hospital. Our method significantly outperforms the supervised training of a Convolutional Neural Network (CNN) when labelled training data is scarce.
arXiv Detail & Related papers (2021-02-12T08:20:20Z)
CovidDeep: SARS-CoV-2/COVID-19 Test Based on Wearable Medical Sensors and Efficient Neural Networks [51.589769497681175]
The novel coronavirus (SARS-CoV-2) has led to a pandemic. The current testing regime based on Reverse Transcription-Polymerase Chain Reaction for SARS-CoV-2 has been unable to keep up with testing demands. We propose a framework called CovidDeep that combines efficient DNNs with commercially available WMSs for pervasive testing of the virus.
arXiv Detail & Related papers (2020-07-20T21:47:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.