A Novel Double Pruning method for Imbalanced Data using Information Entropy and Roulette Wheel Selection for Breast Cancer Diagnosis
- URL: http://arxiv.org/abs/2503.12239v1
- Date: Sat, 15 Mar 2025 19:34:15 GMT
- Title: A Novel Double Pruning method for Imbalanced Data using Information Entropy and Roulette Wheel Selection for Breast Cancer Diagnosis
- Authors: Soufiane Bacha, Huansheng Ning, Belarbi Mostefa, Doreen Sebastian Sarwatt, Sahraoui Dhelim,
- Abstract summary: The SMOTEBoost method generates synthetic data to balance the dataset, but it may overlook crucial overlapping regions near the decision boundary. This paper proposes RE-SMOTEBoost, an enhanced version of SMOTEBoost, designed to overcome these limitations. It incorporates a filtering mechanism based on information entropy to reduce noise and borderline cases and improve the quality of generated data.
- Score: 2.8661021832561757
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate illness diagnosis is vital for effective treatment and patient safety. Machine learning models are widely used for cancer diagnosis based on historical medical data. However, data imbalance remains a major challenge, hindering classifier performance and reliability. The SMOTEBoost method addresses this issue by generating synthetic data to balance the dataset, but it may overlook crucial overlapping regions near the decision boundary and can produce noisy samples. This paper proposes RE-SMOTEBoost, an enhanced version of SMOTEBoost, designed to overcome these limitations. Firstly, RE-SMOTEBoost focuses on generating synthetic samples in overlapping regions to better capture the decision boundary using roulette wheel selection. Secondly, it incorporates a filtering mechanism based on information entropy to reduce noise and borderline cases and improve the quality of generated data. Thirdly, we introduce a double regularization penalty to control the synthetic samples' proximity to the decision boundary and avoid class overlap. These enhancements enable higher-quality oversampling of the minority class, resulting in a more balanced and effective training dataset. The proposed method outperforms existing state-of-the-art techniques when evaluated on imbalanced datasets. Compared to the top-performing sampling algorithms, RE-SMOTEBoost demonstrates a notable improvement of 3.22% in accuracy and a variance reduction of 88.8%. These results indicate that the proposed model offers a solid solution for medical settings, effectively overcoming data scarcity and severe imbalance caused by limited samples, data collection difficulties, and privacy constraints.
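The abstract names two building blocks, roulette wheel selection and information entropy, without showing how they could fit together. The following is a minimal Python sketch of the general idea only, not the authors' RE-SMOTEBoost implementation. The function names, the use of k-nearest-neighbor label entropy as a selection weight, and the choice of a random minority point as the interpolation partner are all assumptions made for illustration; the double regularization penalty and the boosting loop are not modeled.

```python
import math
import random
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    probs = [count / total for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in probs)


def knn_labels(i, points, labels, k):
    """Labels of the k nearest neighbors of points[i], excluding itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return [labels[j] for j in others[:k]]


def roulette_wheel_select(weights, rng):
    """Return an index drawn with probability proportional to its weight."""
    total = sum(weights)
    if total <= 0:
        return rng.randrange(len(weights))  # all weights zero: uniform pick
    threshold = rng.random() * total  # lies in [0, total)
    cumulative = 0.0
    for i, weight in enumerate(weights):
        cumulative += weight
        if threshold < cumulative:
            return i
    return len(weights) - 1  # guard against floating-point round-off


def oversample_near_boundary(points, labels, minority, k, n_new, rng):
    """Create n_new synthetic minority points, favoring mixed-class regions.

    Assumption: a minority point whose neighborhood has high label entropy
    is treated as lying in an overlapping region and gets a larger slice
    of the roulette wheel.
    """
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    weights = [label_entropy(knn_labels(i, points, labels, k))
               for i in minority_idx]
    synthetic = []
    for _ in range(n_new):
        seed = points[minority_idx[roulette_wheel_select(weights, rng)]]
        mate = points[rng.choice(minority_idx)]  # simplification of SMOTE's neighbor choice
        t = rng.random()
        # SMOTE-style linear interpolation between the two minority points.
        synthetic.append(tuple(s + t * (m - s) for s, m in zip(seed, mate)))
    return synthetic
```

Under these assumptions, a call such as `oversample_near_boundary(points, labels, minority=1, k=5, n_new=20, rng=random.Random(0))` returns 20 synthetic tuples; minority points whose neighborhoods contain only one class receive zero weight and are never chosen as seeds unless every weight is zero.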
Related papers
- TarDiff: Target-Oriented Diffusion Guidance for Synthetic Electronic Health Record Time Series Generation [26.116599951658454]
Time-series generation is crucial for advancing clinical machine learning models.
We argue that fidelity to observed data alone does not guarantee better model performance.
We propose TarDiff, a novel target-oriented diffusion framework that integrates task-specific influence guidance.
arXiv Detail & Related papers (2025-04-24T14:36:10Z)
- Wafer Map Defect Classification Using Autoencoder-Based Data Augmentation and Convolutional Neural Network [4.8748194765816955]
This study proposes a novel method combining an autoencoder-based data augmentation technique with a convolutional neural network (CNN).
The proposed method achieves a classification accuracy of 98.56%, surpassing Random Forest, SVM, and Logistic Regression by 19%, 21%, and 27%, respectively.
arXiv Detail & Related papers (2024-11-17T10:19:54Z)
- Improving EEG Classification Through Randomly Reassembling Original and Generated Data with Transformer-based Diffusion Models [12.703528969668062]
We propose a Transformer-based denoising diffusion probabilistic model and a generated data-based augmentation method.
To suit the characteristics of EEG signals, we propose a constant-factor scaling method to preprocess the signals, which reduces the loss of information.
The proposed augmentation method randomly reassembles the generated data with original data in the time-domain to obtain vicinal data.
arXiv Detail & Related papers (2024-07-20T06:58:14Z)
- ROPO: Robust Preference Optimization for Large Language Models [59.10763211091664]
We propose an iterative alignment approach that integrates noise-tolerance and filtering of noisy samples without the aid of external models.
Experiments on three widely-used datasets with Mistral-7B and Llama-2-7B demonstrate that ROPO significantly outperforms existing preference alignment methods.
arXiv Detail & Related papers (2024-04-05T13:58:51Z)
- Few-shot learning for COVID-19 Chest X-Ray Classification with Imbalanced Data: An Inter vs. Intra Domain Study [49.5374512525016]
Medical image datasets are essential for training models used in computer-aided diagnosis, treatment planning, and medical research.
Some challenges are associated with these datasets, including variability in data distribution, data scarcity, and transfer learning issues when using models pre-trained from generic images.
We propose a methodology based on Siamese neural networks in which a series of techniques are integrated to mitigate the effects of data scarcity and distribution imbalance.
arXiv Detail & Related papers (2024-01-18T16:59:27Z)
- ArSDM: Colonoscopy Images Synthesis with Adaptive Refinement Semantic Diffusion Models [69.9178140563928]
Colonoscopy analysis is essential for assisting clinical diagnosis and treatment.
The scarcity of annotated data limits the effectiveness and generalization of existing methods.
We propose an Adaptive Refinement Semantic Diffusion Model (ArSDM) to generate colonoscopy images that benefit the downstream tasks.
arXiv Detail & Related papers (2023-09-03T07:55:46Z)
- Data Augmentation for Seizure Prediction with Generative Diffusion Model [34.12334834099495]
We propose a novel diffusion-based DA method called DiffEEG.
It can fully explore data distribution and generate samples with high diversity.
With the contribution of DiffEEG, the Multi-scale CNN achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-06-14T05:44:53Z)
- Improved Techniques for the Conditional Generative Augmentation of Clinical Audio Data [36.45569352490318]
We propose a conditional generative adversarial neural network-based augmentation method which is able to synthesize mel spectrograms from a learned data distribution.
We show that our method outperforms all classical audio augmentation techniques and previously published generative methods in terms of generated sample quality.
The proposed model advances the state-of-the-art in the augmentation of clinical audio data and improves the data bottleneck for the design of clinical acoustic sensing systems.
arXiv Detail & Related papers (2022-11-05T10:58:04Z)
- SFF-DA: Sptialtemporal Feature Fusion for Detecting Anxiety Nonintrusively [16.170315080992182]
We propose a framework based on "3CNND+LSTM" and fused similarity features of facial behavior and noncontact physiology.
Our framework was validated with a real-world dataset and two public datasets.
arXiv Detail & Related papers (2022-08-12T01:20:51Z)
- Many-to-One Distribution Learning and K-Nearest Neighbor Smoothing for Thoracic Disease Identification [83.6017225363714]
Deep learning has become the most powerful computer-aided diagnosis technology for improving disease identification performance.
For chest X-ray imaging, annotating large-scale data requires professional domain knowledge and is time-consuming.
In this paper, we propose many-to-one distribution learning (MODL) and K-nearest neighbor smoothing (KNNS) methods to improve a single model's disease identification performance.
arXiv Detail & Related papers (2021-02-26T02:29:30Z)
- Statistical control for spatio-temporal MEG/EEG source imaging with desparsified multi-task Lasso [102.84915019938413]
Magnetoencephalography (MEG) and electroencephalography (EEG) are promising non-invasive techniques.
The problem of source localization, or source imaging, however, poses a high-dimensional statistical inference challenge.
We propose an ensemble of desparsified multi-task Lasso (ecd-MTLasso) to deal with this problem.
arXiv Detail & Related papers (2020-09-29T21:17:16Z)
- Rectified Meta-Learning from Noisy Labels for Robust Image-based Plant Disease Diagnosis [64.82680813427054]
Plant diseases are one of the main threats to food security and crop production.
One popular approach is to transform this problem into a leaf image classification task, which can be addressed by powerful convolutional neural networks (CNNs).
We propose a novel framework that incorporates a rectified meta-learning module into the common CNN paradigm to train a noise-robust deep network without using extra supervision information.
arXiv Detail & Related papers (2020-03-17T09:51:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.