STEM Rebalance: A Novel Approach for Tackling Imbalanced Datasets using
SMOTE, Edited Nearest Neighbour, and Mixup
- URL: http://arxiv.org/abs/2311.07504v1
- Date: Mon, 13 Nov 2023 17:45:28 GMT
- Title: STEM Rebalance: A Novel Approach for Tackling Imbalanced Datasets using
SMOTE, Edited Nearest Neighbour, and Mixup
- Authors: Yumnah Hasan, Fatemeh Amerehi, Patrick Healy, Conor Ryan
- Abstract summary: Imbalanced datasets in medical imaging are characterized by skewed class proportions and scarcity of abnormal cases.
This paper investigates the potential of using Mixup augmentation to generate new data points as a generic vicinal distribution.
We focus on the breast cancer problem, where imbalanced datasets are prevalent.
- Score: 0.20482269513546458
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imbalanced datasets in medical imaging are characterized by skewed class
proportions and scarcity of abnormal cases. When trained using such data,
models tend to assign higher probabilities to normal cases, leading to biased
performance. Common oversampling techniques such as SMOTE rely on local
information and can introduce marginalization issues. This paper investigates
the potential of using Mixup augmentation that combines two training examples
along with their corresponding labels to generate new data points as a generic
vicinal distribution. To this end, we propose STEM, which combines SMOTE-ENN
and Mixup at the instance level. This integration enables us to effectively
leverage the entire distribution of minority classes, thereby mitigating both
between-class and within-class imbalances. We focus on the breast cancer
problem, where imbalanced datasets are prevalent. The results demonstrate the
effectiveness of STEM, which achieves AUC values of 0.96 and 0.99 in the
Digital Database for Screening Mammography and Wisconsin Breast Cancer
(Diagnostics) datasets, respectively. Moreover, this method shows promising
potential when applied with an ensemble of machine learning (ML) classifiers.
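As a rough, non-authoritative sketch of the kind of pipeline the abstract describes, the snippet below rebalances a tabular dataset with SMOTE-ENN (via the imbalanced-learn library) and then generates extra vicinal points with instance-level Mixup before fitting a standard classifier. The Beta parameter, random pairing, label rounding, and choice of classifier are illustrative assumptions, not the authors' exact STEM configuration.

```python
# Illustrative sketch only: SMOTE-ENN resampling followed by instance-level
# Mixup on tabular features. Hyperparameters and pairing are assumptions,
# not the exact STEM configuration from the paper.
import numpy as np
from imblearn.combine import SMOTEENN
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def mixup(X, y, alpha=0.4, rng=None):
    """Convex-combine random pairs of samples and their labels (Mixup)."""
    rng = np.random.default_rng(rng)
    lam = rng.beta(alpha, alpha, size=len(X))
    idx = rng.permutation(len(X))
    X_mix = lam[:, None] * X + (1 - lam)[:, None] * X[idx]
    y_mix = lam * y + (1 - lam) * y[idx]          # soft labels in [0, 1]
    return X_mix, y_mix


# The Wisconsin (Diagnostic) Breast Cancer data ships with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 1) Rebalance with SMOTE followed by Edited Nearest Neighbour cleaning.
X_res, y_res = SMOTEENN(random_state=0).fit_resample(X_tr, y_tr)

# 2) Add Mixup-generated vicinal points; round soft labels so a standard
#    hard-label classifier can consume them.
X_mix, y_mix = mixup(X_res, y_res.astype(float), alpha=0.4, rng=0)
X_aug = np.vstack([X_res, X_mix])
y_aug = np.concatenate([y_res, (y_mix > 0.5).astype(int)])

clf = RandomForestClassifier(random_state=0).fit(X_aug, y_aug)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```

Rounding the mixed labels keeps the example compatible with hard-label classifiers; a soft-label loss would preserve the vicinal distribution more faithfully.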
Related papers
- Iterative Online Image Synthesis via Diffusion Model for Imbalanced
Classification [29.730360798234294]
We introduce an Iterative Online Image Synthesis framework to address the class imbalance problem in medical image classification.
Our framework incorporates two key modules, namely Online Image Synthesis (OIS) and Accuracy Adaptive Sampling (AAS).
To evaluate the effectiveness of our proposed method in addressing imbalanced classification, we conduct experiments on the HAM10000 and APTOS datasets.
arXiv Detail & Related papers (2024-03-13T10:51:18Z) - Interpretable Solutions for Breast Cancer Diagnosis with Grammatical
Evolution and Data Augmentation [0.15705429611931054]
We show how a new synthetic data generation technique, STEM, can be used to generate training data for models produced by Grammatical Evolution (GE).
We test our technique on the Digital Database for Screening Mammography (DDSM) and the Wisconsin Breast Cancer (WBC) datasets.
We demonstrate that the GE-derived models present the best AUC while still maintaining interpretable solutions.
arXiv Detail & Related papers (2024-01-25T15:45:28Z) - Few-shot learning for COVID-19 Chest X-Ray Classification with
Imbalanced Data: An Inter vs. Intra Domain Study [49.5374512525016]
Medical image datasets are essential for training models used in computer-aided diagnosis, treatment planning, and medical research.
Some challenges are associated with these datasets, including variability in data distribution, data scarcity, and transfer learning issues when using models pre-trained from generic images.
We propose a methodology based on Siamese neural networks in which a series of techniques are integrated to mitigate the effects of data scarcity and distribution imbalance.
arXiv Detail & Related papers (2024-01-18T16:59:27Z) - MCRAGE: Synthetic Healthcare Data for Fairness [3.0089659534785853]
We propose Minority Class Rebalancing through Augmentation by Generative modeling (MCRAGE) to augment imbalanced datasets.
MCRAGE involves training a Conditional Denoising Diffusion Probabilistic Model (CDDPM) capable of generating high-quality synthetic EHR samples from underrepresented classes.
We use this synthetic data to augment the existing imbalanced dataset, resulting in a more balanced distribution across all classes.
arXiv Detail & Related papers (2023-10-27T19:02:22Z) - Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
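As a heavily simplified illustration of the minority-majority mixing idea summarized above (not the cited paper's actual algorithm), the helper below convexly combines minority and majority samples and records the mixing weight as a soft label; the Beta parameter and the bias toward the minority side are assumptions.

```python
# Simplified illustration of mixing minority and majority samples; the pairing
# and weighting below are assumptions, not the algorithm from the cited paper.
import numpy as np


def mix_minority_majority(X_min, X_maj, n_new, alpha=2.0, rng=0):
    rng = np.random.default_rng(rng)
    lam = rng.beta(alpha, alpha, size=n_new)
    lam = np.maximum(lam, 1 - lam)                # keep mixes closer to the minority side
    i = rng.integers(0, len(X_min), size=n_new)
    j = rng.integers(0, len(X_maj), size=n_new)
    X_new = lam[:, None] * X_min[i] + (1 - lam)[:, None] * X_maj[j]
    y_new = lam                                   # soft label: degree of "minority-ness"
    return X_new, y_new
```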
arXiv Detail & Related papers (2023-08-28T18:48:34Z) - Class-Balancing Diffusion Models [57.38599989220613]
Class-Balancing Diffusion Models (CBDM) are trained with a distribution adjustment regularizer as a solution.
We benchmark the generation results on the CIFAR100/CIFAR100LT datasets and show outstanding performance on the downstream recognition task.
arXiv Detail & Related papers (2023-04-30T20:00:14Z) - SC-MIL: Supervised Contrastive Multiple Instance Learning for Imbalanced
Classification in Pathology [2.854576370929018]
Machine learning problems in medical imaging often deal with rare diseases.
In pathology images, there is another level of imbalance, where given a positively labeled Whole Slide Image (WSI), only a fraction of pixels within it contribute to the positive label.
We propose a joint-training MIL framework in the presence of label imbalance that progressively transitions from learning bag-level representations to optimal classifier learning.
arXiv Detail & Related papers (2023-03-23T16:28:15Z) - Effective Class-Imbalance learning based on SMOTE and Convolutional
Neural Networks [0.1074267520911262]
Imbalanced Data (ID) is a problem that prevents Machine Learning (ML) models from achieving satisfactory results.
In this paper, we investigate the effectiveness of methods based on Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs).
In order to achieve reliable results, we conducted our experiments 100 times with randomly shuffled data distributions.
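To make that evaluation protocol concrete, here is a minimal sketch of SMOTE oversampling evaluated over repeated, randomly shuffled splits; the dataset and the logistic-regression stand-in for the paper's DNN/CNN models are assumptions for illustration only.

```python
# Minimal sketch: SMOTE oversampling evaluated over repeated shuffled splits.
# A logistic regression stands in for the DNN/CNN models used in the paper.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit

X, y = load_breast_cancer(return_X_y=True)
splitter = StratifiedShuffleSplit(n_splits=100, test_size=0.3, random_state=0)

aucs = []
for train_idx, test_idx in splitter.split(X, y):
    # Oversample only the training portion, then evaluate on the untouched split.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X[train_idx], y[train_idx])
    clf = LogisticRegression(max_iter=5000).fit(X_res, y_res)
    aucs.append(roc_auc_score(y[test_idx], clf.predict_proba(X[test_idx])[:, 1]))

print(f"mean AUC over {len(aucs)} shuffles: {np.mean(aucs):.3f} +/- {np.std(aucs):.3f}")
```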
arXiv Detail & Related papers (2022-09-01T07:42:16Z) - Density-Aware Personalized Training for Risk Prediction in Imbalanced
Medical Data [89.79617468457393]
Training models on data with a high imbalance rate (class density discrepancy) may lead to suboptimal predictions.
We propose a density-aware framework for training models under this imbalance issue.
We demonstrate our model's improved performance in real-world medical datasets.
arXiv Detail & Related papers (2022-07-23T00:39:53Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z) - Statistical control for spatio-temporal MEG/EEG source imaging with
desparsified multi-task Lasso [102.84915019938413]
Non-invasive techniques like magnetoencephalography (MEG) or electroencephalography (EEG) promise to reveal when and where brain activity occurs.
The problem of source localization, or source imaging, poses however a high-dimensional statistical inference challenge.
We propose an ensemble of desparsified multi-task Lasso (ecd-MTLasso) to deal with this problem.
arXiv Detail & Related papers (2020-09-29T21:17:16Z)