STEM Rebalance: A Novel Approach for Tackling Imbalanced Datasets using
SMOTE, Edited Nearest Neighbour, and Mixup
- URL: http://arxiv.org/abs/2311.07504v1
- Date: Mon, 13 Nov 2023 17:45:28 GMT
- Title: STEM Rebalance: A Novel Approach for Tackling Imbalanced Datasets using
SMOTE, Edited Nearest Neighbour, and Mixup
- Authors: Yumnah Hasan, Fatemeh Amerehi, Patrick Healy, Conor Ryan
- Abstract summary: Imbalanced datasets in medical imaging are characterized by skewed class proportions and scarcity of abnormal cases.
This paper investigates the potential of using Mixup augmentation to generate new data points as a generic vicinal distribution.
We focus on the breast cancer problem, where imbalanced datasets are prevalent.
- Score: 0.20482269513546458
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imbalanced datasets in medical imaging are characterized by skewed class
proportions and scarcity of abnormal cases. When trained using such data,
models tend to assign higher probabilities to normal cases, leading to biased
performance. Common oversampling techniques such as SMOTE rely on local
information and can introduce marginalization issues. This paper investigates
the potential of using Mixup augmentation that combines two training examples
along with their corresponding labels to generate new data points as a generic
vicinal distribution. To this end, we propose STEM, which combines SMOTE-ENN
and Mixup at the instance level. This integration enables us to effectively
leverage the entire distribution of minority classes, thereby mitigating both
between-class and within-class imbalances. We focus on the breast cancer
problem, where imbalanced datasets are prevalent. The results demonstrate the
effectiveness of STEM, which achieves AUC values of 0.96 and 0.99 in the
Digital Database for Screening Mammography and Wisconsin Breast Cancer
(Diagnostics) datasets, respectively. Moreover, this method shows promising
potential when applied with an ensemble of machine learning (ML) classifiers.
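As a rough, non-authoritative sketch of the kind of pipeline the abstract describes, the snippet below rebalances a tabular dataset with SMOTE-ENN (via the imbalanced-learn library) and then generates extra vicinal points with instance-level Mixup before fitting a standard classifier. The Beta parameter, random pairing, label rounding, and choice of classifier are illustrative assumptions, not the authors' exact STEM configuration.

```python
# Illustrative sketch only: SMOTE-ENN resampling followed by instance-level
# Mixup on tabular features. Hyperparameters and pairing are assumptions,
# not the exact STEM configuration from the paper.
import numpy as np
from imblearn.combine import SMOTEENN
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def mixup(X, y, alpha=0.4, rng=None):
    """Convex-combine random pairs of samples and their labels (Mixup)."""
    rng = np.random.default_rng(rng)
    lam = rng.beta(alpha, alpha, size=len(X))
    idx = rng.permutation(len(X))
    X_mix = lam[:, None] * X + (1 - lam)[:, None] * X[idx]
    y_mix = lam * y + (1 - lam) * y[idx]          # soft labels in [0, 1]
    return X_mix, y_mix


# The Wisconsin (Diagnostic) Breast Cancer data ships with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 1) Rebalance with SMOTE followed by Edited Nearest Neighbour cleaning.
X_res, y_res = SMOTEENN(random_state=0).fit_resample(X_tr, y_tr)

# 2) Add Mixup-generated vicinal points; round soft labels so a standard
#    hard-label classifier can consume them.
X_mix, y_mix = mixup(X_res, y_res.astype(float), alpha=0.4, rng=0)
X_aug = np.vstack([X_res, X_mix])
y_aug = np.concatenate([y_res, (y_mix > 0.5).astype(int)])

clf = RandomForestClassifier(random_state=0).fit(X_aug, y_aug)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```

Rounding the mixed labels keeps the example compatible with hard-label classifiers; a soft-label loss would preserve the vicinal distribution more faithfully.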
Related papers
- Iterative Online Image Synthesis via Diffusion Model for Imbalanced
Classification [29.730360798234294]
We introduce an Iterative Online Image Synthesis framework to address the class imbalance problem in medical image classification.
Our framework incorporates two key modules, namely Online Image Synthesis (OIS) and Accuracy Adaptive Sampling (AAS).
To evaluate the effectiveness of our proposed method in addressing imbalanced classification, we conduct experiments on the HAM10000 and APTOS datasets.
arXiv Detail & Related papers (2024-03-13T10:51:18Z) - Interpretable Solutions for Breast Cancer Diagnosis with Grammatical
Evolution and Data Augmentation [0.15705429611931054]
We show how a new synthetic data generation technique, STEM, can be used to generate training data for models produced by Grammatical Evolution (GE).
We test our technique on the Digital Database for Screening Mammography (DDSM) and the Wisconsin Breast Cancer (WBC) datasets.
We demonstrate that the GE-derived models present the best AUC while still maintaining interpretable solutions.
arXiv Detail & Related papers (2024-01-25T15:45:28Z) - Few-shot learning for COVID-19 Chest X-Ray Classification with
Imbalanced Data: An Inter vs. Intra Domain Study [49.5374512525016]
Medical image datasets are essential for training models used in computer-aided diagnosis, treatment planning, and medical research.
Some challenges are associated with these datasets, including variability in data distribution, data scarcity, and transfer learning issues when using models pre-trained from generic images.
We propose a methodology based on Siamese neural networks in which a series of techniques are integrated to mitigate the effects of data scarcity and distribution imbalance.
arXiv Detail & Related papers (2024-01-18T16:59:27Z) - MCRAGE: Synthetic Healthcare Data for Fairness [3.0089659534785853]
We propose Minority Class Rebalancing through Augmentation by Generative modeling (MCRAGE) to augment imbalanced datasets.
MCRAGE involves training a Conditional Denoising Diffusion Probabilistic Model (CDDPM) capable of generating high-quality synthetic EHR samples from underrepresented classes.
We use this synthetic data to augment the existing imbalanced dataset, resulting in a more balanced distribution across all classes.
arXiv Detail & Related papers (2023-10-27T19:02:22Z) - Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
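As a heavily simplified illustration of the minority-majority mixing idea summarized above (not the cited paper's actual algorithm), the helper below convexly combines minority and majority samples and records the mixing weight as a soft label; the Beta parameter and the bias toward the minority side are assumptions.

```python
# Simplified illustration of mixing minority and majority samples; the pairing
# and weighting below are assumptions, not the algorithm from the cited paper.
import numpy as np


def mix_minority_majority(X_min, X_maj, n_new, alpha=2.0, rng=0):
    rng = np.random.default_rng(rng)
    lam = rng.beta(alpha, alpha, size=n_new)
    lam = np.maximum(lam, 1 - lam)                # keep mixes closer to the minority side
    i = rng.integers(0, len(X_min), size=n_new)
    j = rng.integers(0, len(X_maj), size=n_new)
    X_new = lam[:, None] * X_min[i] + (1 - lam)[:, None] * X_maj[j]
    y_new = lam                                   # soft label: degree of "minority-ness"
    return X_new, y_new
```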
arXiv Detail & Related papers (2023-08-28T18:48:34Z) - Class-Balancing Diffusion Models [57.38599989220613]
Class-Balancing Diffusion Models (CBDM) are trained with a distribution adjustment regularizer as a solution.
We benchmark the generation results on the CIFAR100/CIFAR100LT datasets and show outstanding performance on the downstream recognition task.
arXiv Detail & Related papers (2023-04-30T20:00:14Z) - SC-MIL: Supervised Contrastive Multiple Instance Learning for Imbalanced
Classification in Pathology [2.854576370929018]
Machine learning problems in medical imaging often deal with rare diseases.
In pathology images, there is another level of imbalance, where given a positively labeled Whole Slide Image (WSI), only a fraction of pixels within it contribute to the positive label.
We propose a joint-training MIL framework in the presence of label imbalance that progressively transitions from learning bag-level representations to optimal classifier learning.
arXiv Detail & Related papers (2023-03-23T16:28:15Z) - Effective Class-Imbalance learning based on SMOTE and Convolutional
Neural Networks [0.1074267520911262]
Imbalanced Data (ID) is a problem that prevents Machine Learning (ML) models from achieving satisfactory results.
In this paper, we investigate the effectiveness of methods based on Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs).
In order to achieve reliable results, we conducted our experiments 100 times with randomly shuffled data distributions.
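To make that evaluation protocol concrete, here is a minimal sketch of SMOTE oversampling evaluated over repeated, randomly shuffled splits; the dataset and the logistic-regression stand-in for the paper's DNN/CNN models are assumptions for illustration only.

```python
# Minimal sketch: SMOTE oversampling evaluated over repeated shuffled splits.
# A logistic regression stands in for the DNN/CNN models used in the paper.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit

X, y = load_breast_cancer(return_X_y=True)
splitter = StratifiedShuffleSplit(n_splits=100, test_size=0.3, random_state=0)

aucs = []
for train_idx, test_idx in splitter.split(X, y):
    # Oversample only the training portion, then evaluate on the untouched split.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X[train_idx], y[train_idx])
    clf = LogisticRegression(max_iter=5000).fit(X_res, y_res)
    aucs.append(roc_auc_score(y[test_idx], clf.predict_proba(X[test_idx])[:, 1]))

print(f"mean AUC over {len(aucs)} shuffles: {np.mean(aucs):.3f} +/- {np.std(aucs):.3f}")
```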
arXiv Detail & Related papers (2022-09-01T07:42:16Z) - Density-Aware Personalized Training for Risk Prediction in Imbalanced
Medical Data [89.79617468457393]
Training models on data with a high imbalance rate (class density discrepancy) may lead to suboptimal predictions.
We propose a density-aware framework for training models under this imbalance issue.
We demonstrate our model's improved performance in real-world medical datasets.
arXiv Detail & Related papers (2022-07-23T00:39:53Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z) - Statistical control for spatio-temporal MEG/EEG source imaging with
desparsified multi-task Lasso [102.84915019938413]
Non-invasive techniques like magnetoencephalography (MEG) or electroencephalography (EEG) promise to reveal when and where brain activity occurs.
The problem of source localization, or source imaging, poses however a high-dimensional statistical inference challenge.
We propose an ensemble of desparsified multi-task Lasso (ecd-MTLasso) to deal with this problem.
arXiv Detail & Related papers (2020-09-29T21:17:16Z)