An Empirical Analysis of the Efficacy of Different Sampling Techniques
for Imbalanced Classification
- URL: http://arxiv.org/abs/2208.11852v1
- Date: Thu, 25 Aug 2022 03:45:34 GMT
- Title: An Empirical Analysis of the Efficacy of Different Sampling Techniques
for Imbalanced Classification
- Authors: Asif Newaz, Shahriar Hassan, Farhan Shahriyar Haq
- Abstract summary: The prevalence of imbalance in real-world datasets has led to the creation of a multitude of strategies for the class imbalance issue.
Standard classification algorithms tend to perform poorly when trained on imbalanced data.
We present a comprehensive analysis of 26 popular sampling techniques to understand their effectiveness in dealing with imbalanced data.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Learning from imbalanced data is a challenging task. Standard classification
algorithms tend to perform poorly when trained on imbalanced data. Some special
strategies need to be adopted, either by modifying the data distribution or by
redesigning the underlying classification algorithm to achieve desirable
performance. The prevalence of imbalance in real-world datasets has led to the
creation of a multitude of strategies for the class imbalance issue. However,
not all the strategies are useful or provide good performance in different
imbalance scenarios. There are numerous approaches to dealing with imbalanced
data, but a rigorous experimental comparison of their relative efficacy has
not been conducted. In this study, we present a
comprehensive analysis of 26 popular sampling techniques to understand their
effectiveness in dealing with imbalanced data. Rigorous experiments have been
conducted on 50 datasets with different degrees of imbalance to thoroughly
investigate the performance of these techniques. A detailed discussion of the
advantages and limitations of the techniques, as well as how to overcome such
limitations, has been presented. We identify some critical factors that affect
the sampling strategies and provide recommendations on how to choose an
appropriate sampling technique for a particular application.
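SMOTE-style interpolation sits at the core of many of the oversampling techniques such a comparison covers. The sketch below is a minimal, illustrative implementation of that interpolation step, not the study's code; the function name `smote_like`, the fixed seed, and the default `k=3` are our own choices, and NumPy is assumed to be available.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    a randomly chosen minority sample and one of its k nearest minority
    neighbours -- the core step of SMOTE-style oversampling."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from the chosen sample to all minority samples.
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]  # skip the sample itself
        j = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because every synthetic point is a convex combination of two existing minority samples, the generated data stays inside the minority class's convex hull, which is both SMOTE's strength and, as the paper's discussion of limitations suggests, a potential weakness near class boundaries.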
Related papers
- Balancing the Scales: A Comprehensive Study on Tackling Class Imbalance in Binary Classification [0.8287206589886881]
This study comprehensively evaluates three widely-used strategies for handling class imbalance.
We compare these methods against a baseline scenario of no-intervention across 15 diverse machine learning models.
Our results indicate that all three strategies generally outperform the baseline, with Decision Threshold emerging as the most consistently effective technique.
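The decision-threshold strategy referenced above can be sketched without any library support: fit a probabilistic classifier as usual, then pick the cut-off that maximizes a minority-friendly metric such as F1 on held-out scores. The helper names below (`f1_at_threshold`, `best_threshold`) are illustrative, not taken from the study.

```python
def f1_at_threshold(y_true, scores, t):
    """F1 of the positive (minority) class when predicting 1 iff score >= t."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < t)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(y_true, scores):
    """Scan every observed score as a candidate cut-off and keep the best."""
    return max(sorted(set(scores)),
               key=lambda t: f1_at_threshold(y_true, scores, t))
```

On imbalanced data a model may assign every minority example a score below the default 0.5 cut-off, yielding an F1 of zero; scanning candidate thresholds on validation scores recovers those predictions without touching the training data at all.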
arXiv Detail & Related papers (2024-09-29T16:02:32Z)
- Few-shot learning for COVID-19 Chest X-Ray Classification with Imbalanced Data: An Inter vs. Intra Domain Study [49.5374512525016]
Medical image datasets are essential for training models used in computer-aided diagnosis, treatment planning, and medical research.
Some challenges are associated with these datasets, including variability in data distribution, data scarcity, and transfer learning issues when using models pre-trained from generic images.
We propose a methodology based on Siamese neural networks in which a series of techniques are integrated to mitigate the effects of data scarcity and distribution imbalance.
arXiv Detail & Related papers (2024-01-18T16:59:27Z)
- A Unified Generalization Analysis of Re-Weighting and Logit-Adjustment for Imbalanced Learning [129.63326990812234]
We propose a technique named data-dependent contraction to capture how modified losses handle different classes.
On top of this technique, a fine-grained generalization bound is established for imbalanced learning, which helps reveal the mystery of re-weighting and logit-adjustment.
arXiv Detail & Related papers (2023-10-07T09:15:08Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
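The summary above does not specify that paper's exact mixing scheme, so the following is only a generic mixup-style interpolation between one minority and one majority sample; the function name `mix_samples` and the weight range biased toward the minority point are our own assumptions.

```python
import numpy as np

def mix_samples(x_minority, x_majority, rng, low=0.5, high=1.0):
    """Mixup-style synthesis: a convex combination of a minority and a
    majority sample.  Drawing the weight from [low, high) with low >= 0.5
    keeps the synthetic point closer to the minority sample."""
    lam = rng.uniform(low, high)
    x_minority = np.asarray(x_minority, dtype=float)
    x_majority = np.asarray(x_majority, dtype=float)
    return lam * x_minority + (1.0 - lam) * x_majority
```

Unlike SMOTE-style interpolation, which stays inside the minority class, mixing across classes places synthetic points between the two classes, near the decision boundary.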
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
- Systematic Evaluation of Predictive Fairness [60.0947291284978]
Mitigating bias in training on biased datasets is an important open problem.
We examine the performance of various debiasing methods across multiple tasks.
We find that data conditions have a strong influence on relative model performance.
arXiv Detail & Related papers (2022-10-17T05:40:13Z)
- A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework [12.856833690265985]
Class imbalance poses new challenges when it comes to classifying data streams.
Many algorithms recently proposed in the literature tackle this problem using a variety of data-level, algorithm-level, and ensemble approaches.
This work proposes a standardized, exhaustive, and comprehensive experimental framework to evaluate algorithms.
arXiv Detail & Related papers (2022-04-07T20:13:55Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
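The two families mentioned above can be illustrated with plain-stdlib resampling. The helper below (our own naming, not code from any of these papers) oversamples every class up to the majority count; undersampling is the mirror image, randomly dropping majority examples down to the minority count.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen examples of each minority class until
    every class matches the majority-class count."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, n in counts.items():
        pool = [i for i, label in enumerate(y) if label == cls]
        for _ in range(target - n):
            i = rng.choice(pool)
            X_out.append(X[i])
            y_out.append(cls)
    return X_out, y_out
```

Random oversampling risks overfitting to the duplicated examples, while random undersampling discards potentially informative majority data; this trade-off is precisely why interpolation-based methods and the dataset-dependent recommendations studied in these papers exist.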
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Imbalanced data preprocessing techniques utilizing local data characteristics [2.28438857884398]
Data imbalance is a disproportion in the number of training observations available for each class.
The focus of this thesis is the development of novel data resampling strategies.
arXiv Detail & Related papers (2021-11-28T11:48:26Z)
- Influence-Balanced Loss for Imbalanced Visual Classification [9.958715010698157]
We derive a new loss used in the balancing training phase that alleviates the influence of samples that cause an overfitted decision boundary.
In experiments on multiple benchmark data sets, we demonstrate the validity of our method and reveal that the proposed loss outperforms the state-of-the-art cost-sensitive loss methods.
arXiv Detail & Related papers (2021-10-06T01:12:40Z)
- Handling Imbalanced Data: A Case Study for Binary Class Problems [0.0]
A major issue in solving classification problems is imbalanced data.
This paper focuses on synthetic oversampling techniques and manually computes synthetic data points to make the algorithms easy to comprehend.
We analyze the application of these synthetic oversampling techniques to binary classification problems with different imbalance ratios and sample sizes.
arXiv Detail & Related papers (2020-10-09T02:04:14Z)
- Heteroskedastic and Imbalanced Deep Learning with Adaptive Regularization [55.278153228758434]
Real-world datasets are heteroskedastic and imbalanced.
Addressing heteroskedasticity and imbalance simultaneously is under-explored.
We propose a data-dependent regularization technique for heteroskedastic datasets.
arXiv Detail & Related papers (2020-06-29T01:09:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.