Related papers: Data Balancing Strategies: A Survey of Resampling and Augmentation Methods

URL: http://arxiv.org/abs/2505.13518v1
Date: Sat, 17 May 2025 12:15:28 GMT
Title: Data Balancing Strategies: A Survey of Resampling and Augmentation Methods
Authors: Behnam Yousefimehr, Mehdi Ghatee, Mohammad Amin Seifi, Javad Fazli, Sajed Tavakoli, Zahra Rafei, Shervin Ghaffari, Abolfazl Nikahd, Mahdi Razi Gandomani, Alireza Orouji, Ramtin Mahmoudi Kashani, Sarina Heshmati, Negin Sadat Mousavi,
Abstract summary: Imbalanced data poses a significant obstacle in machine learning. Various resampling strategies have been developed to mitigate this problem. This paper reviews a broad spectrum of data balancing methods, classifying them into categories including synthetic oversampling, adaptive techniques, generative models, ensemble-based strategies, hybrid approaches, undersampling, and neighbor-based methods.
Score: 2.257105075408102
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Imbalanced data poses a significant obstacle in machine learning, as an unequal distribution of class labels often results in skewed predictions and diminished model accuracy. To mitigate this problem, various resampling strategies have been developed, encompassing both oversampling and undersampling techniques aimed at modifying class proportions. Conventional oversampling approaches like SMOTE enhance the representation of the minority class, whereas undersampling methods focus on trimming down the majority class. Advances in deep learning have facilitated the creation of more complex solutions, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), which are capable of producing high-quality synthetic examples. This paper reviews a broad spectrum of data balancing methods, classifying them into categories including synthetic oversampling, adaptive techniques, generative models, ensemble-based strategies, hybrid approaches, undersampling, and neighbor-based methods. Furthermore, it highlights current developments in resampling techniques and discusses practical implementations and case studies that validate their effectiveness. The paper concludes by offering perspectives on potential directions for future exploration in this domain.
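The two basic ideas named in the abstract, synthetic oversampling of the minority class and undersampling of the majority class, can be illustrated with a small sketch. This is not the survey's code and not the reference SMOTE implementation: it is a minimal SMOTE-style interpolation written in plain Python, and the function names (`smote_like_oversample`, `random_undersample`), the toy data, and the parameter `k` are assumptions chosen for illustration.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    point and one of its k nearest minority neighbours (SMOTE-style)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of the base point among the other minority points
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


def random_undersample(majority, n_keep, seed=0):
    """Keep a random subset of the majority class (sampling without replacement)."""
    rng = random.Random(seed)
    return rng.sample(majority, n_keep)


# Hypothetical toy data: 20 majority points and 4 minority points in 2-D.
majority = [(float(x), float(y)) for x in range(10) for y in range(2)]
minority = [(20.0, 20.0), (21.0, 20.5), (20.5, 21.0), (22.0, 22.0)]

# Oversampling: grow the minority class until it matches the majority size.
balanced_up = minority + smote_like_oversample(minority, len(majority) - len(minority))

# Undersampling: shrink the majority class down to the minority size.
balanced_down = random_undersample(majority, len(minority))

print(len(balanced_up), len(balanced_down))
```

Because every synthetic point lies on a segment between two existing minority points, oversampling of this kind fills in the region the minority class already occupies rather than extending it. Undersampling, by contrast, discards majority examples and may lose information.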

Related papers

CART-based Synthetic Tabular Data Generation for Imbalanced Regression [1.342834401139078]
We propose adapting an existing CART-based synthetic data generation method, tailoring it for imbalanced regression. The new method integrates relevance and density-based mechanisms to guide sampling in sparse regions of the target space. Our experimental study focuses on the prediction of extreme target values across benchmark datasets.
arXiv Detail & Related papers (2025-06-03T12:42:20Z)
Kernel-Based Enhanced Oversampling Method for Imbalanced Classification [10.112750055561877]
This paper introduces a novel oversampling technique designed to improve classification performance on imbalanced datasets. The proposed method enhances the traditional SMOTE algorithm by incorporating convex combination and kernel-based weighting to generate synthetic samples that better represent the minority class.
arXiv Detail & Related papers (2025-04-12T09:24:23Z)
Deep Learning Meets Oversampling: A Learning Framework to Handle Imbalanced Classification [0.0]
We propose a novel learning framework that can generate synthetic data instances in a data-driven manner. The proposed framework formulates the oversampling process as a composition of discrete decision criteria. Experiments on the imbalanced classification task demonstrate the superiority of our framework over state-of-the-art algorithms.
arXiv Detail & Related papers (2025-02-08T13:35:00Z)
Synthetic Feature Augmentation Improves Generalization Performance of Language Models [8.463273762997398]
Training and fine-tuning deep learning models on limited and imbalanced datasets poses substantial challenges. We propose augmenting features in the embedding space by generating synthetic samples using a range of techniques. We validate the effectiveness of this approach across multiple open-source text classification benchmarks.
arXiv Detail & Related papers (2025-01-11T04:31:18Z)
Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers. We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes. We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
Consistency Regularization for Generalizable Source-free Domain Adaptation [62.654883736925456]
Source-free domain adaptation (SFDA) aims to adapt a well-trained source model to an unlabelled target domain without accessing the source dataset. Existing SFDA methods only assess their adapted models on the target training set, neglecting the data from unseen but identically distributed testing sets. We propose a consistency regularization framework to develop a more generalizable SFDA method.
arXiv Detail & Related papers (2023-08-03T07:45:53Z)
Intra-class Adaptive Augmentation with Neighbor Correction for Deep Metric Learning [99.14132861655223]
We propose a novel intra-class adaptive augmentation (IAA) framework for deep metric learning. We reasonably estimate intra-class variations for every class and generate adaptive synthetic samples to support hard sample mining. Our method significantly outperforms state-of-the-art methods on retrieval performance by 3%-6%.
arXiv Detail & Related papers (2022-11-29T14:52:38Z)
Towards Automated Imbalanced Learning with Deep Hierarchical Reinforcement Learning [57.163525407022966]
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class. Over-sampling is an effective technique to tackle imbalanced learning through generating synthetic samples for the minority class. We propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions.
arXiv Detail & Related papers (2022-08-26T04:28:01Z)
A Hybrid Approach for Binary Classification of Imbalanced Data [0.0]
We propose HADR, a hybrid approach with dimension reduction that consists of data block construction, dimensionality reduction, and ensemble learning. We evaluate the performance on eight imbalanced public datasets in terms of recall, G-mean, and AUC.
arXiv Detail & Related papers (2022-07-06T15:18:41Z)
CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE). At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales. We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z)
Ensemble Wrapper Subsampling for Deep Modulation Classification [70.91089216571035]
Subsampling of received wireless signals is important for relaxing hardware requirements as well as the computational cost of signal processing algorithms. We propose a subsampling technique to facilitate the use of deep learning for automatic modulation classification in wireless communication systems.
arXiv Detail & Related papers (2020-05-10T06:11:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.