Efficient Hybrid Oversampling and Intelligent Undersampling for
Imbalanced Big Data Classification
- URL: http://arxiv.org/abs/2310.05789v1
- Date: Mon, 9 Oct 2023 15:22:13 GMT
- Authors: Carla Vairetti, José Luis Assadi, Sebastián Maldonado
- Abstract summary: We present a novel resampling method called SMOTENN that combines intelligent undersampling and oversampling using a MapReduce framework.
Our experimental results show the virtues of this approach, outperforming alternative resampling techniques for small- and medium-sized datasets.
- Score: 1.03590082373586
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Imbalanced classification is a well-known challenge faced by many real-world
applications. This issue occurs when the distribution of the target variable is
skewed, leading to a prediction bias toward the majority class. With the
arrival of the Big Data era, there is a pressing need for efficient solutions
to solve this problem. In this work, we present a novel resampling method
called SMOTENN that combines intelligent undersampling and oversampling using a
MapReduce framework. Both procedures are performed on the same pass over the
data, conferring efficiency to the technique. The SMOTENN method is
complemented with an efficient implementation of the neighborhoods related to
the minority samples. Our experimental results show the virtues of this
approach, outperforming alternative resampling techniques for small- and
medium-sized datasets while achieving positive results on large datasets with
reduced running times.
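The abstract describes combining SMOTE-style oversampling with neighbourhood-based undersampling in a single pass. Below is a minimal single-machine sketch of that idea; it is not the authors' MapReduce SMOTENN, and the specific edit rule for dropping majority samples (drop a majority point whose nearest minority neighbour is closer than its nearest majority neighbour) is an illustrative assumption, as are the function name and binary 0/1 labels.

```python
import numpy as np

def smote_enn_sketch(X, y, minority=1, k=3, rng=None):
    """Illustrative single-pass sketch: SMOTE-style oversampling of the
    minority class plus a simple nearest-neighbour edit of the majority
    class. Assumes binary labels {0, 1}; not the paper's MapReduce method."""
    rng = np.random.default_rng(rng)
    X_min = X[y == minority]
    X_maj = X[y != minority]

    # Oversampling: interpolate each minority point toward a randomly
    # chosen one of its k nearest minority neighbours (SMOTE).
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]          # column 0 is the point itself
    picks = nn[np.arange(len(X_min)), rng.integers(0, k, len(X_min))]
    gaps = rng.random((len(X_min), 1))
    X_syn = X_min + gaps * (X_min[picks] - X_min)

    # Undersampling (assumed edit rule): drop majority points that sit
    # closer to the minority class than to their own class.
    d_to_min = np.linalg.norm(X_maj[:, None] - X_min[None, :], axis=-1).min(axis=1)
    d_maj_all = np.linalg.norm(X_maj[:, None] - X_maj[None, :], axis=-1)
    np.fill_diagonal(d_maj_all, np.inf)
    keep = d_maj_all.min(axis=1) <= d_to_min
    X_maj_kept = X_maj[keep]

    X_out = np.vstack([X_min, X_syn, X_maj_kept])
    y_out = np.concatenate([
        np.full(len(X_min) + len(X_syn), minority),
        np.full(len(X_maj_kept), 1 - minority),
    ])
    return X_out, y_out
```

Both the synthetic-sample generation and the majority edit reuse the same neighbourhood distances, mirroring the single-pass motivation of the paper; a distributed version would compute these neighbourhoods per MapReduce partition.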
Related papers
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
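The mixing step described in this entry can be sketched as a convex combination of a minority and a majority sample; the bias toward the minority side (`lam_low`) and the function name are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def mix_toward_minority(x_min, x_maj, lam_low=0.5, rng=None):
    """Illustrative sketch: convex mix of a minority and a majority
    sample, weighted toward the minority side (lam >= lam_low)."""
    rng = np.random.default_rng(rng)
    lam = rng.uniform(lam_low, 1.0)
    return lam * x_min + (1 - lam) * x_maj
```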
arXiv Detail & Related papers (2023-08-28T18:48:34Z) - Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z) - BSGAN: A Novel Oversampling Technique for Imbalanced Pattern
Recognitions [0.0]
Class imbalance problems (CIP) are one of the major challenges in developing unbiased Machine Learning (ML) models for prediction.
CIP occurs when data samples are not equally distributed across two or more classes.
We propose a hybrid oversampling technique that combines the power of borderline SMOTE and a Generative Adversarial Network to generate more diverse data.
arXiv Detail & Related papers (2023-05-16T20:02:39Z) - Generative Oversampling for Imbalanced Data via Majority-Guided VAE [15.93867386081279]
We propose a novel over-sampling model, called Majority-Guided VAE(MGVAE), which generates new minority samples under the guidance of a majority-based prior.
In this way, the newly generated minority samples can inherit the diversity and richness of the majority ones, thus mitigating overfitting in downstream tasks.
arXiv Detail & Related papers (2023-02-14T06:35:23Z) - ScoreMix: A Scalable Augmentation Strategy for Training GANs with
Limited Data [93.06336507035486]
Generative Adversarial Networks (GANs) typically suffer from overfitting when limited training data is available.
We present ScoreMix, a novel and scalable data augmentation approach for various image synthesis tasks.
arXiv Detail & Related papers (2022-10-27T02:55:15Z) - Bias Mimicking: A Simple Sampling Approach for Bias Mitigation [57.17709477668213]
We introduce a new class-conditioned sampling method: Bias Mimicking.
Bias Mimicking improves the accuracy of sampling methods on underrepresented groups by 3% over four benchmarks.
arXiv Detail & Related papers (2022-09-30T17:33:00Z) - A Novel Hybrid Sampling Framework for Imbalanced Learning [0.0]
"SMOTE-RUS-NC" has been compared with other state-of-the-art sampling techniques.
Rigorous experimentation has been conducted on 26 imbalanced datasets.
arXiv Detail & Related papers (2022-08-20T07:04:00Z) - Distributed Dynamic Safe Screening Algorithms for Sparse Regularization [73.85961005970222]
We propose a new distributed dynamic safe screening (DDSS) method for sparsity regularized models and apply it on shared-memory and distributed-memory architecture respectively.
We prove that the proposed method achieves the linear convergence rate with lower overall complexity and can eliminate almost all the inactive features in a finite number of iterations almost surely.
arXiv Detail & Related papers (2022-04-23T02:45:55Z) - Stop Oversampling for Class Imbalance Learning: A Critical Review [0.9208007322096533]
Oversampling has been employed to overcome the challenge of learning from imbalanced datasets.
The fundamental difficulty with oversampling approaches is that, given a real-life population, the synthesized samples may not truly belong to the minority class.
We devised a new oversampling evaluation system based on hiding a number of majority examples and comparing them to those generated by the oversampling process.
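The hide-and-compare evaluation described here can be sketched as a nearest-neighbour distance between held-out real samples and the synthesized ones; the concrete metric (mean nearest-synthetic distance) is an assumption for illustration, not necessarily the paper's exact scoring rule.

```python
import numpy as np

def holdout_similarity(X_heldout, X_synth):
    """Illustrative metric: for each hidden real sample, the distance to
    its nearest synthetic sample; a lower mean suggests the oversampler
    better covers the hidden real points."""
    d = np.linalg.norm(X_heldout[:, None] - X_synth[None, :], axis=-1)
    return d.min(axis=1).mean()
```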
arXiv Detail & Related papers (2022-02-04T15:11:11Z) - Compressing Large Sample Data for Discriminant Analysis [78.12073412066698]
We consider the computational issues due to large sample size within the discriminant analysis framework.
We propose a new compression approach for reducing the number of training samples for linear and quadratic discriminant analysis.
arXiv Detail & Related papers (2020-05-08T05:09:08Z) - UGRWO-Sampling for COVID-19 dataset: A modified random walk
under-sampling approach based on graphs to imbalanced data classification [2.15242029196761]
This paper proposes a new RWO-Sampling (Random Walk Over-Sampling) based on graphs for imbalanced datasets.
Two schemes based on under-sampling and over-sampling methods are introduced to keep the proximity information robust to noises and outliers.
arXiv Detail & Related papers (2020-02-10T03:29:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.