Survey of Imbalanced Data Methodologies
- URL: http://arxiv.org/abs/2104.02240v1
- Date: Tue, 6 Apr 2021 02:10:22 GMT
- Title: Survey of Imbalanced Data Methodologies
- Authors: Lian Yu, Nengfeng Zhou
- Abstract summary: We applied the under-sampling/over-sampling methodologies to several modeling algorithms on UCI and Keel data sets.
The performance was analyzed for class-imbalance methods, modeling algorithms and grid search criteria comparison.
- Score: 1.370633147306388
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imbalanced data sets are a problem often encountered and well studied in the financial
industry. In this paper, we reviewed and compared some popular methodologies
for handling data imbalance. We then applied the under-sampling/over-sampling
methodologies to several modeling algorithms on UCI and Keel data sets. The
performance was analyzed for class-imbalance methods, modeling algorithms and
grid search criteria comparison.
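As a rough illustration of the under-sampling and over-sampling idea mentioned in the abstract, here is a minimal sketch of the simplest random variants. The listing does not say which specific sampling methods the paper used, so this is an assumption for illustration, not the authors' code; the function names `random_oversample` and `random_undersample` are hypothetical.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every
    class has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        # Indices of the original rows that belong to this class.
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):
            j = rng.choice(idx)  # sample with replacement
            X_out.append(X[j])
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class so that every class has as many
    rows as the smallest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        keep.extend(rng.sample(idx, target))  # sample without replacement
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]


if __name__ == "__main__":
    # Toy data: 8 majority rows (label 0) and 2 minority rows (label 1).
    X = [[i] for i in range(10)]
    y = [0] * 8 + [1] * 2
    print(Counter(random_oversample(X, y)[1]))
    print(Counter(random_undersample(X, y)[1]))
```

In practice, resampling of this kind is normally applied to the training split only, so that evaluation data keeps the original class distribution.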
Related papers
- Methods for Class-Imbalanced Learning with Support Vector Machines: A Review and an Empirical Evaluation [22.12895887111828]
We introduce a hierarchical categorization of SVM-based models with respect to class-imbalanced learning.
We compare the performances of various representative SVM-based models in each category using benchmark imbalanced data sets.
Our findings reveal that while algorithmic methods are less time-consuming owing to no data pre-processing requirements, fusion methods, which combine both re-sampling and algorithmic approaches, generally perform the best.
arXiv Detail & Related papers (2024-06-05T15:55:08Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
- Revisiting Long-tailed Image Classification: Survey and Benchmarks with New Evaluation Metrics [88.39382177059747]
A corpus of metrics is designed for measuring the accuracy, robustness, and bounds of algorithms for learning with long-tailed distribution.
Based on our benchmarks, we re-evaluate the performance of existing methods on CIFAR10 and CIFAR100 datasets.
arXiv Detail & Related papers (2023-02-03T02:40:54Z)
- Detection and Evaluation of Clusters within Sequential Data [58.720142291102135]
Clustering algorithms for Block Markov Chains possess theoretical optimality guarantees.
In particular, our sequential data is derived from human DNA, written text, animal movement data and financial markets.
It is found that the Block Markov Chain model assumption can indeed produce meaningful insights in exploratory data analyses.
arXiv Detail & Related papers (2022-10-04T15:22:39Z)
- A Hybrid Approach for Binary Classification of Imbalanced Data [0.0]
We propose HADR, a hybrid approach with dimension reduction that consists of data block construction, dimensionality reduction, and ensemble learning.
We evaluate the performance on eight imbalanced public datasets in terms of recall, G-mean, and AUC.
arXiv Detail & Related papers (2022-07-06T15:18:41Z)
- A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework [12.856833690265985]
Class imbalance poses new challenges when it comes to classifying data streams.
Many algorithms recently proposed in the literature tackle this problem using a variety of data-level, algorithm-level, and ensemble approaches.
This work proposes a standardized, exhaustive, and comprehensive experimental framework to evaluate algorithms.
arXiv Detail & Related papers (2022-04-07T20:13:55Z)
- ASE: Anomaly Scoring Based Ensemble Learning for Imbalanced Datasets [3.214208422566496]
We propose a bagging ensemble learning framework based on an anomaly detection scoring system.
Our tests show that the ensemble learning model can dramatically improve the performance of base estimators.
arXiv Detail & Related papers (2022-03-21T07:20:41Z)
- Handling Imbalanced Data: A Case Study for Binary Class Problems [0.0]
A major issue in solving classification problems is imbalanced data.
This paper focuses on synthetic oversampling techniques and manually computes synthetic data points to make the algorithms easier to understand.
We analyze the application of these synthetic oversampling techniques to binary classification problems with different imbalance ratios and sample sizes.
arXiv Detail & Related papers (2020-10-09T02:04:14Z)
- Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
There have been rising concerns on whether the learned scoring function can cause systematic disparity across different protected groups.
We propose a model post-processing framework for balancing them in the bipartite ranking scenario.
arXiv Detail & Related papers (2020-06-15T10:08:39Z)
- Compressing Large Sample Data for Discriminant Analysis [78.12073412066698]
We consider the computational issues due to large sample size within the discriminant analysis framework.
We propose a new compression approach for reducing the number of training samples for linear and quadratic discriminant analysis.
arXiv Detail & Related papers (2020-05-08T05:09:08Z)
- Machine Learning Pipeline for Pulsar Star Dataset [58.720142291102135]
This work brings together some of the most common machine learning (ML) algorithms.
The objective is to compare the results these algorithms obtain on a set of unbalanced data.
arXiv Detail & Related papers (2020-05-03T23:35:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.