A Study imbalance handling by various data sampling methods in binary
classification
- URL: http://arxiv.org/abs/2105.10959v1
- Date: Sun, 23 May 2021 15:27:47 GMT
- Title: A Study imbalance handling by various data sampling methods in binary
classification
- Authors: Mohamed Hamama
- Abstract summary: This research report presents our learning curve and exposure to the Machine Learning life cycle.
We explore various techniques, from pre-processing to final optimization and model evaluation.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The purpose of this research report is to present our learning curve and exposure to the Machine Learning life cycle, using a Kaggle binary classification data set and exploring various techniques from pre-processing to final optimization and model evaluation. We also highlight the data imbalance issue and discuss methods of handling that imbalance at the data level through over-sampling and under-sampling, not only to reach a balanced class representation but also to improve overall performance. This work also identifies gaps for future work.
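The data-level handling described in the abstract can be illustrated with a short, self-contained sketch. This is not the authors' exact pipeline; it only assumes the widely used scikit-learn and imbalanced-learn libraries, a synthetic imbalanced data set, and standard over-/under-sampling (SMOTE and random under-sampling) compared via balanced accuracy.

```python
# Minimal sketch of data-level imbalance handling (not the paper's exact pipeline).
# Assumes scikit-learn and imbalanced-learn; the data set here is synthetic.
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Synthetic binary data set with a roughly 95:5 class ratio.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95, 0.05],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

samplers = {
    "none": None,
    "SMOTE (over-sampling)": SMOTE(random_state=42),
    "random under-sampling": RandomUnderSampler(random_state=42),
}

for name, sampler in samplers.items():
    if sampler is None:
        X_res, y_res = X_train, y_train
    else:
        # Resample only the training split; the test split keeps its natural imbalance.
        X_res, y_res = sampler.fit_resample(X_train, y_train)
    clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    score = balanced_accuracy_score(y_test, clf.predict(X_test))
    print(f"{name:25s} train distribution={dict(Counter(y_res))} "
          f"balanced accuracy={score:.3f}")
```

Evaluating with balanced accuracy rather than plain accuracy keeps the comparison meaningful on a test split that remains imbalanced.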
Related papers
- Deep Learning Meets Oversampling: A Learning Framework to Handle Imbalanced Classification [0.0]
We propose a novel learning framework that can generate synthetic data instances in a data-driven manner.
The proposed framework formulates the oversampling process as a composition of discrete decision criteria.
Experiments on the imbalanced classification task demonstrate the superiority of our framework over state-of-the-art algorithms.
arXiv Detail & Related papers (2025-02-08T13:35:00Z)
- Statistical Undersampling with Mutual Information and Support Points [4.118796935183671]
Class imbalance and distributional differences in large datasets present significant challenges for classification tasks in machine learning.
This work introduces two novel undersampling approaches: mutual information-based stratified simple random sampling and support points optimization.
arXiv Detail & Related papers (2024-12-19T04:48:29Z)
- Combining Denoising Autoencoders with Contrastive Learning to fine-tune Transformer Models [0.0]
This work proposes a 3-phase technique to adjust a base model for a classification task.
We adapt the model's signal to the data distribution by performing further training with a Denoising Autoencoder (DAE).
In addition, we introduce a new data augmentation approach for Supervised Contrastive Learning to correct the unbalanced datasets.
arXiv Detail & Related papers (2024-05-23T11:08:35Z)
- Distilled Datamodel with Reverse Gradient Matching [74.75248610868685]
We introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages.
Our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.
arXiv Detail & Related papers (2024-04-22T09:16:14Z)
- On the Trade-off of Intra-/Inter-class Diversity for Supervised Pre-training [72.8087629914444]
We study the impact of the trade-off between the intra-class diversity (the number of samples per class) and the inter-class diversity (the number of classes) of a supervised pre-training dataset.
With the size of the pre-training dataset fixed, the best downstream performance comes with a balance on the intra-/inter-class diversity.
arXiv Detail & Related papers (2023-05-20T16:23:50Z)
- Revisiting Long-tailed Image Classification: Survey and Benchmarks with New Evaluation Metrics [88.39382177059747]
A corpus of metrics is designed for measuring the accuracy, robustness, and bounds of algorithms for learning with long-tailed distribution.
Based on our benchmarks, we re-evaluate the performance of existing methods on CIFAR10 and CIFAR100 datasets.
arXiv Detail & Related papers (2023-02-03T02:40:54Z)
- Learning to Re-weight Examples with Optimal Transport for Imbalanced Classification [74.62203971625173]
Imbalanced data pose challenges for deep learning based classification models.
One of the most widely-used approaches for tackling imbalanced data is re-weighting (a baseline version is sketched after this list).
We propose a novel re-weighting method based on optimal transport (OT) from a distributional point of view.
arXiv Detail & Related papers (2022-08-05T01:23:54Z)
- Study of sampling methods in sentiment analysis of imbalanced data [0.0]
This work investigates the application of sampling methods for sentiment analysis on two different datasets.
One dataset contains online user reviews from the cooking platform Epicurious and the other contains comments given to the Planned Parenthood organization.
arXiv Detail & Related papers (2021-06-12T03:16:18Z)
- Semi-supervised Long-tailed Recognition using Alternate Sampling [95.93760490301395]
The main challenges in long-tailed recognition come from the imbalanced data distribution and sample scarcity in the tail classes.
We propose a new recognition setting, namely semi-supervised long-tailed recognition.
We demonstrate significant accuracy improvements over other competitive methods on two datasets.
arXiv Detail & Related papers (2021-05-01T00:43:38Z)
- Handling Imbalanced Data: A Case Study for Binary Class Problems [0.0]
A major issue in solving classification problems is imbalanced data.
This paper focuses on synthetic oversampling techniques and manually computes synthetic data points to aid comprehension of the algorithms.
We analyze the application of these synthetic oversampling techniques on binary classification problems with different imbalance ratios and sample sizes.
arXiv Detail & Related papers (2020-10-09T02:04:14Z)
- Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
There have been rising concerns on whether the learned scoring function can cause systematic disparity across different protected groups.
We propose a model post-processing framework for balancing them in the bipartite ranking scenario.
arXiv Detail & Related papers (2020-06-15T10:08:39Z)
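The re-weighting baseline mentioned in the "Learning to Re-weight Examples with Optimal Transport" entry above can be sketched as follows. This is not that paper's OT-based method; it only shows the standard inverse-class-frequency weighting exposed by scikit-learn's class_weight="balanced" option, on a hypothetical synthetic data set.

```python
# Standard inverse-class-frequency re-weighting (baseline, not the OT method above).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced binary data set (roughly 90:10).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Each class gets weight n_samples / (n_classes * n_samples_in_class), so errors
# on the minority class cost proportionally more during training.
weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), weights)))

# Equivalent shortcut: let the estimator compute the same weights internally.
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)
```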