Foundations of data imbalance and solutions for a data democracy
- URL: http://arxiv.org/abs/2108.00071v1
- Date: Fri, 30 Jul 2021 20:37:23 GMT
- Title: Foundations of data imbalance and solutions for a data democracy
- Authors: Ajay Kulkarni, Deri Chong, Feras A. Batarseh
- Abstract summary: Dealing with imbalanced data is a prevalent problem when performing classification on datasets.
Two essential statistical elements are resolved: the degree of class imbalance and the complexity of the concept.
Measures which are appropriate in these scenarios are discussed and implemented on a real-life dataset.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dealing with imbalanced data is a prevalent problem when performing
classification on datasets. Many times, this problem contributes to bias
while making decisions or implementing policies. Thus, it is vital to
understand the factors which cause imbalance in the data (or class imbalance).
Such hidden biases and imbalances can lead to data tyranny and a major
challenge to a data democracy. In this chapter, two essential statistical
elements are resolved: the degree of class imbalance and the complexity of the
concept; solving such issues helps in building the foundations of a data
democracy. Furthermore, statistical measures which are appropriate in these
scenarios are discussed and implemented on a real-life dataset (car insurance
claims). In the end, popular data-level methods such as random oversampling,
random undersampling, synthetic minority oversampling technique, Tomek link,
and others are implemented in Python, and their performance is compared.
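The chapter implements and compares these data-level methods in Python on a car insurance claims dataset, but neither that dataset nor the exact code is reproduced here. The sketch below is therefore a minimal, illustrative version only: it assumes the imbalanced-learn and scikit-learn libraries, a synthetic stand-in dataset with a 9:1 class ratio, and logistic regression as the classifier; none of these specifics are taken from the chapter.

```python
# Minimal sketch (not the chapter's code): compare data-level resampling
# methods on a synthetic imbalanced dataset using imbalanced-learn.
# Assumptions: the 9:1 class ratio and logistic regression classifier are
# illustrative choices, not values reported in the chapter.
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler, TomekLinks

# Synthetic stand-in for the car insurance claims data (90% / 10% classes).
X, y = make_classification(n_samples=10_000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=42)
print("Degree of class imbalance:", Counter(y_train))

samplers = {
    "no resampling": None,
    "random oversampling": RandomOverSampler(random_state=42),
    "random undersampling": RandomUnderSampler(random_state=42),
    "SMOTE": SMOTE(random_state=42),
    "Tomek links": TomekLinks(),
}

for name, sampler in samplers.items():
    # Resample the training split only; the test split stays untouched.
    if sampler is None:
        X_res, y_res = X_train, y_train
    else:
        X_res, y_res = sampler.fit_resample(X_train, y_train)
    clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    y_pred = clf.predict(X_test)
    # Imbalance-appropriate measures: balanced accuracy and minority-class F1.
    print(f"{name:22s} balanced acc = {balanced_accuracy_score(y_test, y_pred):.3f}"
          f"  F1(minority) = {f1_score(y_test, y_pred, pos_label=1):.3f}")
```

In this setup each resampler is fitted on the training split only, reflecting the usual caution that resampling the test set would inflate the reported measures.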
Related papers
- Mind the Graph When Balancing Data for Fairness or Robustness [73.03155969727038]
We define conditions on the training distribution for data balancing to lead to fair or robust models.
Our results show that, in many cases, the balanced distribution does not correspond to selectively removing the undesired dependencies.
Overall, our results highlight the importance of taking the causal graph into account before performing data balancing.
arXiv Detail & Related papers (2024-06-25T10:16:19Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
- Effective Class-Imbalance learning based on SMOTE and Convolutional Neural Networks [0.1074267520911262]
Imbalanced Data (ID) is a problem that prevents Machine Learning (ML) models from achieving satisfactory results.
In this paper, we investigate the effectiveness of methods based on Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs).
In order to achieve reliable results, we conducted our experiments 100 times with randomly shuffled data distributions.
arXiv Detail & Related papers (2022-09-01T07:42:16Z)
- D-BIAS: A Causality-Based Human-in-the-Loop System for Tackling Algorithmic Bias [57.87117733071416]
We propose D-BIAS, a visual interactive tool that embodies a human-in-the-loop AI approach for auditing and mitigating social biases.
A user can detect the presence of bias against a group by identifying unfair causal relationships in the causal network.
For each interaction, say weakening/deleting a biased causal edge, the system uses a novel method to simulate a new (debiased) dataset.
arXiv Detail & Related papers (2022-08-10T03:41:48Z)
- Imbalanced Classification via Explicit Gradient Learning From Augmented Data [0.0]
We propose a novel deep meta-learning technique to augment a given imbalanced dataset with new minority instances.
The advantage of the proposed method is demonstrated on synthetic and real-world datasets with various imbalance ratios.
arXiv Detail & Related papers (2022-02-21T22:16:50Z)
- Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)
- Sequential Targeting: an incremental learning approach for data imbalance in text classification [7.455546102930911]
Methods to handle imbalanced datasets are crucial for alleviating distributional skews.
We propose a novel training method, Sequential Targeting (ST), independent of the effectiveness of the representation method.
We demonstrate the effectiveness of our method through experiments on simulated benchmark datasets (IMDB) and data collected from NAVER.
arXiv Detail & Related papers (2020-11-20T04:54:00Z)
- Handling Imbalanced Data: A Case Study for Binary Class Problems [0.0]
A major issue in solving classification problems is imbalanced data.
This paper focuses on synthetic oversampling techniques and manually computes synthetic data points to make the algorithms easier to understand.
We analyze the application of these synthetic oversampling techniques to binary classification problems with different imbalance ratios and sample sizes.
arXiv Detail & Related papers (2020-10-09T02:04:14Z)
- Learning while Respecting Privacy and Robustness to Distributional Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model.
The objective is to endow the trained model with robustness against adversarially manipulated input data.
Proposed algorithms offer robustness with little overhead.
arXiv Detail & Related papers (2020-07-07T18:25:25Z)
- Contrastive Examples for Addressing the Tyranny of the Majority [83.93825214500131]
We propose to create a balanced training dataset, consisting of the original dataset plus new data points in which the group memberships are intervened.
We show that current generative adversarial networks are a powerful tool for learning these data points, called contrastive examples.
arXiv Detail & Related papers (2020-04-14T14:06:44Z)
- Smart Data driven Decision Trees Ensemble Methodology for Imbalanced Big Data [11.117880929232575]
Split-data strategies and the lack of minority-class data caused by the MapReduce paradigm have posed new challenges for tackling imbalanced data problems.
Smart Data refers to data of sufficient quality to achieve high-performance models.
We propose a novel Smart Data driven Decision Trees Ensemble methodology for addressing the imbalanced classification problem in Big Data domains.
arXiv Detail & Related papers (2020-01-16T12:25:59Z)