Foundations of data imbalance and solutions for a data democracy
- URL: http://arxiv.org/abs/2108.00071v1
- Date: Fri, 30 Jul 2021 20:37:23 GMT
- Title: Foundations of data imbalance and solutions for a data democracy
- Authors: Ajay Kulkarni, Deri Chong, Feras A. Batarseh
- Abstract summary: Dealing with imbalanced data is a prevalent problem when performing classification on datasets.
Two essential statistical elements are resolved: the degree of class imbalance and the complexity of the concept.
Measures which are appropriate in these scenarios are discussed and implemented on a real-life dataset.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dealing with imbalanced data is a prevalent problem when performing
classification on datasets. Many times, this problem contributes to bias
while making decisions or implementing policies. Thus, it is vital to
understand the factors which cause imbalance in the data (or class imbalance).
Such hidden biases and imbalances can lead to data tyranny and a major
challenge to a data democracy. In this chapter, two essential statistical
elements are resolved: the degree of class imbalance and the complexity of the
concept; solving such issues helps in building the foundations of a data
democracy. Furthermore, statistical measures which are appropriate in these
scenarios are discussed and implemented on a real-life dataset (car insurance
claims). In the end, popular data-level methods such as random oversampling,
random undersampling, synthetic minority oversampling technique, Tomek link,
and others are implemented in Python, and their performance is compared.
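The chapter implements and compares these data-level methods in Python on a car insurance claims dataset, but neither that dataset nor the exact code is reproduced here. The sketch below is therefore a minimal, illustrative version only: it assumes the imbalanced-learn and scikit-learn libraries, a synthetic stand-in dataset with a 9:1 class ratio, and logistic regression as the classifier; none of these specifics are taken from the chapter.

```python
# Minimal sketch (not the chapter's code): compare data-level resampling
# methods on a synthetic imbalanced dataset using imbalanced-learn.
# Assumptions: the 9:1 class ratio and logistic regression classifier are
# illustrative choices, not values reported in the chapter.
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler, TomekLinks

# Synthetic stand-in for the car insurance claims data (90% / 10% classes).
X, y = make_classification(n_samples=10_000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=42)
print("Degree of class imbalance:", Counter(y_train))

samplers = {
    "no resampling": None,
    "random oversampling": RandomOverSampler(random_state=42),
    "random undersampling": RandomUnderSampler(random_state=42),
    "SMOTE": SMOTE(random_state=42),
    "Tomek links": TomekLinks(),
}

for name, sampler in samplers.items():
    # Resample the training split only; the test split stays untouched.
    if sampler is None:
        X_res, y_res = X_train, y_train
    else:
        X_res, y_res = sampler.fit_resample(X_train, y_train)
    clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    y_pred = clf.predict(X_test)
    # Imbalance-appropriate measures: balanced accuracy and minority-class F1.
    print(f"{name:22s} balanced acc = {balanced_accuracy_score(y_test, y_pred):.3f}"
          f"  F1(minority) = {f1_score(y_test, y_pred, pos_label=1):.3f}")
```

In this setup each resampler is fitted on the training split only, reflecting the usual caution that resampling the test set would inflate the reported measures.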
Related papers
- Mind the Graph When Balancing Data for Fairness or Robustness [73.03155969727038]
We define conditions on the training distribution for data balancing to lead to fair or robust models.
Our results show that, in many cases, the balanced distribution does not correspond to selectively removing the undesired dependencies.
Overall, our results highlight the importance of taking the causal graph into account before performing data balancing.
arXiv Detail & Related papers (2024-06-25T10:16:19Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
- Effective Class-Imbalance learning based on SMOTE and Convolutional Neural Networks [0.1074267520911262]
Imbalanced Data (ID) is a problem that prevents Machine Learning (ML) models from achieving satisfactory results.
In this paper, we investigate the effectiveness of methods based on Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs).
In order to achieve reliable results, we conducted our experiments 100 times with randomly shuffled data distributions.
arXiv Detail & Related papers (2022-09-01T07:42:16Z)
- D-BIAS: A Causality-Based Human-in-the-Loop System for Tackling Algorithmic Bias [57.87117733071416]
We propose D-BIAS, a visual interactive tool that embodies a human-in-the-loop AI approach for auditing and mitigating social biases.
A user can detect the presence of bias against a group by identifying unfair causal relationships in the causal network.
For each interaction, say weakening/deleting a biased causal edge, the system uses a novel method to simulate a new (debiased) dataset.
arXiv Detail & Related papers (2022-08-10T03:41:48Z)
- Imbalanced Classification via Explicit Gradient Learning From Augmented Data [0.0]
We propose a novel deep meta-learning technique to augment a given imbalanced dataset with new minority instances.
The advantage of the proposed method is demonstrated on synthetic and real-world datasets with various imbalance ratios.
arXiv Detail & Related papers (2022-02-21T22:16:50Z)
- Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)
- Sequential Targeting: an incremental learning approach for data imbalance in text classification [7.455546102930911]
Methods to handle imbalanced datasets are crucial for alleviating distributional skews.
We propose a novel training method, Sequential Targeting (ST), independent of the effectiveness of the representation method.
We demonstrate the effectiveness of our method through experiments on simulated benchmark datasets (IMDB) and data collected from NAVER.
arXiv Detail & Related papers (2020-11-20T04:54:00Z)
- Handling Imbalanced Data: A Case Study for Binary Class Problems [0.0]
A major issue in solving classification problems is imbalanced data.
This paper focuses on synthetic oversampling techniques and manually computes synthetic data points to make the algorithms easier to understand.
We analyze the application of these synthetic oversampling techniques to binary classification problems with different imbalance ratios and sample sizes.
arXiv Detail & Related papers (2020-10-09T02:04:14Z)
- Learning while Respecting Privacy and Robustness to Distributional Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model.
The objective is to endow the trained model with robustness against adversarially manipulated input data.
Proposed algorithms offer robustness with little overhead.
arXiv Detail & Related papers (2020-07-07T18:25:25Z)
- Contrastive Examples for Addressing the Tyranny of the Majority [83.93825214500131]
We propose to create a balanced training dataset, consisting of the original dataset plus new data points in which the group memberships are intervened.
We show that current generative adversarial networks are a powerful tool for learning these data points, called contrastive examples.
arXiv Detail & Related papers (2020-04-14T14:06:44Z)
- Smart Data driven Decision Trees Ensemble Methodology for Imbalanced Big Data [11.117880929232575]
Split-data strategies and the lack of minority-class data caused by the MapReduce paradigm have posed new challenges for tackling imbalanced data problems.
Smart Data refers to data of sufficient quality to achieve high-performance models.
We propose a novel Smart Data driven Decision Trees Ensemble methodology for addressing the imbalanced classification problem in Big Data domains.
arXiv Detail & Related papers (2020-01-16T12:25:59Z)