An Exploration of How Training Set Composition Bias in Machine Learning
Affects Identifying Rare Objects
- URL: http://arxiv.org/abs/2207.03207v1
- Date: Thu, 7 Jul 2022 10:26:55 GMT
- Title: An Exploration of How Training Set Composition Bias in Machine Learning
Affects Identifying Rare Objects
- Authors: Sean E. Lake and Chao-Wei Tsai
- Abstract summary: It is common to up-weight the examples of the rare class to ensure it isn't ignored.
It is also a frequent practice to train on restricted data where the balance of source types is closer to equal.
Here we show that these practices can bias the model toward over-assigning sources to the rare class.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: When training a machine learning classifier on data where one of the classes
is intrinsically rare, the classifier will often assign too few sources to the
rare class. To address this, it is common to up-weight the examples of the rare
class to ensure it isn't ignored. It is also a frequent practice to train on
restricted data where the balance of source types is closer to equal for the
same reason. Here we show that these practices can bias the model toward
over-assigning sources to the rare class. We also explore how to detect when
training data bias has had a statistically significant impact on the trained
model's predictions, and how to reduce the bias's impact. While the magnitude
of the impact of the techniques developed here will vary with the details of
the application, in most cases it should be modest. They are, however,
universally applicable whenever a machine learning classification model is
used, making them analogous to Bessel's correction to the sample variance.
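As a concrete illustration of the kind of correction discussed above, the sketch below applies the standard prior-shift re-weighting of predicted posteriors: probabilities from a model trained on up-weighted or rebalanced data are multiplied by the ratio of the true to the training class frequencies and renormalised. This is a minimal sketch assuming both sets of class frequencies are known; it is not the paper's specific procedure, and the function name is illustrative.

```python
import numpy as np

def correct_prior_shift(probs, train_prior, target_prior):
    """Re-weight predicted class posteriors from a model trained under
    `train_prior` (e.g. a rebalanced or up-weighted training set) so that
    they reflect `target_prior`, the class frequencies of the real population.

    probs        : (n_samples, n_classes) predicted probabilities
    train_prior  : (n_classes,) class frequencies seen during training
    target_prior : (n_classes,) class frequencies expected in deployment
    """
    probs = np.asarray(probs, dtype=float)
    ratio = np.asarray(target_prior, dtype=float) / np.asarray(train_prior, dtype=float)
    adjusted = probs * ratio                                # Bayes re-weighting
    return adjusted / adjusted.sum(axis=1, keepdims=True)   # renormalise each row

# Example: trained on a 50/50 rebalanced set, but the rare class is only 1%
# of the population; borderline predictions shift back toward the common class.
probs = np.array([[0.40, 0.60],
                  [0.90, 0.10]])
print(correct_prior_shift(probs, train_prior=[0.5, 0.5], target_prior=[0.99, 0.01]))
```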
Related papers
- Model Debiasing by Learnable Data Augmentation [19.625915578646758]
This paper proposes a novel two-stage learning pipeline featuring a data augmentation strategy that regularizes training.
Experiments on synthetic and realistic biased datasets show state-of-the-art classification accuracy, outperforming competing methods.
arXiv Detail & Related papers (2024-08-09T09:19:59Z)
- SelecMix: Debiased Learning by Contradicting-pair Sampling [39.613595678105845]
Neural networks trained with ERM learn unintended decision rules when their training data is biased.
We propose an alternative based on mixup, a popular augmentation that creates convex combinations of training examples.
Our method, coined SelecMix, applies mixup to contradicting pairs of examples, defined as showing either (i) the same label but dissimilar biased features, or (ii) different labels but similar biased features.
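A rough sketch of the pairing rule described above, assuming explicit bias-attribute labels are available (the paper instead selects pairs by similarity of learned biased features); the function and variable names are illustrative.

```python
import torch

def mixup_contradicting_pairs(x, y, bias, alpha=1.0):
    """For each example, pick a partner that contradicts it: (i) same label but
    a different bias attribute, or (ii) a different label but the same bias
    attribute, then form a mixup combination of the pair."""
    n = x.size(0)
    partner = torch.arange(n)
    for i in range(n):
        same_y_diff_b = ((y == y[i]) & (bias != bias[i])).nonzero().flatten()
        diff_y_same_b = ((y != y[i]) & (bias == bias[i])).nonzero().flatten()
        candidates = torch.cat([same_y_diff_b, diff_y_same_b])
        if len(candidates) > 0:
            partner[i] = candidates[torch.randint(len(candidates), (1,)).item()]
    lam = torch.distributions.Beta(alpha, alpha).sample()
    x_mix = lam * x + (1 - lam) * x[partner]
    # Train with the usual mixup objective:
    #   lam * CE(f(x_mix), y) + (1 - lam) * CE(f(x_mix), y[partner])
    return x_mix, y, y[partner], lam
```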
arXiv Detail & Related papers (2022-11-04T07:15:36Z)
- Prisoners of Their Own Devices: How Models Induce Data Bias in Performative Prediction [4.874780144224057]
A biased model can make decisions that disproportionately harm certain groups in society.
Much work has been devoted to measuring unfairness in static ML environments, but not in dynamic, performative prediction ones.
We propose a taxonomy to characterize bias in the data, and study cases where it is shaped by model behaviour.
arXiv Detail & Related papers (2022-06-27T10:56:04Z)
- CMW-Net: Learning a Class-Aware Sample Weighting Mapping for Robust Deep Learning [55.733193075728096]
Modern deep neural networks can easily overfit to biased training data containing corrupted labels or class imbalance.
Sample re-weighting methods are commonly used to alleviate this data bias issue.
We propose a meta-model capable of adaptively learning an explicit weighting scheme directly from data.
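For context, the sketch below shows the simplest hand-crafted version of class-aware sample re-weighting (inverse class-frequency weights on a cross-entropy loss); CMW-Net replaces such a fixed rule with a weighting function learned by a meta-model, which is not reproduced here. Names are illustrative.

```python
import torch
import torch.nn.functional as F

def inverse_frequency_weights(labels, n_classes):
    """Fixed class-aware weights: rarer classes receive larger weights."""
    counts = torch.bincount(labels, minlength=n_classes).float().clamp(min=1)
    per_class = counts.sum() / (n_classes * counts)
    return per_class[labels]

def weighted_cross_entropy(logits, labels, n_classes):
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    return (inverse_frequency_weights(labels, n_classes) * per_sample).mean()

# Usage in a training step:
#   loss = weighted_cross_entropy(model(x), y, n_classes)
#   loss.backward()
```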
arXiv Detail & Related papers (2022-02-11T13:49:51Z)
- Right for the Right Latent Factors: Debiasing Generative Models via Disentanglement [20.41752850243945]
A key assumption of most statistical machine learning methods is that they have access to independent samples from the distribution of the data they will encounter at test time.
In particular, machine learning models have been shown to exhibit Clever-Hans-like behaviour, meaning that spurious correlations in the training set are inadvertently learnt.
We propose to debias generative models by disentangling their internal representations, which is achieved via human feedback.
arXiv Detail & Related papers (2022-02-01T13:16:18Z)
- Prototypical Classifier for Robust Class-Imbalanced Learning [64.96088324684683]
We propose Prototypical, which does not require fitting additional parameters given the embedding network.
Prototypical produces balanced and comparable predictions for all classes even though the training set is class-imbalanced.
We test our method on the CIFAR-10LT, CIFAR-100LT and WebVision datasets, observing that Prototypical obtains substantial improvements compared with the state of the art.
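A minimal sketch of prototype-based prediction with a fixed embedding network: each class prototype is the mean embedding of its training examples, and predictions come from distances to the prototypes, so no extra parameters are fitted. Details of the paper's method (e.g. how it handles imbalance during training) are omitted; names are illustrative.

```python
import torch

def class_prototypes(embeddings, labels, n_classes):
    """One prototype per class: the mean embedding of that class's examples."""
    protos = torch.zeros(n_classes, embeddings.size(1), device=embeddings.device)
    for c in range(n_classes):
        mask = labels == c
        if mask.any():
            protos[c] = embeddings[mask].mean(dim=0)
    return protos

def prototype_predict(embeddings, protos):
    """Score each sample by (negative) Euclidean distance to every prototype;
    a softmax turns the distances into comparable per-class scores."""
    return torch.softmax(-torch.cdist(embeddings, protos), dim=1)

# Usage with a trained encoder:
#   protos = class_prototypes(encoder(x_train), y_train, n_classes)
#   probs  = prototype_predict(encoder(x_test), protos)
```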
arXiv Detail & Related papers (2021-10-22T01:55:01Z)
- Bayesian analysis of the prevalence bias: learning and predicting from imbalanced data [10.659348599372944]
This paper lays the theoretical and computational framework for training models, and for prediction, in the presence of prevalence bias.
It offers an alternative to principled training losses and complements test-time procedures based on selecting an operating point from summary curves.
It integrates seamlessly in the current paradigm of (deep) learning using backpropagation and naturally with Bayesian models.
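One simple and widely used way to bake a known prevalence into training, shown here only as an illustration of the idea rather than the paper's Bayesian derivation, is to add the log class prior to the logits inside the cross-entropy loss:

```python
import torch
import torch.nn.functional as F

def prior_adjusted_cross_entropy(logits, labels, class_prior):
    """Cross-entropy computed on logits offset by the log class prior, so the
    network models deviations from the known prevalence instead of having to
    absorb the class imbalance into its raw scores."""
    log_prior = torch.log(torch.as_tensor(class_prior, dtype=logits.dtype, device=logits.device))
    return F.cross_entropy(logits + log_prior, labels)

# Usage: loss = prior_adjusted_cross_entropy(model(x), y, class_prior=[0.99, 0.01])
```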
arXiv Detail & Related papers (2021-07-31T14:36:33Z)
- Learning from others' mistakes: Avoiding dataset biases without modeling them [111.17078939377313]
State-of-the-art natural language processing (NLP) models often learn to model dataset biases and surface form correlations instead of features that target the intended task.
Previous work has demonstrated effective methods to circumvent these issues when knowledge of the bias is available.
We show a method for training models that learn to ignore these problematic correlations.
arXiv Detail & Related papers (2020-12-02T16:10:54Z)
- LOGAN: Local Group Bias Detection by Clustering [86.38331353310114]
We argue that evaluating bias at the corpus level is not enough for understanding how biases are embedded in a model.
We propose LOGAN, a new bias detection technique based on clustering.
Experiments on toxicity classification and object classification tasks show that LOGAN identifies bias in a local region.
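A generic sketch of the clustering idea, assuming group labels and per-example correctness are available: cluster examples in an embedding space and compare model accuracy across groups within each cluster, so that bias confined to a local region is not averaged away. LOGAN's own clustering objective and metrics differ; names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def local_group_gaps(embeddings, group, correct, n_clusters=10, seed=0):
    """Cluster examples, then report the largest accuracy gap between groups
    inside each cluster; large gaps flag locally biased regions."""
    cluster = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(embeddings)
    gaps = {}
    for c in range(n_clusters):
        in_c = cluster == c
        accs = [correct[in_c & (group == g)].mean() for g in np.unique(group[in_c])]
        gaps[c] = float(max(accs) - min(accs)) if len(accs) > 1 else 0.0
    return gaps
```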
arXiv Detail & Related papers (2020-10-06T16:42:51Z)
- Understanding Classifier Mistakes with Generative Models [88.20470690631372]
Deep neural networks are effective on supervised learning tasks, but have been shown to be brittle.
In this paper, we leverage generative models to identify and characterize instances where classifiers fail to generalize.
Our approach is agnostic to class labels from the training set, which makes it applicable to models trained in a semi-supervised way.
arXiv Detail & Related papers (2020-10-05T22:13:21Z)
- Learning from Failure: Training Debiased Classifier from Biased Classifier [76.52804102765931]
We show that neural networks learn to rely on spurious correlation only when it is "easier" to learn than the desired knowledge.
We propose a failure-based debiasing scheme by training a pair of neural networks simultaneously.
Our method significantly improves the training of the network against various types of biases in both synthetic and real-world datasets.
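A sketch of the failure-based weighting idea, assuming one network is deliberately trained to be bias-prone (the paper uses a generalized cross-entropy loss for this) while a second network is being debiased: samples the biased network struggles with receive larger weights in the debiased network's loss. Names are illustrative.

```python
import torch
import torch.nn.functional as F

def relative_difficulty_weights(logits_biased, logits_debiased, labels, eps=1e-8):
    """Weight each sample by how much harder it is for the intentionally biased
    network than for the debiased one; bias-conflicting samples get weights
    close to 1, bias-aligned samples close to 0."""
    ce_b = F.cross_entropy(logits_biased, labels, reduction="none")
    ce_d = F.cross_entropy(logits_debiased, labels, reduction="none")
    return (ce_b / (ce_b + ce_d + eps)).detach()

# Debiased network's loss in a training step:
#   w = relative_difficulty_weights(net_biased(x), net_debiased(x), y)
#   loss = (w * F.cross_entropy(net_debiased(x), y, reduction="none")).mean()
```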
arXiv Detail & Related papers (2020-07-06T07:20:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.