Bayesian analysis of the prevalence bias: learning and predicting from
imbalanced data
- URL: http://arxiv.org/abs/2108.00250v1
- Date: Sat, 31 Jul 2021 14:36:33 GMT
- Title: Bayesian analysis of the prevalence bias: learning and predicting from
imbalanced data
- Authors: Loic Le Folgoc and Vasileios Baltatzis and Amir Alansary and Sujal
Desai and Anand Devaraj and Sam Ellis and Octavio E. Martinez Manzanera and
Fahdi Kanavati and Arjun Nair and Julia Schnabel and Ben Glocker
- Abstract summary: This paper lays the theoretical and computational framework for training models, and for prediction, in the presence of prevalence bias.
It offers an alternative to principled training losses and complements test-time procedures based on selecting an operating point from summary curves.
It integrates seamlessly in the current paradigm of (deep) learning using backpropagation and naturally with Bayesian models.
- Score: 10.659348599372944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Datasets are rarely a realistic approximation of the target population. Say,
prevalence is misrepresented, image quality is above clinical standards, etc.
This mismatch is known as sampling bias. Sampling biases are a major hindrance
for machine learning models. They cause significant gaps between model
performance in the lab and in the real world. Our work is a solution to
prevalence bias. Prevalence bias is the discrepancy between the prevalence of a
pathology and its sampling rate in the training dataset, introduced upon
collecting data or due to the practioner rebalancing the training batches. This
paper lays the theoretical and computational framework for training models, and
for prediction, in the presence of prevalence bias. Concretely a bias-corrected
loss function, as well as bias-corrected predictive rules, are derived under
the principles of Bayesian risk minimization. The loss exhibits a direct
connection to the information gain. It offers a principled alternative to
heuristic training losses and complements test-time procedures based on
selecting an operating point from summary curves. It integrates seamlessly in
the current paradigm of (deep) learning using stochastic backpropagation and
naturally with Bayesian models.
Related papers
- Model Debiasing by Learnable Data Augmentation [19.625915578646758]
This paper proposes a novel 2-stage learning pipeline featuring a data augmentation strategy able to regularize the training.
Experiments on synthetic and realistic biased datasets show state-of-the-art classification accuracy, outperforming competing methods.
arXiv Detail & Related papers (2024-08-09T09:19:59Z) - Echoes: Unsupervised Debiasing via Pseudo-bias Labeling in an Echo
Chamber [17.034228910493056]
This paper presents experimental analyses revealing that the existing biased models overfit to bias-conflicting samples in the training data.
We propose a straightforward and effective method called Echoes, which trains a biased model and a target model with a different strategy.
Our approach achieves superior debiasing results compared to the existing baselines on both synthetic and real-world datasets.
arXiv Detail & Related papers (2023-05-06T13:13:18Z) - Provable Detection of Propagating Sampling Bias in Prediction Models [1.7709344190822935]
We provide a theoretical analysis of how a specific form of data bias, differential sampling bias, propagates from the data stage to the prediction stage.
Under reasonable assumptions, we quantify how the amount of bias in the model predictions varies as a function of the amount of differential sampling bias in the data.
We demonstrate that the theoretical results hold in practice even when our assumptions are relaxed.
arXiv Detail & Related papers (2023-02-13T23:39:35Z) - Pseudo Bias-Balanced Learning for Debiased Chest X-ray Classification [57.53567756716656]
We study the problem of developing debiased chest X-ray diagnosis models without knowing exactly the bias labels.
We propose a novel algorithm, pseudo bias-balanced learning, which first captures and predicts per-sample bias labels.
Our proposed method achieved consistent improvements over other state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-18T11:02:18Z) - Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples.
arXiv Detail & Related papers (2022-01-11T23:01:12Z) - General Greedy De-bias Learning [163.65789778416172]
We propose a General Greedy De-bias learning framework (GGD), which greedily trains the biased models and the base model like gradient descent in functional space.
GGD can learn a more robust base model under the settings of both task-specific biased models with prior knowledge and self-ensemble biased model without prior knowledge.
arXiv Detail & Related papers (2021-12-20T14:47:32Z) - Increasing Fairness in Predictions Using Bias Parity Score Based Loss
Function Regularization [0.8594140167290099]
We introduce a family of fairness enhancing regularization components that we use in conjunction with the traditional binary-cross-entropy based accuracy loss.
We deploy them in the context of a recidivism prediction task as well as on a census-based adult income dataset.
arXiv Detail & Related papers (2021-11-05T17:42:33Z) - Learning from Failure: Training Debiased Classifier from Biased
Classifier [76.52804102765931]
We show that neural networks learn to rely on spurious correlation only when it is "easier" to learn than the desired knowledge.
We propose a failure-based debiasing scheme by training a pair of neural networks simultaneously.
Our method significantly improves the training of the network against various types of biases in both synthetic and real-world datasets.
arXiv Detail & Related papers (2020-07-06T07:20:29Z) - Bayesian Sampling Bias Correction: Training with the Right Loss Function [0.0]
We derive a family of loss functions to train models in the presence of sampling bias.
Examples are when the prevalence of a pathology differs from its sampling rate in the training dataset, or when a machine learning practioner rebalances their training dataset.
arXiv Detail & Related papers (2020-06-24T15:10:43Z) - Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $varepsilon*$, which deviates substantially from the test error of worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z) - Balance-Subsampled Stable Prediction [55.13512328954456]
We propose a novel balance-subsampled stable prediction (BSSP) algorithm based on the theory of fractional factorial design.
A design-theoretic analysis shows that the proposed method can reduce the confounding effects among predictors induced by the distribution shift.
Numerical experiments on both synthetic and real-world data sets demonstrate that our BSSP algorithm significantly outperforms the baseline methods for stable prediction across unknown test data.
arXiv Detail & Related papers (2020-06-08T07:01:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.