Revisiting the Dataset Bias Problem from a Statistical Perspective
- URL: http://arxiv.org/abs/2402.03577v1
- Date: Mon, 5 Feb 2024 22:58:06 GMT
- Title: Revisiting the Dataset Bias Problem from a Statistical Perspective
- Authors: Kien Do, Dung Nguyen, Hung Le, Thao Le, Dang Nguyen, Haripriya
Harikumar, Truyen Tran, Santu Rana, Svetha Venkatesh
- Abstract summary: We study the "dataset bias" problem from a statistical standpoint.
We identify the main cause of the problem as the strong correlation between a class attribute u and a non-class attribute b.
We propose to mitigate dataset bias via either weighting the objective of each sample n by \frac{1}{p(u_n|b_n)} or sampling that sample with a weight proportional to \frac{1}{p(u_n|b_n)}.
- Score: 72.94990819287551
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we study the "dataset bias" problem from a statistical
standpoint, and identify the main cause of the problem as the strong
correlation between a class attribute u and a non-class attribute b in the
input x, represented by p(u|b) differing significantly from p(u). Since p(u|b)
appears as part of the sampling distributions in the standard maximum
log-likelihood (MLL) objective, a model trained on a biased dataset via MLL
inherently incorporates such correlation into its parameters, leading to poor
generalization to unbiased test data. From this observation, we propose to
mitigate dataset bias via either weighting the objective of each sample n by
\frac{1}{p(u_{n}|b_{n})} or sampling that sample with a weight proportional to
\frac{1}{p(u_{n}|b_{n})}. While both methods are statistically equivalent, the
former proves more stable and effective in practice. Additionally, we establish
a connection between our debiasing approach and causal reasoning, reinforcing
our method's theoretical foundation. However, when the bias label is
unavailable, computing p(u|b) exactly is difficult. To overcome this challenge,
we propose to approximate \frac{1}{p(u|b)} using a biased classifier trained
with "bias amplification" losses. Extensive experiments on various biased
datasets demonstrate the superiority of our method over existing debiasing
techniques in most settings, validating our theoretical analysis.
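To make the abstract's proposal concrete, below is a minimal PyTorch sketch (illustrative, not the authors' implementation). It estimates p(u|b) from empirical label counts, assuming bias labels b are available, and scales each sample's cross-entropy loss by 1/p(u_n|b_n), i.e., the "weighting" variant the abstract reports as more stable. The gce_loss function sketches generalized cross-entropy, one commonly used "bias amplification" loss for the case where bias labels are unavailable; the paper's exact amplification losses may differ, and all function names here are illustrative.

```python
# Minimal sketch (illustrative, not the authors' code) of debiasing by
# weighting each sample's loss with 1 / p(u_n | b_n).
import torch
import torch.nn.functional as F

def inverse_conditional_weights(u, b, num_classes, num_bias, eps=1e-8):
    """Estimate w_n = 1 / p(u_n | b_n) from empirical counts.

    u, b: LongTensors of shape (N,) holding class labels u_n and
    bias-attribute labels b_n (assumes bias labels are available).
    """
    joint = torch.zeros(num_bias, num_classes)
    for bi, ui in zip(b.tolist(), u.tolist()):
        joint[bi, ui] += 1.0
    # Normalize each row so row b holds the empirical p(u | b).
    p_u_given_b = joint / (joint.sum(dim=1, keepdim=True) + eps)
    return 1.0 / (p_u_given_b[b, u] + eps)  # shape (N,)

def weighted_mll_loss(logits, u, weights):
    """Cross-entropy with sample n reweighted by 1/p(u_n|b_n), the
    'weighting' variant of the proposed debiasing."""
    ce = F.cross_entropy(logits, u, reduction="none")
    return (weights * ce).mean()

def gce_loss(logits, u, q=0.7):
    """Generalized cross-entropy, one common bias-amplification loss (an
    assumption here; the paper's exact losses may differ). A biased
    classifier trained with it can be used to approximate 1/p(u|b) when
    bias labels are unavailable."""
    p_y = F.softmax(logits, dim=1).gather(1, u.unsqueeze(1)).squeeze(1)
    return ((1.0 - p_y.pow(q)) / q).mean()
```

The statistically equivalent resampling variant can be obtained by passing the same weights to torch.utils.data.WeightedRandomSampler instead of reweighting the loss.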
Related papers
- CosFairNet: A Parameter-Space based Approach for Bias Free Learning [1.9116784879310025]
Deep neural networks trained on biased data often inadvertently learn unintended inference rules.
We introduce a novel approach to address bias directly in the model's parameter space, preventing its propagation across layers.
We show enhanced classification accuracy and debiasing effectiveness across various synthetic and real-world datasets.
arXiv Detail & Related papers (2024-10-19T13:06:40Z)
- IBADR: an Iterative Bias-Aware Dataset Refinement Framework for Debiasing NLU models [52.03761198830643]
We propose IBADR, an Iterative Bias-Aware dataset Refinement framework.
We first train a shallow model to quantify the bias degree of samples in the pool.
Then, we pair each sample with a bias indicator representing its bias degree, and use these extended samples to train a sample generator.
In this way, the generator can effectively learn the correspondence between bias indicators and samples.
arXiv Detail & Related papers (2023-11-01T04:50:38Z)
- Mitigating Bias for Question Answering Models by Tracking Bias Influence [84.66462028537475]
We propose BMBI, an approach to mitigate the bias of multiple-choice QA models.
Based on the intuition that a model becomes more biased if it learns from a biased example, we measure the bias level of a query instance.
We show that our method could be applied to multiple QA formulations across multiple bias categories.
arXiv Detail & Related papers (2023-10-13T00:49:09Z)
- Echoes: Unsupervised Debiasing via Pseudo-bias Labeling in an Echo Chamber [17.034228910493056]
This paper presents experimental analyses revealing that the existing biased models overfit to bias-conflicting samples in the training data.
We propose a straightforward and effective method called Echoes, which trains a biased model and a target model with a different strategy.
Our approach achieves superior debiasing results compared to the existing baselines on both synthetic and real-world datasets.
arXiv Detail & Related papers (2023-05-06T13:13:18Z)
- Feature-Level Debiased Natural Language Understanding [86.8751772146264]
Existing natural language understanding (NLU) models often rely on dataset biases to achieve high performance on specific datasets.
We propose debiasing contrastive learning (DCT) to mitigate biased latent features while accounting for the dynamic nature of bias.
DCT outperforms state-of-the-art baselines on out-of-distribution datasets while maintaining in-distribution performance.
arXiv Detail & Related papers (2022-12-11T06:16:14Z)
- BiasEnsemble: Revisiting the Importance of Amplifying Bias for Debiasing [31.665352191081357]
"Debiasing" aims to train a classifier to be less susceptible to dataset bias.
A biased model $f_B$ is trained to focus on bias-aligned samples, while a debiased model $f_D$ is mainly trained with bias-conflicting samples.
We propose BiasEnsemble, a novel biased-sample selection method that removes bias-conflicting samples from the training of $f_B$.
arXiv Detail & Related papers (2022-05-29T07:55:06Z)
- Learning Debiased Representation via Disentangled Feature Augmentation [19.348340314001756]
This paper presents an empirical analysis revealing that training with "diverse" bias-conflicting samples is crucial for debiasing.
We propose a novel feature-level data augmentation technique in order to synthesize diverse bias-conflicting samples.
arXiv Detail & Related papers (2021-07-03T08:03:25Z)
- AutoDebias: Learning to Debias for Recommendation [43.84313723394282]
We propose AutoDebias, which leverages another (small) set of uniform data to optimize the debiasing parameters.
We derive the generalization bound for AutoDebias and prove its ability to acquire the appropriate debiasing strategy.
arXiv Detail & Related papers (2021-05-10T08:03:48Z)
- The Gap on GAP: Tackling the Problem of Differing Data Distributions in Bias-Measuring Datasets [58.53269361115974]
Diagnostic datasets that can detect biased models are an important prerequisite for bias reduction within natural language processing.
However, undesired patterns in the collected data can make such tests incorrect.
We introduce a theoretically grounded method for weighting test samples to cope with such patterns in the test data.
arXiv Detail & Related papers (2020-11-03T16:50:13Z)
- Towards Robustifying NLI Models Against Lexical Dataset Biases [94.79704960296108]
This paper explores both data-level and model-level debiasing methods to robustify models against lexical dataset biases.
First, we debias the dataset through data augmentation and enhancement, but show that the model bias cannot be fully removed via this method.
The second approach employs a bag-of-words sub-model to capture the features that are likely to exploit the bias and prevents the original model from learning these biased features.
arXiv Detail & Related papers (2020-05-10T17:56:10Z)