Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of Question Answering Models
- URL: http://arxiv.org/abs/2305.06841v2
- Date: Tue, 6 Feb 2024 11:30:00 GMT
- Title: Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of Question Answering Models
- Authors: Lukáš Mikula, Michal Štefánik, Marek Petrovič, Petr Sojka
- Abstract summary: We propose a simple method for measuring the scale of models' reliance on any identified spurious feature.
We assess robustness to a large set of known and newly found prediction biases for various pre-trained models and debiasing methods in Question Answering (QA).
We find that while existing debiasing methods can mitigate reliance on a chosen spurious feature, the OOD performance gains of these methods cannot be explained by mitigated reliance on biased features.
- Score: 3.9052860539161918
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Large Language Models (LLMs) dominate a majority of language understanding tasks, previous work shows that some of these results are supported by modelling spurious correlations of training datasets. Authors commonly assess model robustness by evaluating their models on out-of-distribution (OOD) datasets of the same task, but these datasets might share the biases of the training dataset.
We propose a simple method for measuring the scale of models' reliance on any identified spurious feature and assess robustness to a large set of known and newly found prediction biases for various pre-trained models and debiasing methods in Question Answering (QA). We find that while existing debiasing methods can mitigate reliance on a chosen spurious feature, the OOD performance gains of these methods cannot be explained by mitigated reliance on biased features, suggesting that biases are shared among different QA datasets. Finally, we provide evidence that this is the case by measuring that the performance of models trained on different QA datasets relies comparably on the same bias features. We hope these results will motivate future work to refine reports of LMs' robustness to the level of adversarial samples that address specific spurious features.
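The abstract does not spell out the measurement procedure, so the following is only a minimal sketch of one plausible way to score reliance on an identified spurious feature: split an evaluation set by whether a simple bias heuristic alone recovers the gold answer, then compare the model's accuracy on the bias-aligned and bias-conflicting subsets. All names and the heuristic interface are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: quantify how much a QA model's accuracy depends on a
# single spurious feature (e.g. "the answer lies in the sentence most lexically
# similar to the question"). Not the paper's actual implementation.
from typing import Callable, Iterable, Tuple

Example = dict  # expects keys: "question", "context", "answer"

def reliance_gap(
    model_predict: Callable[[Example], str],   # model under test
    bias_heuristic: Callable[[Example], str],  # answers using ONLY the spurious feature
    dataset: Iterable[Example],
) -> Tuple[float, float, float]:
    """Return (accuracy on bias-aligned subset, accuracy on bias-conflicting
    subset, gap). A large gap suggests the model leans on the same shortcut
    as the heuristic."""
    aligned_hits = aligned_total = 0
    conflicting_hits = conflicting_total = 0
    for ex in dataset:
        gold = ex["answer"]
        model_correct = model_predict(ex) == gold
        if bias_heuristic(ex) == gold:   # the shortcut alone already solves it
            aligned_total += 1
            aligned_hits += model_correct
        else:                            # the shortcut fails; real reasoning is needed
            conflicting_total += 1
            conflicting_hits += model_correct
    acc_aligned = aligned_hits / max(aligned_total, 1)
    acc_conflicting = conflicting_hits / max(conflicting_total, 1)
    return acc_aligned, acc_conflicting, acc_aligned - acc_conflicting
```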
Related papers
- Ranking and Combining Latent Structured Predictive Scores without Labeled Data [2.5064967708371553]
This paper introduces a novel structured unsupervised ensemble learning model (SUEL).
It exploits the dependency between a set of predictors with continuous predictive scores, ranks the predictors without labeled data, and combines them into an ensemble score with weights.
The efficacy of the proposed methods is rigorously assessed through both simulation studies and real-world application of risk genes discovery.
arXiv Detail & Related papers (2024-08-14T20:14:42Z)
- Debiasing Multimodal Models via Causal Information Minimization [65.23982806840182]
We study bias arising from confounders in a causal graph for multimodal data.
Robust predictive features contain diverse information that helps a model generalize to out-of-distribution data.
We use these features as confounder representations and remove bias from models via methods motivated by causal theory.
arXiv Detail & Related papers (2023-11-28T16:46:14Z)
- Towards Better Modeling with Missing Data: A Contrastive Learning-based Visual Analytics Perspective [7.577040836988683]
Missing data can pose a challenge for machine learning (ML) modeling.
Current approaches are categorized into feature imputation and label prediction.
This study proposes a Contrastive Learning framework to model observed data with missing values.
arXiv Detail & Related papers (2023-09-18T13:16:24Z)
- Guide the Learner: Controlling Product of Experts Debiasing Method Based on Token Attribution Similarities [17.082695183953486]
A popular workaround is to train a robust model by re-weighting training examples based on a secondary biased model.
Here, the underlying assumption is that the biased model resorts to shortcut features.
We introduce a fine-tuning strategy that incorporates the similarity between the main and biased model attribution scores in a Product of Experts loss function (a generic sketch of the basic product-of-experts objective follows this entry).
arXiv Detail & Related papers (2023-02-06T15:21:41Z)
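For context, the entry above builds on the standard product-of-experts (PoE) debiasing objective; below is a minimal generic sketch of that objective only. The attribution-similarity control that the paper introduces is reduced here to a hypothetical scalar weight, and all names are illustrative rather than taken from the paper's code.

```python
# Minimal product-of-experts (PoE) debiasing loss sketch (PyTorch).
# The main model is trained through logits combined with a frozen biased model,
# so examples the biased model already solves contribute less gradient.
import torch
import torch.nn.functional as F

def poe_loss(main_logits: torch.Tensor,
             bias_logits: torch.Tensor,
             labels: torch.Tensor,
             bias_weight: float = 1.0) -> torch.Tensor:
    """main_logits, bias_logits: (batch, num_classes); labels: (batch,).
    bias_weight is a stand-in for the attribution-similarity control described
    in the paper (hypothetical simplification)."""
    main_log_probs = F.log_softmax(main_logits, dim=-1)
    bias_log_probs = F.log_softmax(bias_logits.detach(), dim=-1)  # no gradient to the biased model
    combined = main_log_probs + bias_weight * bias_log_probs      # product of experts in log space
    return F.cross_entropy(combined, labels)                      # renormalizes the combined distribution
```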
- Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning [54.61762276179205]
We propose a novel contrastive learning approach, MMBS, for building robust VQA models by Making the Most of Biased Samples.
Specifically, we construct positive samples for contrastive learning by eliminating the information related to spurious correlation from the original training samples (a generic contrastive-loss sketch follows this entry).
We validate our contributions by achieving competitive performance on the OOD dataset VQA-CP v2 while preserving robust performance on the ID dataset VQA v2.
arXiv Detail & Related papers (2022-10-10T11:05:21Z)
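As a rough reference for the contrastive objective mentioned in the entry above, here is a generic InfoNCE-style loss over one anchor, one positive, and several negatives. How MMBS actually constructs the positive samples (by stripping spuriously correlated information) is the paper's contribution and is not reproduced here; all names are illustrative.

```python
# Generic InfoNCE-style contrastive loss sketch (PyTorch); not MMBS itself.
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor,        # (dim,) embedding of the original sample
             positive: torch.Tensor,      # (dim,) embedding of the bias-reduced view
             negatives: torch.Tensor,     # (num_neg, dim) embeddings of other samples
             temperature: float = 0.07) -> torch.Tensor:
    anchor = F.normalize(anchor, dim=-1)
    candidates = F.normalize(torch.cat([positive.unsqueeze(0), negatives], dim=0), dim=-1)
    logits = candidates @ anchor / temperature   # cosine similarities as logits
    target = torch.zeros(1, dtype=torch.long)    # index 0 is the positive
    return F.cross_entropy(logits.unsqueeze(0), target)
```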
- Unveiling Project-Specific Bias in Neural Code Models [20.131797671630963]
Large Language Model (LLM)-based neural code models often struggle to generalize effectively to real-world inter-project out-of-distribution (OOD) data.
We show that this phenomenon is caused by the heavy reliance on project-specific shortcuts for prediction instead of ground-truth evidence.
We propose a novel bias mitigation mechanism that regularizes the model's learning behavior by leveraging latent logic relations among samples.
arXiv Detail & Related papers (2022-01-19T02:09:48Z)
- General Greedy De-bias Learning [163.65789778416172]
We propose a General Greedy De-bias learning framework (GGD), which greedily trains the biased models and the base model like gradient descent in functional space.
GGD can learn a more robust base model under the settings of both task-specific biased models with prior knowledge and self-ensemble biased model without prior knowledge.
arXiv Detail & Related papers (2021-12-20T14:47:32Z)
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models (a minimal form of the underlying item response model is sketched after this entry).
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
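For reference, the item response theory named in the title above is typically instantiated as the two-parameter logistic (2PL) model shown below; this is the textbook form, not necessarily the exact variant fitted in the paper.

```python
# Two-parameter logistic (2PL) item response model: probability that a model
# with ability theta answers an item with difficulty b and discrimination a.
import math

def p_correct(theta: float, a: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Example: a strong model (theta=2.0) on a hard, discriminative item.
# p_correct(2.0, a=1.5, b=1.0) ≈ 0.82
```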
- Learning from others' mistakes: Avoiding dataset biases without modeling them [111.17078939377313]
State-of-the-art natural language processing (NLP) models often learn to model dataset biases and surface form correlations instead of features that target the intended task.
Previous work has demonstrated effective methods to circumvent these issues when knowledge of the bias is available.
We show a method for training models that learn to ignore these problematic correlations.
arXiv Detail & Related papers (2020-12-02T16:10:54Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample to assign optimal weights to unlabeled queries (a generic sketch of such a confidence-weighted prototype update follows this entry).
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
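The last entry describes refining class prototypes with confidence-weighted unlabeled queries. The sketch below shows such a weighted update under the assumption that per-query confidences are already available; meta-learning those confidences is the paper's contribution and is not shown, and all names are illustrative.

```python
# Sketch of a confidence-weighted transductive prototype update (PyTorch).
# Assumes per-query soft confidences are given; meta-learning them is the
# paper's contribution and is not shown here.
import torch

def refine_prototypes(prototypes: torch.Tensor,   # (num_classes, dim) from support-set means
                      queries: torch.Tensor,      # (num_queries, dim) unlabeled embeddings
                      confidence: torch.Tensor    # (num_queries, num_classes) soft assignments
                      ) -> torch.Tensor:
    # Each query contributes to every class prototype in proportion to its confidence.
    weighted_sum = confidence.T @ queries                  # (num_classes, dim)
    weight_total = confidence.sum(dim=0, keepdim=True).T   # (num_classes, 1)
    # Blend the original support-based prototype (weight 1) with the weighted queries.
    return (prototypes + weighted_sum) / (1.0 + weight_total)
```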
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.