Roses Are Red, Violets Are Blue... but Should VQA Expect Them To?
- URL: http://arxiv.org/abs/2006.05121v3
- Date: Wed, 7 Apr 2021 14:13:35 GMT
- Title: Roses Are Red, Violets Are Blue... but Should VQA Expect Them To?
- Authors: Corentin Kervadec (LIRIS), Grigory Antipov (Orange), Moez Baccouche
(Orange), Christian Wolf (LIRIS)
- Abstract summary: We argue that the standard evaluation metric, which consists in measuring the overall in-domain accuracy, is misleading.
We propose the GQA-OOD benchmark designed to overcome these concerns.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Models for Visual Question Answering (VQA) are notorious for their tendency
to rely on dataset biases, as the large and unbalanced diversity of questions
and concepts involved tends to prevent models from learning to reason,
leading them to perform educated guesses instead. In this paper, we claim that
the standard evaluation metric, which consists in measuring the overall
in-domain accuracy, is misleading. Since questions and concepts are unbalanced,
this tends to favor models which exploit subtle training set statistics.
Alternatively, naively introducing artificial distribution shifts between train
and test splits is also not completely satisfying. First, the shifts do not
reflect real-world tendencies, resulting in unsuitable models; second, since
the shifts are handcrafted, trained models are specifically designed for this
particular setting, and do not generalize to other configurations. We propose
the GQA-OOD benchmark designed to overcome these concerns: we measure and
compare accuracy over both rare and frequent question-answer pairs, and argue
that the former is better suited to the evaluation of reasoning abilities,
which we experimentally validate with models trained to more or less exploit
biases. In a large-scale study involving 7 VQA models and 3 bias reduction
techniques, we also experimentally demonstrate that these models fail to
address questions involving infrequent concepts and provide recommendations for
future directions of research.
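The key quantity in the proposed evaluation is accuracy computed separately over rare and frequent question-answer pairs, where rarity is measured within groups of similar questions. The sketch below illustrates that kind of split; the grouping key, the rarity cut-off, and all field names are illustrative assumptions, not the benchmark's actual implementation.

```python
from collections import Counter, defaultdict

def split_accuracy(samples, rare_quantile=0.2):
    """Accuracy reported separately on rare and frequent question-answer pairs.

    Each sample is assumed to be a dict with:
      'group'   - an identifier for a group of similar questions
      'answer'  - the ground-truth answer
      'correct' - whether the model answered correctly (bool)
    The grouping key and the rare/frequent cut-off are illustrative choices only.
    """
    # How often each answer appears within its question group.
    counts = defaultdict(Counter)
    for s in samples:
        counts[s['group']][s['answer']] += 1

    rare, frequent = [], []
    for s in samples:
        group_counts = counts[s['group']]
        ranked = sorted(group_counts.values())
        cutoff = ranked[int(rare_quantile * (len(ranked) - 1))]  # low-frequency threshold
        (rare if group_counts[s['answer']] <= cutoff else frequent).append(s['correct'])

    acc = lambda xs: sum(xs) / len(xs) if xs else float('nan')
    return {'acc_rare': acc(rare),
            'acc_frequent': acc(frequent),
            'acc_overall': acc([s['correct'] for s in samples])}
```

Reporting the rare and frequent accuracies side by side exposes exactly the gap that a single overall in-domain accuracy hides, which is the point the abstract makes.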
Related papers
- Crowdsourcing with Difficulty: A Bayesian Rating Model for Heterogeneous Items [0.716879432974126]
In applied statistics and machine learning, the "gold standards" used for training are often biased and almost always noisy.
Dawid and Skene's justifiably popular crowdsourcing model adjusts for rater (coder, annotator) sensitivity and specificity, but fails to capture distributional properties of rating data gathered for training.
We introduce a general purpose measurement-error model with which we can infer consensus categories by adding item-level effects for difficulty, discriminativeness, and guessability.
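The summary names three item-level effects (difficulty, discriminativeness, guessability) added on top of rater sensitivity and specificity. The snippet below is a rough, IRT-flavoured illustration of how such effects can enter the probability that a rater labels an item correctly; the 3PL-style parametrization and the parameter names are assumptions made for illustration, not the paper's actual likelihood.

```python
import math

def p_correct_rating(rater_ability, difficulty, discriminativeness, guessability):
    """Probability that a rater assigns the correct category to an item.

    Illustrative 3PL-style form: guessability sets a floor on the probability,
    and discriminativeness controls how sharply it rises once the rater's
    ability exceeds the item's difficulty. The referenced model may differ.
    """
    logistic = 1.0 / (1.0 + math.exp(-discriminativeness * (rater_ability - difficulty)))
    return guessability + (1.0 - guessability) * logistic

# A middling rater on a hard but easily guessable binary item.
print(round(p_correct_rating(rater_ability=0.0, difficulty=1.0,
                             discriminativeness=1.5, guessability=0.5), 3))
```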
arXiv Detail & Related papers (2024-05-29T20:59:28Z)
- Realistic Conversational Question Answering with Answer Selection based on Calibrated Confidence and Uncertainty Measurement [54.55643652781891]
Conversational Question Answering (ConvQA) models aim at answering a question using its relevant paragraph and the question-answer pairs from previous turns of the conversation.
We propose to filter out inaccurate answers in the conversation history based on their estimated confidences and uncertainties from the ConvQA model.
We validate our models, Answer Selection-based realistic Conversation Question Answering, on two standard ConvQA datasets.
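The mechanism the summary describes is a filter over the conversation history: previous answers whose estimated confidence is too low (or whose uncertainty is too high) are dropped before being fed back as context for the next question. A minimal sketch, assuming simple per-turn confidence and uncertainty fields and arbitrary thresholds:

```python
def filter_history(history, min_confidence=0.7, max_uncertainty=0.3):
    """Keep only previous question-answer pairs the model is reasonably sure about.

    Each history item is assumed to carry a calibrated confidence and an
    uncertainty estimate for its earlier answer; the field names and the
    thresholds are illustrative, not the paper's actual values.
    """
    return [turn for turn in history
            if turn['confidence'] >= min_confidence
            and turn['uncertainty'] <= max_uncertainty]

# Toy usage: only the high-confidence first turn survives as context.
history = [
    {'question': 'Who wrote it?', 'answer': 'Orwell', 'confidence': 0.92, 'uncertainty': 0.05},
    {'question': 'When?', 'answer': '1948', 'confidence': 0.41, 'uncertainty': 0.55},
]
print(filter_history(history))
```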
arXiv Detail & Related papers (2023-02-10T09:42:07Z)
- Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning [54.61762276179205]
We propose a novel contrastive learning approach, MMBS, for building robust VQA models by Making the Most of Biased Samples.
Specifically, we construct positive samples for contrastive learning by eliminating the information related to spurious correlation from the original training samples.
We validate our contributions by achieving competitive performance on the OOD dataset VQA-CP v2 while preserving robust performance on the ID dataset VQA v2.
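The setup the summary outlines is a standard contrastive one: each training sample gets a positive view with the spurious-correlation information removed, and the model is trained to pull the representations of the two views together while pushing other samples apart. Below is a generic InfoNCE-style sketch of such a loss; how the positive view is actually constructed in MMBS, and all names here, are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor_emb, positive_emb, temperature=0.1):
    """InfoNCE-style loss: each anchor is attracted to its own positive view.

    anchor_emb, positive_emb: (batch, dim) embeddings of the original sample
    and of its bias-stripped positive counterpart; the other items in the
    batch serve as negatives. A generic formulation, not the paper's exact one.
    """
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    logits = anchor @ positive.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(anchor.size(0))         # i-th anchor matches i-th positive
    return F.cross_entropy(logits, targets)

# Toy usage with random embeddings standing in for VQA sample representations.
loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```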
arXiv Detail & Related papers (2022-10-10T11:05:21Z)
- Generative Bias for Robust Visual Question Answering [74.42555378660653]
We propose a generative method to train the bias model directly from the target model, called GenB.
In particular, GenB employs a generative network to learn the bias in the target model through a combination of the adversarial objective and knowledge distillation.
We show through extensive experiments the effects of our method on various VQA bias datasets including VQA-CP2, VQA-CP1, GQA-OOD, and VQA-CE.
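The recipe sketched in the summary trains the bias model with two signals: knowledge distillation from the target model's outputs and an adversarial objective. A loose sketch of how such a combined loss might be wired is below; the interfaces, the loss weights, and the discriminator term are assumptions for illustration, not GenB's actual architecture.

```python
import torch
import torch.nn.functional as F

def bias_model_loss(bias_logits, target_logits, disc_score_on_bias,
                    kd_weight=1.0, adv_weight=1.0):
    """Combined objective for a bias model trained against a frozen target model.

    - distillation term: pushes the bias model's answer distribution towards
      the target model's soft predictions;
    - adversarial term: the bias model tries to make a discriminator believe
      its outputs came from the target model (non-saturating GAN loss).
    Weights, signatures, and the discriminator interface are illustrative.
    """
    kd = F.kl_div(F.log_softmax(bias_logits, dim=-1),
                  F.softmax(target_logits, dim=-1),
                  reduction='batchmean')
    adv = F.binary_cross_entropy_with_logits(
        disc_score_on_bias, torch.ones_like(disc_score_on_bias))
    return kd_weight * kd + adv_weight * adv
```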
arXiv Detail & Related papers (2022-08-01T08:58:02Z)
- Reassessing Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization [27.437077941786768]
Vision-and-language (V&L) models pretrained on large-scale multimodal data have demonstrated strong performance on various tasks.
We evaluate two pretrained V&L models under different settings by conducting cross-dataset evaluations.
We find that these models tend to learn to solve the benchmark, rather than learning the high-level skills required by the VQA task.
arXiv Detail & Related papers (2022-05-24T16:44:45Z)
- Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
- BBQ: A Hand-Built Bias Benchmark for Question Answering [25.108222728383236]
It is well documented that NLP models learn social biases present in the world, but little work has been done to show how these biases manifest in actual model outputs for applied tasks like question answering (QA).
We introduce the Bias Benchmark for QA (BBQ), a dataset consisting of question-sets constructed by the authors that highlight attested social biases against people belonging to protected classes along nine different social dimensions relevant for U.S. English-speaking contexts.
We find that models strongly rely on stereotypes when the context is ambiguous, meaning that the model's outputs consistently reproduce harmful biases in this setting.
arXiv Detail & Related papers (2021-10-15T16:43:46Z)
- Exploring Strategies for Generalizable Commonsense Reasoning with Pre-trained Models [62.28551903638434]
We measure the impact of three different adaptation methods on the generalization and accuracy of models.
Experiments with two models show that fine-tuning performs best, by learning both the content and the structure of the task, but suffers from overfitting and limited generalization to novel answers.
We observe that alternative adaptation methods like prefix-tuning have comparable accuracy, but generalize better to unseen answers and are more robust to adversarial splits.
arXiv Detail & Related papers (2021-09-07T03:13:06Z)
- How Transferable are Reasoning Patterns in VQA? [10.439369423744708]
We argue that uncertainty in vision is a dominating factor preventing the successful learning of reasoning in vision and language problems.
We train a visual oracle and, in a large-scale study, provide experimental evidence that it is much less prone to exploiting spurious dataset biases.
We exploit these insights by transferring reasoning patterns from the oracle to a SOTA Transformer-based VQA model taking standard noisy visual inputs via fine-tuning.
arXiv Detail & Related papers (2021-04-08T10:18:45Z)
- UnQovering Stereotyping Biases via Underspecified Questions [68.81749777034409]
We present UNQOVER, a framework to probe and quantify biases through underspecified questions.
We show that a naive use of model scores can lead to incorrect bias estimates due to two forms of reasoning errors.
We use this metric to analyze four important classes of stereotypes: gender, nationality, ethnicity, and religion.
arXiv Detail & Related papers (2020-10-06T01:49:52Z)