An Empirical Study on the Language Modal in Visual Question Answering
- URL: http://arxiv.org/abs/2305.10143v2
- Date: Tue, 5 Sep 2023 02:52:36 GMT
- Title: An Empirical Study on the Language Modal in Visual Question Answering
- Authors: Daowan Peng, Wei Wei, Xian-Ling Mao, Yuanyuan Fu, Dangyang Chen
- Abstract summary: Generalization beyond in-domain experience to out-of-distribution data is of paramount significance in the AI domain.
This paper attempts to provide new insights into the influence of language modality on VQA performance.
- Score: 31.692905677913068
- License: http://creativecommons.org/licenses/by/4.0/
Abstract: Generalization beyond in-domain experience to out-of-distribution data is of paramount significance in the AI domain. Of late, state-of-the-art Visual Question Answering (VQA) models have shown impressive performance on in-domain data, partly owing to language prior bias, which, however, hinders their generalization ability in practice. This paper attempts to provide new insights into the influence of the language modality on VQA performance from an empirical-study perspective. To this end, we conducted a series of experiments on six models. The results revealed that 1) apart from the prior bias caused by question types, postfix-related bias also plays a notable role in inducing biased predictions, and 2) training VQA models with word-sequence-related variant questions improved performance on the out-of-distribution benchmark, with LXMERT even achieving a 10-point gain without adopting any debiasing method. We delved into the underlying reasons for these results and put forward simple proposals to reduce the models' dependency on language priors. The experimental results demonstrated the effectiveness of our proposed method in improving performance on the out-of-distribution benchmark VQA-CP v2. We hope this study can inspire novel insights for future research on designing bias-reduction approaches.
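To make the word-sequence manipulation concrete, here is a minimal sketch of generating shuffled-word variants of a training question. The whole-question shuffle and the `word_sequence_variants` helper are hypothetical illustrations, not the paper's exact augmentation recipe.

```python
import random

def word_sequence_variants(question: str, n_variants: int = 2, seed: int = 0) -> list[str]:
    """Return word-order-shuffled variants of a VQA question.

    Hypothetical augmentation helper: the study reports that training with
    word-sequence-related variant questions improves OOD accuracy; the exact
    variant construction used in the paper may differ from this shuffle.
    """
    rng = random.Random(seed)
    words = question.rstrip("?").split()
    variants = []
    for _ in range(n_variants):
        shuffled = words[:]
        rng.shuffle(shuffled)
        variants.append(" ".join(shuffled) + "?")
    return variants

# Example: augment a training question with two shuffled variants.
print(word_sequence_variants("What color is the dog?"))
```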
Related papers
- Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the prior of the underlying Large Language Models (LLMs) rather than by the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
- Robust Visual Question Answering: Datasets, Methods, and Future Challenges [23.59923999144776]
Visual question answering requires a system to provide an accurate natural language answer given an image and a natural language question.
Previous generic VQA methods often exhibit a tendency to memorize biases present in the training data rather than learning proper behaviors, such as grounding images before predicting answers.
Various datasets and debiasing methods have been proposed to evaluate and enhance VQA robustness, respectively.
arXiv Detail & Related papers (2023-07-21T10:12:09Z)
- Unveiling Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA [111.41719652451701]
We first model a confounding effect that causes language and vision bias simultaneously.
We then propose a counterfactual inference to remove the influence of this effect.
The proposed method outperforms the state-of-the-art methods on the VQA-CP v2 dataset.
arXiv Detail & Related papers (2023-05-31T09:02:58Z)
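A minimal sketch of the counterfactual idea described in the entry above: estimate the language-only effect with a question-only branch and subtract it from the full image+question prediction. The subtraction form and the `alpha` scale are assumptions, not the paper's exact formulation.

```python
import torch

def counterfactual_debias(logits_vq: torch.Tensor,
                          logits_q: torch.Tensor,
                          alpha: float = 1.0) -> torch.Tensor:
    # logits_vq: answer logits from the full image+question model.
    # logits_q:  answer logits from a question-only (language prior) branch.
    # Subtracting the scaled language-only effect approximates the
    # counterfactual "what would the model answer without the bias".
    return logits_vq - alpha * logits_q
```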
- Fairness-guided Few-shot Prompting for Large Language Models [93.05624064699965]
In-context learning can suffer from high instability due to variations in training examples, example order, and prompt formats.
We introduce a metric to evaluate the predictive bias of a fixed prompt against labels or given attributes.
We propose a novel search strategy based on greedy search to identify the near-optimal prompt for improving the performance of in-context learning.
arXiv Detail & Related papers (2023-03-23T12:28:25Z)
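The greedy strategy in the entry above might look like the following sketch. The `candidates` pool and the `bias_metric` callable (scoring how skewed the model's label distribution is under a candidate prompt) are hypothetical stand-ins, not the paper's actual API.

```python
def greedy_prompt_search(candidates, bias_metric, max_shots=4):
    """Greedily pick demonstrations that minimize a predictive-bias metric."""
    prompt, remaining = [], list(candidates)
    while remaining and len(prompt) < max_shots:
        # Pick the candidate whose addition yields the least-biased prompt.
        best = min(remaining, key=lambda ex: bias_metric(prompt + [ex]))
        # Stop early if adding the best candidate no longer reduces bias.
        if prompt and bias_metric(prompt + [best]) >= bias_metric(prompt):
            break
        prompt.append(best)
        remaining.remove(best)
    return prompt
```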
- Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning [54.61762276179205]
We propose a novel contrastive learning approach, MMBS, for building robust VQA models by Making the Most of Biased Samples.
Specifically, we construct positive samples for contrastive learning by eliminating the information related to spurious correlation from the original training samples.
We validate our contributions by achieving competitive performance on the OOD dataset VQA-CP v2 while preserving robust performance on the ID dataset VQA v2.
arXiv Detail & Related papers (2022-10-10T11:05:21Z)
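One plausible way to build the positive samples mentioned above is to strip the question words most tied to spurious correlations, such as the question-type prefix. The prefix list and the `make_positive_question` helper are illustrative assumptions, not MMBS's exact construction.

```python
# Hypothetical question-type prefixes that carry most of the language prior.
QUESTION_TYPE_PREFIXES = ("what color is", "how many", "is there", "what is")

def make_positive_question(question: str) -> str:
    """Build a contrastive positive by deleting the question-type prefix."""
    q = question.lower().rstrip("?")
    for prefix in QUESTION_TYPE_PREFIXES:
        if q.startswith(prefix):
            # Keep only the content words, dropping the prior-laden prefix.
            return q[len(prefix):].strip() + "?"
    return q + "?"
```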
- Pre-training also Transfers Non-Robustness [20.226917627173126]
Despite its recognized contribution to generalization, pre-training also transfers non-robustness from the pre-trained model to the fine-tuned model.
Results validate the effectiveness of their approach in alleviating non-robustness while preserving generalization.
arXiv Detail & Related papers (2021-06-21T11:16:13Z)
- Loss Re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
arXiv Detail & Related papers (2020-10-30T00:57:17Z)
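A minimal sketch of the class-imbalance view above: re-scale the per-answer cross-entropy weights by inverse answer frequency, so frequent answers are down-weighted and the model is penalized more for defaulting to them. The `answer_counts` tensor and this particular weighting are assumptions, not the paper's exact re-scaling rule.

```python
import torch
import torch.nn.functional as F

def rescaled_vqa_loss(logits: torch.Tensor,
                      targets: torch.Tensor,
                      answer_counts: torch.Tensor) -> torch.Tensor:
    """Cross-entropy with inverse-frequency class weights over answers."""
    # Weight each answer class by the inverse of its training frequency.
    weights = 1.0 / answer_counts.float().clamp(min=1.0)
    # Normalize so the average class weight is 1, keeping the loss scale.
    weights = weights / weights.sum() * len(weights)
    return F.cross_entropy(logits, targets, weight=weights)
```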
- MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering [58.30291671877342]
We present MUTANT, a training paradigm that exposes the model to perceptually similar, yet semantically distinct mutations of the input.
MUTANT establishes a new state-of-the-art accuracy on VQA-CP with a 10.57% improvement.
arXiv Detail & Related papers (2020-09-18T00:22:54Z)
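A toy, text-only mutation in the spirit of the entry above: a perceptually small edit that changes the semantics of the sample. The swap table and the naive answer-update rule are hypothetical illustrations; MUTANT's actual mutations also operate on images and keep question and answer consistent.

```python
# Hypothetical word swaps for mutating a (question, answer) pair.
SWAPS = {"yes": "no", "no": "yes", "left": "right", "right": "left"}

def mutate_sample(question: str, answer: str) -> tuple[str, str]:
    # Perceptually small edit: swap single marked words in the question,
    # yielding a similar surface form with a different meaning.
    mutated_q = " ".join(SWAPS.get(w.lower(), w) for w in question.split())
    # Naive answer update: flip the answer only when it is itself a
    # swapped word (e.g. yes/no); otherwise leave it unchanged.
    mutated_a = SWAPS.get(answer.lower(), answer)
    return mutated_q, mutated_a
```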