Unveiling Cross Modality Bias in Visual Question Answering: A Causal
View with Possible Worlds VQA
- URL: http://arxiv.org/abs/2305.19664v1
- Date: Wed, 31 May 2023 09:02:58 GMT
- Title: Unveiling Cross Modality Bias in Visual Question Answering: A Causal
View with Possible Worlds VQA
- Authors: Ali Vosoughi, Shijian Deng, Songyang Zhang, Yapeng Tian, Chenliang Xu,
Jiebo Luo
- Abstract summary: We first model a confounding effect that causes language and vision bias simultaneously.
We then propose a counterfactual inference to remove the influence of this effect.
The proposed method outperforms state-of-the-art methods on the VQA-CP v2 dataset.
- Score: 111.41719652451701
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To increase the generalization capability of VQA systems, many recent studies
have tried to de-bias spurious language or vision associations that shortcut
the question or image to the answer. Despite these efforts, the literature
fails to address the confounding effect of vision and language simultaneously.
As a result, when they reduce bias learned from one modality, they usually
increase bias from the other. In this paper, we first model a confounding
effect that causes language and vision bias simultaneously, then propose a
counterfactual inference to remove the influence of this effect. A model
trained with this strategy can concurrently and efficiently reduce both vision
and language bias. To the best of our knowledge, this is the first work to reduce
biases resulting from confounding effects of vision and language in VQA,
leveraging causal explain-away relations. We accompany our method with an
explain-away strategy that improves accuracy on questions with numerical
answers, which has remained an open problem for existing methods. The proposed
method outperforms state-of-the-art methods on the VQA-CP v2 dataset.
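The abstract stops short of spelling out the inference rule, so the following is a minimal sketch of the described strategy: learn a stand-in for the confounder that biases both modalities, then subtract its counterfactual contribution from the factual prediction. The module sizes, fusion layer, confounder parameterization, and blending weight `alpha` are all illustrative assumptions, not the authors' released implementation.
```python
import torch
import torch.nn as nn

class CounterfactualVQA(nn.Module):
    """Illustrative sketch: adjust the factual prediction by subtracting
    the logits produced in a counterfactual world where the (image,
    question) input is replaced by a learned confounder representation."""

    def __init__(self, dim: int = 512, num_answers: int = 3129):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.classifier = nn.Linear(dim, num_answers)
        # Learned stand-in for the confounder biasing both modalities.
        self.confounder = nn.Parameter(torch.zeros(1, 2 * dim))

    def forward(self, v_feat, q_feat):
        joint = torch.cat([v_feat, q_feat], dim=-1)  # factual input
        factual = self.classifier(self.fuse(joint))
        counterfactual = self.classifier(self.fuse(self.confounder))
        alpha = 1.0  # assumed blending weight (a tunable hyperparameter)
        return factual - alpha * counterfactual      # remove confounding

model = CounterfactualVQA()
v = torch.randn(4, 512)   # image features from any frozen backbone
q = torch.randn(4, 512)   # question features from any text encoder
print(model(v, q).shape)  # torch.Size([4, 3129])
```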
Related papers
- Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the prior of the underlying Large Language Model (LLM) rather than by the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
- Overcoming Language Bias in Remote Sensing Visual Question Answering via Adversarial Training [22.473676537463607]
Visual Question Answering (VQA) models commonly face the challenge of language bias.
We present a novel framework to reduce the language bias of VQA models for remote sensing data; a generic sketch of adversarial debiasing appears after this list.
arXiv Detail & Related papers (2023-06-01T09:32:45Z)
- An Empirical Study on the Language Modal in Visual Question Answering [31.692905677913068]
Generalization beyond in-domain experience to out-of-distribution data is of paramount significance in the AI domain.
This paper attempts to provide new insights into the influence of language modality on VQA performance.
arXiv Detail & Related papers (2023-05-17T11:56:40Z)
- Visual Perturbation-aware Collaborative Learning for Overcoming the Language Prior Problem [60.0878532426877]
We propose a novel collaborative learning scheme from the viewpoint of visual perturbation calibration.
Specifically, we devise a visual controller to construct two sorts of curated images with different perturbation extents.
The experimental results on two diagnostic VQA-CP benchmark datasets evidently demonstrate its effectiveness.
arXiv Detail & Related papers (2022-07-24T23:50:52Z)
- Language bias in Visual Question Answering: A Survey and Taxonomy [0.0]
We conduct a comprehensive review and analysis of this field for the first time.
We classify the existing methods into three categories, including enhancing visual information.
The causes of language bias are revealed and classified.
arXiv Detail & Related papers (2021-11-16T15:01:24Z)
- Overcoming Language Priors with Self-supervised Learning for Visual Question Answering [62.88124382512111]
Most Visual Question Answering (VQA) models suffer from the language prior problem.
We introduce a self-supervised learning framework to solve this problem.
Our method can significantly outperform the state-of-the-art.
arXiv Detail & Related papers (2020-12-17T12:30:12Z)
- Loss re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class-imbalance interpretation on other computer vision tasks, such as face recognition and image classification. A generic loss re-weighting sketch of this view appears after this list.
arXiv Detail & Related papers (2020-10-30T00:57:17Z)
- Counterfactual VQA: A Cause-Effect Look at Language Bias [117.84189187160005]
VQA models tend to rely on language bias as a shortcut and fail to sufficiently learn the multi-modal knowledge from both vision and language.
We propose a novel counterfactual inference framework, which captures the language bias as the direct causal effect of questions on answers. A minimal sketch of the effect subtraction appears after this list.
arXiv Detail & Related papers (2020-06-08T01:49:27Z)
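The remote sensing entry above names adversarial training as its debiasing mechanism without further detail. A standard way to realize that idea is a gradient reversal layer that stops the question encoder from carrying answer-predictive shortcuts on its own; the sketch below illustrates the generic technique with toy layer sizes and is not the paper's architecture.
```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the
    backward pass, so the encoder is trained to defeat the adversary."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

q_encoder = nn.Linear(300, 512)     # toy question encoder
answer_head = nn.Linear(512, 1000)  # main VQA classifier (toy)
adversary = nn.Linear(512, 1000)    # tries to answer from the question alone

q = torch.randn(8, 300)
h = q_encoder(q)
main_logits = answer_head(h)                       # normal training path
adv_logits = adversary(GradReverse.apply(h, 1.0))  # gradient-reversed path
# Training would combine cross-entropy losses on both heads; the reversed
# gradient pushes q_encoder to discard question-only shortcuts.
```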
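The class-imbalance entry interprets language priors as a skewed answer distribution. A common instantiation of loss re-scaling is weighting the cross-entropy loss by inverse answer frequency; the counts and weighting rule below are toy assumptions, not the paper's exact re-scaling scheme.
```python
import torch
import torch.nn as nn

# Toy answer-frequency counts; in practice these come from the training set.
answer_counts = torch.tensor([9000.0, 700.0, 250.0, 50.0])
weights = 1.0 / answer_counts
weights = weights * len(weights) / weights.sum()  # normalize around 1.0

criterion = nn.CrossEntropyLoss(weight=weights)   # rare answers weigh more
logits = torch.randn(16, 4)
targets = torch.randint(0, 4, (16,))
print(criterion(logits, targets).item())
```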
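The Counterfactual VQA entry treats language bias as the direct causal effect of the question on the answer and removes it by subtracting a question-only prediction from the total prediction. The branch networks, fusion function, and the constant standing in for the "vision blocked" world below are illustrative assumptions.
```python
import torch
import torch.nn as nn

num_ans = 1000
vq_branch = nn.Linear(1024, num_ans)    # multimodal branch (toy)
q_branch = nn.Linear(512, num_ans)      # question-only branch (toy)
c = nn.Parameter(torch.zeros(num_ans))  # assumed logits for the no-vision world

def fuse(z_vq, z_q):
    # A nonlinear fusion keeps the subtraction below from being a no-op.
    return torch.log(torch.sigmoid(z_vq + z_q) + 1e-9)

vq_feat = torch.randn(8, 1024)
q_feat = torch.randn(8, 512)

total_effect = fuse(vq_branch(vq_feat), q_branch(q_feat))  # factual world
direct_effect = fuse(c, q_branch(q_feat))                  # vision blocked
debiased = total_effect - direct_effect  # total effect minus direct effect
print(debiased.argmax(-1))               # debiased answer indices
```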