Learning content and context with language bias for Visual Question Answering
- URL: http://arxiv.org/abs/2012.11134v1
- Date: Mon, 21 Dec 2020 06:22:50 GMT
- Title: Learning content and context with language bias for Visual Question Answering
- Authors: Chao Yang, Su Feng, Dongsheng Li, Huawei Shen, Guoqing Wang and Bin Jiang
- Abstract summary: We propose a novel learning strategy named CCB, which forces VQA models to answer questions relying on Content and Context with language bias.
CCB outperforms the state-of-the-art methods in terms of accuracy on VQA-CP v2.
- Score: 31.39505099600821
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual Question Answering (VQA) is a challenging multimodal task of answering questions about an image. Many works concentrate on reducing language bias, which makes models answer questions while ignoring visual content and language context. However, reducing language bias also weakens the ability of VQA models to learn the context prior. To address this issue, we propose a novel learning strategy named CCB, which forces VQA models to answer questions relying on Content and Context with language Bias. Specifically, CCB establishes Content and Context branches on top of a base VQA model and forces them to focus on local key content and global effective context, respectively. Moreover, a joint loss function is proposed to reduce the importance of biased samples while retaining their beneficial influence on answering questions. Experiments show that CCB outperforms state-of-the-art methods in terms of accuracy on VQA-CP v2.
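The abstract describes the architecture only at a high level, so below is a minimal, hypothetical sketch of the idea: Content and Context branches placed on top of a base VQA model's fused features, trained with a joint loss that down-weights samples a language-only bias branch already answers confidently while still keeping their contribution. The branch structures, the bias-confidence weighting, and all names here are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the CCB idea described in the abstract: two branches
# (Content and Context) on top of a base VQA model, combined by a joint loss
# that down-weights biased samples instead of discarding them.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CCBHead(nn.Module):
    def __init__(self, fused_dim: int, num_answers: int):
        super().__init__()
        # Content branch: intended to focus on local key visual content (assumed MLP).
        self.content_branch = nn.Sequential(
            nn.Linear(fused_dim, fused_dim), nn.ReLU(), nn.Linear(fused_dim, num_answers)
        )
        # Context branch: intended to model global effective context (assumed MLP).
        self.context_branch = nn.Sequential(
            nn.Linear(fused_dim, fused_dim), nn.ReLU(), nn.Linear(fused_dim, num_answers)
        )

    def forward(self, fused_features: torch.Tensor):
        return self.content_branch(fused_features), self.context_branch(fused_features)

def joint_loss(content_logits, context_logits, bias_logits, targets):
    """Assumed joint objective: standard VQA binary cross-entropy on both branches,
    with each sample re-weighted by how well a language-only bias branch already
    predicts it, so heavily biased samples contribute less but are not removed."""
    with torch.no_grad():
        # Confidence of the language-bias branch on the ground-truth answers.
        bias_conf = (torch.sigmoid(bias_logits) * targets).max(dim=1).values
        weights = 1.0 - bias_conf  # down-weight samples the bias already solves
    per_sample = (
        F.binary_cross_entropy_with_logits(content_logits, targets, reduction="none").mean(dim=1)
        + F.binary_cross_entropy_with_logits(context_logits, targets, reduction="none").mean(dim=1)
    )
    return (weights * per_sample).mean()
```

In practice the two branches would likely consume different feature granularities (region-level versus global); the sketch collapses them into a shared fused vector for brevity.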
Related papers
- Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA [19.6585442152102]
We study the Knowledge-Based visual question-answering problem, in which, given a question, the model needs to ground it in the visual modality to find the answer.
Our study shows that replacing a complex question with several simpler questions helps to extract more relevant information from the image.
arXiv Detail & Related papers (2024-06-27T02:19:38Z)
- Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts [54.072432123447854]
Visual question answering (VQA) is the task of answering questions about an image.
Answering the question requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image.
We propose a framework that uses language guidance (LG) in the form of rationales, image captions, scene graphs, etc., to answer questions more accurately.
arXiv Detail & Related papers (2023-10-31T03:54:11Z)
- Overcoming Language Bias in Remote Sensing Visual Question Answering via Adversarial Training [22.473676537463607]
Visual Question Answering (VQA) models commonly face the challenge of language bias.
We present a novel framework to reduce the language bias of the VQA for remote sensing data.
arXiv Detail & Related papers (2023-06-01T09:32:45Z)
- Unveiling Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA [111.41719652451701]
We first model a confounding effect that causes language and vision bias simultaneously.
We then propose a counterfactual inference to remove the influence of this effect.
The proposed method outperforms state-of-the-art methods on the VQA-CP v2 dataset.
arXiv Detail & Related papers (2023-05-31T09:02:58Z)
- SC-ML: Self-supervised Counterfactual Metric Learning for Debiased Visual Question Answering [10.749155815447127]
We propose a self-supervised counterfactual metric learning (SC-ML) method to better focus on image features.
SC-ML can adaptively select the question-relevant visual features to answer the question, reducing the negative influence of question-irrelevant visual features on inferring answers.
arXiv Detail & Related papers (2023-04-04T09:05:11Z)
- AdaVQA: Overcoming Language Priors with Adapted Margin Cosine Loss [73.65872901950135]
This work attempts to tackle the language prior problem from the viewpoint of feature space learning.
An adapted margin cosine loss is designed to discriminate the frequent and the sparse answer feature space.
Experimental results demonstrate that our adapted margin cosine loss can greatly enhance the baseline models.
arXiv Detail & Related papers (2021-05-05T11:41:38Z)
- Learning from Lexical Perturbations for Consistent Visual Question Answering [78.21912474223926]
Existing Visual Question Answering (VQA) models are often fragile and sensitive to input variations.
We propose a novel approach to address this issue based on modular networks, which creates two questions related by linguistic perturbations.
We also present VQA Perturbed Pairings (VQA P2), a new, low-cost benchmark and augmentation pipeline to create controllable linguistic variations.
arXiv Detail & Related papers (2020-11-26T17:38:03Z)
- Loss re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
arXiv Detail & Related papers (2020-10-30T00:57:17Z)
- Counterfactual VQA: A Cause-Effect Look at Language Bias [117.84189187160005]
VQA models tend to rely on language bias as a shortcut and fail to sufficiently learn the multi-modal knowledge from both vision and language.
We propose a novel counterfactual inference framework, which enables us to capture the language bias as the direct causal effect of questions on answers.
arXiv Detail & Related papers (2020-06-08T01:49:27Z)
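To make the counterfactual-inference idea in the last entry above (Counterfactual VQA) concrete, here is a minimal sketch in the same spirit: a question-only branch approximates the direct causal effect of the question, and that effect is subtracted from the total effect at inference time. The fusion function and the zero placeholder for the "no vision" counterfactual are assumptions, not the paper's exact formulation.

```python
# Minimal sketch of counterfactual debiasing for VQA: subtract the direct
# effect of the question (language-only prediction) from the total effect.
import torch

def fused_score(vqa_logits, q_only_logits):
    # Assumed fusion of the multimodal and question-only branches.
    return torch.log(torch.sigmoid(vqa_logits) * torch.sigmoid(q_only_logits) + 1e-9)

def debiased_inference(vqa_logits, q_only_logits):
    # Total effect: both vision and language are present.
    total_effect = fused_score(vqa_logits, q_only_logits)
    # Natural direct effect: what the question alone would predict
    # (visual input replaced by a constant "no-treatment" value, assumed zeros).
    direct_effect = fused_score(torch.zeros_like(vqa_logits), q_only_logits)
    # Debiased prediction = total effect minus natural direct effect.
    return total_effect - direct_effect
```

The intent of the subtraction is to remove what the question alone would predict, leaving the part of the answer score that depends on the image.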
This list is automatically generated from the titles and abstracts of the papers on this site.