Reducing Language Biases in Visual Question Answering with
Visually-Grounded Question Encoder
- URL: http://arxiv.org/abs/2007.06198v2
- Date: Sat, 18 Jul 2020 13:09:29 GMT
- Title: Reducing Language Biases in Visual Question Answering with
Visually-Grounded Question Encoder
- Authors: Gouthaman KV and Anurag Mittal
- Abstract summary: We propose a novel model-agnostic question encoder, the Visually-Grounded Question Encoder (VGQE), for VQA.
VGQE utilizes both visual and language modalities equally while encoding the question.
We demonstrate the effect of VGQE on three recent VQA models and achieve state-of-the-art results.
- Score: 12.56413718364189
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies have shown that current VQA models are heavily biased towards the
language priors in the training set when answering questions, irrespective of the
image; e.g., they overwhelmingly answer "what sport is" with "tennis" and "what color
banana" with "yellow." This behavior limits their use in real-world application
scenarios. In this work, we propose a novel model-agnostic question encoder,
Visually-Grounded Question Encoder (VGQE), for VQA that reduces this effect.
VGQE utilizes both visual and language modalities equally while encoding the
question. Hence the question representation itself gets sufficient
visual-grounding, and thus reduces the dependency of the model on the language
priors. We demonstrate the effect of VGQE on three recent VQA models and
achieve state-of-the-art results on the bias-sensitive split of the VQAv2
dataset; VQA-CPv2. Further, unlike the existing bias-reduction techniques, on
the standard VQAv2 benchmark, our approach does not drop the accuracy; instead,
it improves the performance.
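The abstract describes the encoder only at a high level. As a rough illustration, here is a minimal PyTorch-style sketch of one way such a visually-grounded question encoder could look: each question word attends over image region features, and the word embedding is fused with its attended visual feature before being passed to the recurrent encoder. All class names, dimensions, and the attention formulation are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a visually-grounded question encoder (PyTorch).
# Every word attends over image regions; the word embedding is concatenated
# with its attended visual feature before entering the GRU, so the question
# representation is grounded in the image at every encoding step.
import torch
import torch.nn as nn


class VisuallyGroundedQuestionEncoder(nn.Module):
    def __init__(self, vocab_size, word_dim=300, vis_dim=2048, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.word_proj = nn.Linear(word_dim, hidden_dim)   # word -> shared space
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)     # region -> shared space
        # The recurrent encoder sees the word AND its grounded visual context.
        self.gru = nn.GRU(word_dim + hidden_dim, hidden_dim, batch_first=True)

    def forward(self, question_tokens, region_feats):
        # question_tokens: (B, T) word indices; region_feats: (B, R, vis_dim)
        w = self.embed(question_tokens)                     # (B, T, word_dim)
        v = self.vis_proj(region_feats)                     # (B, R, hidden_dim)
        q = self.word_proj(w)                               # (B, T, hidden_dim)
        att = torch.softmax(q @ v.transpose(1, 2), dim=-1)  # (B, T, R) word-to-region
        grounded = att @ v                                  # (B, T, hidden_dim)
        _, h = self.gru(torch.cat([w, grounded], dim=-1))   # encode grounded words
        return h.squeeze(0)                                 # (B, hidden_dim)
```

Under this reading, a host VQA model would consume this encoding in place of the output of its usual question-only RNN, which is what makes the encoder model-agnostic.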
Related papers
- Overcoming Language Bias in Remote Sensing Visual Question Answering via
Adversarial Training [22.473676537463607]
Visual Question Answering (VQA) models commonly face the challenge of language bias.
We present a novel framework to reduce the language bias of VQA models for remote sensing data.
arXiv Detail & Related papers (2023-06-01T09:32:45Z)
- Overcoming Language Priors in Visual Question Answering via Distinguishing Superficially Similar Instances [17.637150597493463]
We propose a novel training framework that explicitly encourages the VQA model to distinguish between superficially similar instances.
We exploit the proposed distinguishing module to increase the distance between the instance and its counterparts in the answer space.
Experimental results show that our method achieves the state-of-the-art performance on VQA-CP v2.
arXiv Detail & Related papers (2022-09-18T10:30:44Z)
- Human-Adversarial Visual Question Answering [62.30715496829321]
We benchmark state-of-the-art VQA models against human-adversarial examples.
We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples.
arXiv Detail & Related papers (2021-06-04T06:25:32Z)
- AdaVQA: Overcoming Language Priors with Adapted Margin Cosine Loss [73.65872901950135]
This work attempts to tackle the language prior problem from the viewpoint of the feature space learning.
An adapted margin cosine loss is designed to separate the feature spaces of frequent and sparse answers.
Experimental results demonstrate that our adapted margin cosine loss can greatly enhance the baseline models (a hedged sketch of a generic margin cosine loss is given after this list).
arXiv Detail & Related papers (2021-05-05T11:41:38Z)
- Overcoming Language Priors with Self-supervised Learning for Visual Question Answering [62.88124382512111]
Most Visual Question Answering (VQA) models suffer from the language prior problem.
We introduce a self-supervised learning framework to solve this problem.
Our method can significantly outperform the state-of-the-art.
arXiv Detail & Related papers (2020-12-17T12:30:12Z)
- Loss re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
arXiv Detail & Related papers (2020-10-30T00:57:17Z)
- Contrast and Classify: Training Robust VQA Models [60.80627814762071]
We propose a novel training paradigm (ConClaT) that optimizes both cross-entropy and contrastive losses.
We find that optimizing both losses -- either alternately or jointly -- is key to effective training (a minimal sketch of such a joint objective is given after this list).
arXiv Detail & Related papers (2020-10-13T00:23:59Z)
- Estimating semantic structure for the VQA answer space [6.49970685896541]
We show that our approach is completely model-agnostic, yielding consistent improvements with three different VQA models.
We report SOTA-level performance on the challenging VQAv2-CP dataset.
arXiv Detail & Related papers (2020-06-10T08:32:56Z)
- Visual Grounding Methods for VQA are Working for the Wrong Reasons! [24.84797949716142]
We show that the performance improvements are not a result of improved visual grounding, but a regularization effect.
We propose a simpler regularization scheme that does not require any external annotations and yet achieves near state-of-the-art performance on VQA-CPv2.
arXiv Detail & Related papers (2020-04-12T21:45:23Z)
- In Defense of Grid Features for Visual Question Answering [65.71985794097426]
We revisit grid features for visual question answering (VQA) and find they can work surprisingly well.
We verify that this observation holds true across different VQA models and generalizes well to other tasks like image captioning.
We learn VQA models end-to-end, from pixels directly to answers, and show that strong performance is achievable without using any region annotations in pre-training.
arXiv Detail & Related papers (2020-01-10T18:59:13Z)
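The AdaVQA entry above mentions discriminating frequent and sparse answers with an adapted margin cosine loss. Below is a minimal, generic margin cosine (CosFace-style) loss over an answer classifier; the fixed scalar margin and all names are assumptions, and the paper's per-answer adaptation of the margin is not reproduced here.

```python
# Generic margin cosine loss over an answer classifier (sketch, PyTorch).
# The fixed scalar margin is a simplification of AdaVQA's adapted margin.
import torch
import torch.nn.functional as F


def margin_cosine_loss(features, answer_prototypes, labels, scale=30.0, margin=0.35):
    # features:          (B, D) fused image-question embeddings
    # answer_prototypes: (C, D) one learnable vector per candidate answer
    f = F.normalize(features, dim=-1)
    w = F.normalize(answer_prototypes, dim=-1)
    cos = f @ w.t()                                    # (B, C) cosine similarities
    one_hot = F.one_hot(labels, cos.size(1)).float()
    # Subtracting the margin only from the target class forces answer features
    # to be separated by at least the margin in cosine space.
    logits = scale * (cos - margin * one_hot)
    return F.cross_entropy(logits, labels)
```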
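The Contrast and Classify (ConClaT) entry above describes jointly optimizing cross-entropy and contrastive losses. Here is a minimal sketch of one way to combine the two objectives, using an NT-Xent-style contrastive term over two augmented views; the weighting, the augmentation scheme, and the exact contrastive formulation are assumptions rather than the paper's method.

```python
# Joint cross-entropy + contrastive objective (sketch, PyTorch).
import torch
import torch.nn.functional as F


def joint_ce_contrastive_loss(logits, labels, z1, z2, temperature=0.1, weight=1.0):
    # logits: (B, C) answer scores; z1, z2: (B, D) embeddings of two augmented
    # views of the same image-question pair (positives sit on the diagonal).
    ce = F.cross_entropy(logits, labels)
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    sim = z1 @ z2.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    contrastive = F.cross_entropy(sim, targets)        # pull matching views together
    return ce + weight * contrastive
```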
This list is automatically generated from the titles and abstracts of the papers on this site.