Visual Grounding Methods for VQA are Working for the Wrong Reasons!
- URL: http://arxiv.org/abs/2004.05704v4
- Date: Tue, 23 Apr 2024 13:38:36 GMT
- Title: Visual Grounding Methods for VQA are Working for the Wrong Reasons!
- Authors: Robik Shrestha, Kushal Kafle, Christopher Kanan
- Abstract summary: We show that the performance improvements are not a result of improved visual grounding, but a regularization effect.
We propose a simpler regularization scheme that does not require any external annotations and yet achieves near state-of-the-art performance on VQA-CPv2.
- Score: 24.84797949716142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing Visual Question Answering (VQA) methods tend to exploit dataset biases and spurious statistical correlations, instead of producing right answers for the right reasons. To address this issue, recent bias mitigation methods for VQA propose to incorporate visual cues (e.g., human attention maps) to better ground the VQA models, showcasing impressive gains. However, we show that the performance improvements are not a result of improved visual grounding, but a regularization effect which prevents over-fitting to linguistic priors. For instance, we find that it is not actually necessary to provide proper, human-based cues; random, insensible cues also result in similar improvements. Based on this observation, we propose a simpler regularization scheme that does not require any external annotations and yet achieves near state-of-the-art performance on VQA-CPv2.
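The observation is easiest to see as a loss term. Below is a minimal, hedged sketch in PyTorch of a cue-alignment regularizer of the kind grounding-based methods add on top of the answer loss; the function names (`cue_alignment_loss`, `total_loss`), the KL-divergence form, and the weight `lam` are illustrative assumptions rather than the formulation of any specific method, and the paper's own annotation-free scheme is simpler still. The `use_random_cue` switch mirrors the finding that random, insensible cues give gains similar to human attention maps, i.e. the term behaves mostly as a regularizer.

```python
# Minimal sketch (not the paper's exact method): a cue-alignment regularizer of the
# kind used by grounding-based VQA methods, with an option to replace the human
# attention map by random noise.
import torch
import torch.nn.functional as F

def cue_alignment_loss(model_attn, human_attn=None, use_random_cue=False):
    """model_attn: (batch, num_regions) attention produced by the VQA model.
    human_attn: (batch, num_regions) non-negative human importance scores, or None."""
    if use_random_cue or human_attn is None:
        # "Insensible" cue: random importance scores instead of human annotations.
        human_attn = torch.rand_like(model_attn)
    target = F.normalize(human_attn, p=1, dim=-1)   # turn scores into a distribution
    pred = torch.log_softmax(model_attn, dim=-1)
    return F.kl_div(pred, target, reduction="batchmean")

def total_loss(answer_logits, answer_targets, model_attn, human_attn, lam=1.0):
    # Standard soft-target VQA answer loss plus the (possibly random-cue) regularizer.
    vqa = F.binary_cross_entropy_with_logits(answer_logits, answer_targets)
    reg = cue_alignment_loss(model_attn, human_attn, use_random_cue=True)
    return vqa + lam * reg
```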
Related papers
- Human-Adversarial Visual Question Answering [62.30715496829321]
We benchmark state-of-the-art VQA models against human-adversarial examples.
We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples.
arXiv Detail & Related papers (2021-06-04T06:25:32Z)
- Answer Questions with Right Image Regions: A Visual Attention Regularization Approach [46.55924742590242]
We propose a novel visual attention regularization approach, namely AttReg, for better visual grounding in Visual Question Answering (VQA)
AttReg identifies the image regions which are essential for question answering yet unexpectedly ignored by the backbone model.
It can achieve a new state-of-the-art accuracy of 59.92% with an absolute performance gain of 6.93% on the VQA-CP v2 benchmark dataset.
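The summary above describes the mechanism only at a high level; the following is a hedged sketch of one way such an attention regularizer could be written, not the AttReg objective itself (the hinge form, `margin`, and function name are assumptions for illustration).

```python
# Illustrative sketch only (not the official AttReg implementation): penalize the
# backbone when it assigns low attention to regions flagged as essential for the
# question but currently ignored.
import torch

def attention_regularizer(attn, essential_mask, margin=0.1):
    """attn: (batch, num_regions) attention weights summing to 1 per example.
    essential_mask: (batch, num_regions) 0/1 float mask of essential-but-ignored regions."""
    # Mean attention the model currently places on the flagged regions.
    flagged = (attn * essential_mask).sum(dim=-1) / essential_mask.sum(dim=-1).clamp(min=1)
    # Hinge-style penalty when that attention falls below a small margin.
    return torch.clamp(margin - flagged, min=0).mean()
```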
arXiv Detail & Related papers (2021-02-03T07:33:30Z)
- Overcoming Language Priors with Self-supervised Learning for Visual Question Answering [62.88124382512111]
Most Visual Question Answering (VQA) models suffer from the language prior problem.
We introduce a self-supervised learning framework to solve this problem.
Our method can significantly outperform the state-of-the-art.
arXiv Detail & Related papers (2020-12-17T12:30:12Z)
- Loss Re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
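As a hedged illustration of the class-imbalance view (not the paper's exact re-scaling rule), the sketch below down-weights frequent answers and up-weights rare ones in a standard classification loss; `rescaled_vqa_loss` and the inverse-frequency weighting are assumptions for illustration.

```python
# Hedged sketch: re-scale the VQA classification loss by inverse answer frequency
# so that frequent answers contribute less and rare answers contribute more.
import torch
import torch.nn.functional as F

def rescaled_vqa_loss(logits, answer_ids, answer_counts):
    """logits: (batch, num_answers); answer_ids: (batch,) ground-truth answer indices;
    answer_counts: (num_answers,) training-set frequency of each answer."""
    # Inverse-frequency weights, normalized so the average weight is 1.
    weights = 1.0 / answer_counts.float().clamp(min=1)
    weights = weights * (len(weights) / weights.sum())
    return F.cross_entropy(logits, answer_ids, weight=weights)
```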
arXiv Detail & Related papers (2020-10-30T00:57:17Z)
- Contrast and Classify: Training Robust VQA Models [60.80627814762071]
We propose a novel training paradigm (ConClaT) that optimizes both cross-entropy and contrastive losses.
We find that optimizing both losses -- either alternately or jointly -- is key to effective training.
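A hedged sketch of how cross-entropy and a contrastive loss can be combined jointly or alternately is shown below; it is not the ConClaT implementation, and `contrastive_loss`, the NT-Xent-style form, `alpha`, and the per-step alternation rule are illustrative assumptions.

```python
# Illustrative sketch: combine a cross-entropy answer loss with a simple contrastive
# loss over joint embeddings, either jointly (weighted sum) or alternately (per step).
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) embeddings of two augmented views of the same inputs."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature              # (batch, batch) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)         # positives lie on the diagonal

def training_loss(step, answer_logits, answer_ids, z1, z2, joint=True, alpha=0.5):
    ce = F.cross_entropy(answer_logits, answer_ids)
    con = contrastive_loss(z1, z2)
    if joint:
        return ce + alpha * con                     # optimize both losses jointly
    return ce if step % 2 == 0 else con             # or alternate between them
```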
arXiv Detail & Related papers (2020-10-13T00:23:59Z)
- Reducing Language Biases in Visual Question Answering with Visually-Grounded Question Encoder [12.56413718364189]
We propose a novel model-agnostic question encoder, the Visually-Grounded Question Encoder (VGQE), for VQA.
VGQE utilizes both visual and language modalities equally while encoding the question.
We demonstrate the effect of VGQE on three recent VQA models and achieve state-of-the-art results.
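A minimal sketch of the underlying idea, injecting pooled visual context into every step of the question encoder, is given below; the GRU-based class, mean-pooled image features, and dimensions are assumptions, and the actual VGQE architecture differs.

```python
# Minimal sketch of the idea (not the actual VGQE architecture): feed an attended or
# pooled visual vector into every step of the question encoder instead of encoding
# the question from words alone.
import torch
import torch.nn as nn

class VisuallyGroundedQuestionEncoder(nn.Module):
    def __init__(self, vocab_size, word_dim=300, vis_dim=2048, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.vis_proj = nn.Linear(vis_dim, word_dim)          # project image features
        self.gru = nn.GRU(2 * word_dim, hidden_dim, batch_first=True)

    def forward(self, question_tokens, image_features):
        """question_tokens: (batch, seq_len) word ids;
        image_features: (batch, num_regions, vis_dim) region features."""
        words = self.embed(question_tokens)                    # (batch, seq_len, word_dim)
        vis = self.vis_proj(image_features.mean(dim=1))        # pooled visual context
        vis = vis.unsqueeze(1).expand(-1, words.size(1), -1)   # repeat per word step
        fused = torch.cat([words, vis], dim=-1)                # word + visual input
        _, hidden = self.gru(fused)
        return hidden.squeeze(0)                               # (batch, hidden_dim)
```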
arXiv Detail & Related papers (2020-07-13T05:36:36Z)
- In Defense of Grid Features for Visual Question Answering [65.71985794097426]
We revisit grid features for visual question answering (VQA) and find they can work surprisingly well.
We verify that this observation holds true across different VQA models and generalizes well to other tasks like image captioning.
We learn VQA models end-to-end, from pixels directly to answers, and show that strong performance is achievable without using any region annotations in pre-training.
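A hedged sketch of the grid-feature setup follows: the final convolutional feature map of a CNN is flattened into a grid of visual tokens in place of pre-extracted region features. The ResNet-50 backbone, input size, and helper `grid_features` are illustrative assumptions, not the authors' pipeline.

```python
# Hedged sketch: use the final convolutional feature map of a CNN as a grid of
# visual tokens instead of pre-extracted region (bounding-box) features.
import torch
import torchvision

backbone = torchvision.models.resnet50(weights=None)
backbone = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc

def grid_features(images):
    """images: (batch, 3, H, W) -> (batch, num_grid_cells, channels)."""
    fmap = backbone(images)                        # (batch, 2048, H/32, W/32)
    b, c, h, w = fmap.shape
    return fmap.view(b, c, h * w).transpose(1, 2)  # one token per grid cell

tokens = grid_features(torch.randn(2, 3, 448, 448))  # e.g. (2, 196, 2048)
```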
arXiv Detail & Related papers (2020-01-10T18:59:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site makes no guarantees about the quality of the information presented and accepts no responsibility for any consequences of its use.