SOrT-ing VQA Models: Contrastive Gradient Learning for Improved Consistency
- URL: http://arxiv.org/abs/2010.10038v2
- Date: Tue, 1 Dec 2020 02:11:13 GMT
- Title: SOrT-ing VQA Models: Contrastive Gradient Learning for Improved Consistency
- Authors: Sameer Dharur, Purva Tendulkar, Dhruv Batra, Devi Parikh, Ramprasaath
R. Selvaraju
- Abstract summary: We present a gradient-based interpretability approach to determine the questions most strongly correlated with the reasoning question on an image.
Next, we propose a contrastive gradient learning based approach called Sub-question Oriented Tuning (SOrT), which encourages models to rank relevant sub-questions higher than irrelevant questions for an <image, reasoning-question> pair.
We show that SOrT improves model consistency by up to 6.5 percentage points over existing baselines, while also improving visual grounding.
- Score: 64.67155167618894
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research in Visual Question Answering (VQA) has revealed
state-of-the-art models to be inconsistent in their understanding of the world
-- they answer seemingly difficult questions requiring reasoning correctly but
get simpler associated sub-questions wrong. These sub-questions pertain to
lower level visual concepts in the image that models ideally should understand
to be able to answer the higher level question correctly. To address this, we
first present a gradient-based interpretability approach to determine the
questions most strongly correlated with the reasoning question on an image, and
use this to evaluate VQA models on their ability to identify the relevant
sub-questions needed to answer a reasoning question. Next, we propose a
contrastive gradient learning based approach called Sub-question Oriented
Tuning (SOrT) which encourages models to rank relevant sub-questions higher
than irrelevant questions for an <image, reasoning-question> pair. We show that
SOrT improves model consistency by up to 6.5 percentage points over existing
while also improving visual grounding.
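To make the objective concrete, here is a minimal PyTorch sketch of a SOrT-style contrastive gradient loss. The hypothetical `vqa_model(image_feats, question)` callable, the flattened Grad-CAM-style gradients, the cosine-similarity ranking, and the hinge margin are all illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch of a SOrT-style contrastive gradient loss (assumptions:
# a `vqa_model(image_feats, question)` callable returning answer scores;
# cosine similarity between flattened Grad-CAM-style gradients; hinge margin).
import torch
import torch.nn.functional as F

def answer_gradient(vqa_model, image_feats, question):
    """Gradient of the top answer score w.r.t. the image features --
    a Grad-CAM-style importance vector for this question."""
    feats = image_feats.clone().requires_grad_(True)
    scores = vqa_model(feats, question)                  # (num_answers,)
    (grad,) = torch.autograd.grad(scores.max(), feats, create_graph=True)
    return grad.flatten()

def sort_loss(vqa_model, image_feats, reasoning_q, sub_q, irrelevant_q,
              margin=0.1):
    """Rank the relevant sub-question's gradient alignment with the
    reasoning question above an irrelevant question's alignment."""
    g_reason = answer_gradient(vqa_model, image_feats, reasoning_q)
    sim_sub = F.cosine_similarity(
        g_reason, answer_gradient(vqa_model, image_feats, sub_q), dim=0)
    sim_irr = F.cosine_similarity(
        g_reason, answer_gradient(vqa_model, image_feats, irrelevant_q), dim=0)
    # Hinge penalty whenever the irrelevant question outranks the sub-question.
    return F.relu(margin - (sim_sub - sim_irr))
```

In training, a ranking term like this would be added to the usual VQA cross-entropy loss, so the model keeps answering correctly while its gradients learn to rank relevant sub-questions above irrelevant ones.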
Related papers
- Improving Visual Question Answering Models through Robustness Analysis and In-Context Learning with a Chain of Basic Questions [70.70725223310401]
This work proposes a new method that uses semantically related questions, referred to as basic questions, as noise to evaluate the robustness of VQA models.
The experimental results demonstrate that the proposed evaluation method effectively analyzes the robustness of VQA models.
arXiv Detail & Related papers (2023-04-06T15:32:35Z)
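As a rough illustration of the evaluation idea in the entry above, the sketch below appends basic questions to the main question as noise and checks whether a previously correct answer flips. The `vqa_model(image, question) -> answer` callable and the plain concatenation are assumptions for illustration, not the paper's exact ranking-based procedure.

```python
# A rough sketch of a basic-question robustness check (assumption: a
# `vqa_model(image, question) -> answer` callable; plain concatenation
# stands in for the paper's ranked selection of basic questions).
def noise_breaks_model(vqa_model, image, main_q, basic_qs, ground_truth):
    """True if appending semantically related 'basic questions' as noise
    flips an answer that was correct on the clean question."""
    clean_correct = vqa_model(image, main_q) == ground_truth
    noisy_question = " ".join([main_q] + basic_qs)  # basic questions as noise
    noisy_correct = vqa_model(image, noisy_question) == ground_truth
    return clean_correct and not noisy_correct
```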
- COIN: Counterfactual Image Generation for VQA Interpretation [5.994412766684842]
We introduce an interpretability approach for VQA models by generating counterfactual images.
In addition to interpreting the results of VQA models on single images, the obtained results and the accompanying discussion provide an extensive explanation of VQA models' behaviour.
arXiv Detail & Related papers (2022-01-10T13:51:35Z)
- Loss re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
arXiv Detail & Related papers (2020-10-30T00:57:17Z)
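A minimal sketch of what loss re-scaling from a class-imbalance view can look like for VQA answer classification; the inverse-frequency weights below are a common illustrative choice, not necessarily the paper's exact scheme.

```python
# A minimal sketch of class-imbalance loss re-scaling for VQA answers
# (assumption: inverse-frequency weights; the paper's exact scheme may differ).
import torch
import torch.nn.functional as F

def rescaled_vqa_loss(logits, targets, answer_counts):
    """Cross-entropy with per-answer weights inversely proportional to how
    often each answer occurs, so frequent 'prior' answers are down-weighted."""
    weights = 1.0 / answer_counts.float().clamp(min=1.0)
    weights = weights * (len(weights) / weights.sum())  # normalize to mean 1
    return F.cross_entropy(logits, targets, weight=weights)

# Illustrative usage: a heavily skewed 3-answer training distribution.
logits = torch.randn(8, 3)                       # batch of 8, 3 answers
targets = torch.randint(0, 3, (8,))
counts = torch.tensor([9000, 600, 400])          # "yes" dominates training
loss = rescaled_vqa_loss(logits, targets, counts)
```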
- Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning" [49.76230210108583]
We propose a framework to isolate and evaluate the reasoning aspect of visual question answering (VQA) separately from its perception.
We also propose a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception.
On the challenging GQA dataset, this framework is used to perform in-depth, disentangled comparisons between well-known VQA models.
arXiv Detail & Related papers (2020-06-20T08:48:29Z)
- Counterfactual Samples Synthesizing for Robust Visual Question Answering [104.72828511083519]
We propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme.
CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions.
We achieve a record-breaking performance of 58.95% on VQA-CP v2, with 6.5% gains.
arXiv Detail & Related papers (2020-03-14T08:34:31Z)
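The sketch below illustrates the question side of the CSS idea from the entry above: masking the most answer-critical words to synthesize a counterfactual training sample. The token-level criticality scores are assumed to come from some attribution method, and all names and values are illustrative; masking critical image objects follows the same pattern.

```python
# A sketch of CSS-style counterfactual question synthesis (assumptions:
# per-token criticality scores from some attribution method; names and
# values are illustrative).
def counterfactual_question(tokens, criticality, k=1, mask="[MASK]"):
    """Mask the k most answer-critical words in the question; the training
    scheme then supervises the model NOT to give the original answer."""
    top_k = set(sorted(range(len(tokens)),
                       key=lambda i: criticality[i], reverse=True)[:k])
    return [mask if i in top_k else tok for i, tok in enumerate(tokens)]

tokens = ["what", "color", "is", "the", "umbrella"]
scores = [0.05, 0.70, 0.01, 0.04, 0.20]          # "color" is most critical
print(counterfactual_question(tokens, scores))
# ['what', '[MASK]', 'is', 'the', 'umbrella']
```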
- SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions [66.86887670416193]
We show that state-of-the-art VQA models have comparable performance in answering perception and reasoning questions, but suffer from consistency problems.
To address this shortcoming, we propose an approach called Sub-Question-aware Network Tuning (SQuINT)
We show that SQuINT improves model consistency by 5% and marginally improves performance on the Reasoning questions in VQA, while also producing better attention maps.
arXiv Detail & Related papers (2020-01-20T01:02:36Z)
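Since consistency is the headline metric for both SQuINT and SOrT, here is a sketch of one plausible way to compute it: among reasoning questions answered correctly, the fraction whose associated sub-questions are also all answered correctly. The record field names are assumptions, and the papers' exact formulation may differ.

```python
# A sketch of a consistency metric (assumption: one plausible definition;
# field names on the input records are illustrative).
def consistency(results):
    """results: records with a bool `reasoning_correct` and a list of
    bools `subq_correct`, one entry per associated sub-question."""
    correct = [r for r in results if r["reasoning_correct"]]
    if not correct:
        return 0.0
    consistent = sum(all(r["subq_correct"]) for r in correct)
    return consistent / len(correct)
```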