SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions
- URL: http://arxiv.org/abs/2001.06927v2
- Date: Tue, 16 Jun 2020 17:54:16 GMT
- Title: SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions
- Authors: Ramprasaath R. Selvaraju, Purva Tendulkar, Devi Parikh, Eric Horvitz,
Marco Ribeiro, Besmira Nushi, Ece Kamar
- Abstract summary: We show that state-of-the-art VQA models have comparable performance in answering perception and reasoning questions, but suffer from consistency problems.
To address this shortcoming, we propose an approach called Sub-Question Importance-aware Network Tuning (SQuINT).
We show that SQuINT improves model consistency by ~5% and marginally improves performance on the Reasoning questions in VQA, while also producing better attention maps.
- Score: 66.86887670416193
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing VQA datasets contain questions with varying levels of complexity.
While the majority of questions in these datasets require perception for
recognizing existence, properties, and spatial relationships of entities, a
significant portion of questions pose challenges that correspond to reasoning
tasks - tasks that can only be answered through a synthesis of perception and
knowledge about the world, logic, and/or reasoning. Analyzing performance
across this distinction allows us to notice when existing VQA models have
consistency issues; they answer the reasoning questions correctly but fail on
associated low-level perception questions. For example, in Figure 1, models
answer the complex reasoning question "Is the banana ripe enough to eat?"
correctly, but fail on the associated perception question "Are the bananas
mostly green or yellow?", indicating that the model likely answered the
reasoning question correctly but for the wrong reason. We quantify the extent
to which this phenomenon occurs by creating a new Reasoning split of the VQA
dataset and collecting VQA-introspect, a new dataset which consists of 238K
new perception questions that serve as sub-questions corresponding to the set
of perceptual tasks needed to effectively answer the complex reasoning
questions in the Reasoning split. Our evaluation shows that state-of-the-art
VQA models have comparable performance in answering perception and reasoning
questions, but suffer from consistency problems. To address this shortcoming,
we propose an approach called Sub-Question Importance-aware Network Tuning
(SQuINT), which encourages the model to attend to the same parts of the image
when answering the reasoning question and the perception sub-question. We show
that SQuINT improves model consistency by ~5%, marginally improves performance
on the Reasoning questions in VQA, and produces better attention maps.
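As a concrete reading of the consistency measure described above, here is a minimal Python sketch, assuming hypothetical per-question correctness flags; the function name `consistency` and the data layout are illustrative assumptions, not the paper's released evaluation code.

```python
# Hypothetical sketch of the consistency check implied by the abstract:
# among reasoning questions answered correctly, count how often all of the
# associated perception sub-questions are also answered correctly.
# The names and data layout below are assumptions, not the authors' code.

def consistency(results):
    """results: list of dicts with keys
       'reasoning_correct' (bool) and 'sub_correct' (list of bool),
       one entry per reasoning question in the Reasoning split."""
    answered = [r for r in results if r["reasoning_correct"]]
    if not answered:
        return 0.0
    consistent = sum(all(r["sub_correct"]) for r in answered)
    return consistent / len(answered)


# Example: the banana case from the abstract. The reasoning question is
# answered correctly, but the "mostly green or yellow?" sub-question is
# missed, so the pair is inconsistent and the score is 0.0.
print(consistency([{"reasoning_correct": True, "sub_correct": [False]}]))
```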
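The abstract describes SQuINT only at this high level (attend to the same image regions when answering a reasoning question and its perception sub-question), so the following PyTorch sketch is an assumption-laden illustration rather than the paper's actual method: it posits a hypothetical `model(image, question)` interface returning answer logits plus a spatial attention map, and uses a simple mean-squared-error attention-alignment term weighted by a hypothetical `lambda_att`.

```python
import torch.nn.functional as F


def squint_style_loss(model, image, reasoning_q, sub_q,
                      reasoning_ans, sub_ans, lambda_att=1.0):
    """Fine-tuning objective in the spirit of SQuINT: answer both the
    reasoning question and its perception sub-question, while pulling
    their attention maps toward each other.

    Assumes `model(image, question)` returns (answer_logits, attention),
    where `attention` is a tensor over image regions. This interface and
    the MSE alignment term are illustrative assumptions.
    """
    logits_r, att_r = model(image, reasoning_q)
    logits_s, att_s = model(image, sub_q)

    # Standard VQA answer-classification losses for both questions.
    answer_loss = (F.cross_entropy(logits_r, reasoning_ans) +
                   F.cross_entropy(logits_s, sub_ans))

    # Alignment term: penalize disagreement between the attention placed on
    # image regions for the reasoning question vs. the sub-question.
    attention_loss = F.mse_loss(att_r, att_s)

    return answer_loss + lambda_att * attention_loss
```

The alignment term could equally be a KL divergence between normalized attention maps; the point of the sketch is simply that both questions are answered jointly while their attention is encouraged to agree.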
Related papers
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Improving Visual Question Answering Models through Robustness Analysis and In-Context Learning with a Chain of Basic Questions [70.70725223310401]
This work proposes a new method that uses semantically related questions, referred to as basic questions, as noise to evaluate the robustness of VQA models.
The experimental results demonstrate that the proposed evaluation method effectively analyzes the robustness of VQA models.
arXiv Detail & Related papers (2023-04-06T15:32:35Z)
- Co-VQA: Answering by Interactive Sub Question Sequence [18.476819557695087]
This paper proposes a conversation-based VQA framework, which consists of three components: Questioner, Oracle, and Answerer.
To perform supervised learning for each model, we introduce a well-designed method to build a sub-question sequence (SQS) for each question on the VQA 2.0 and VQA-CP v2 datasets.
arXiv Detail & Related papers (2022-04-02T15:09:16Z)
- NOAHQA: Numerical Reasoning with Interpretable Graph Question Answering Dataset [26.782937852417454]
We introduce NOAHQA, a bilingual QA dataset with questions requiring numerical reasoning with compound mathematical expressions.
We evaluate state-of-the-art QA models trained on existing QA datasets and show that the best among them achieves only a 55.5 exact-match score on NOAHQA.
We also present a new QA model for generating a reasoning graph, whose reasoning-graph metric still shows a large gap compared with humans.
arXiv Detail & Related papers (2021-09-22T09:17:09Z)
- Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate the question-answer pair based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z)
- SOrT-ing VQA Models: Contrastive Gradient Learning for Improved Consistency [64.67155167618894]
We present a gradient-based interpretability approach to determine the questions most strongly correlated with the reasoning question on an image.
Next, we propose a contrastive gradient learning based approach called Sub-question Oriented Tuning (SOrT), which encourages models to rank relevant sub-questions higher than irrelevant questions for an <image, reasoning-question> pair.
We show that SOrT improves model consistency by up to 6.5 percentage points over existing baselines, while also improving visual grounding.
arXiv Detail & Related papers (2020-10-20T05:15:48Z)
- Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning" [49.76230210108583]
We propose a framework to isolate and evaluate the reasoning aspect of visual question answering (VQA) separately from its perception.
We also propose a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception.
On the challenging GQA dataset, this framework is used to perform in-depth, disentangled comparisons between well-known VQA models.
arXiv Detail & Related papers (2020-06-20T08:48:29Z)
- Understanding Knowledge Gaps in Visual Question Answering: Implications for Gap Identification and Testing [20.117014315684287]
We use a taxonomy of Knowledge Gaps (KGs) to tag questions with one or more types of KGs.
We then examine the skew in the distribution of questions for each KG and generate new questions for under-represented KGs.
These new questions can be added to existing VQA datasets to increase the diversity of questions and reduce the skew.
arXiv Detail & Related papers (2020-04-08T00:27:43Z)