On the Role of Visual Grounding in VQA
- URL: http://arxiv.org/abs/2406.18253v1
- Date: Wed, 26 Jun 2024 10:57:52 GMT
- Title: On the Role of Visual Grounding in VQA
- Authors: Daniel Reich, Tanja Schultz
- Abstract summary: "Visual Grounding" in VQA refers to a model's proclivity to infer answers based on question-relevant image regions.
DNN-based VQA models are notorious for bypassing VG by way of shortcut (SC) learning.
We propose a novel theoretical framework called "Visually Grounded Reasoning" (VGR) that uses the concepts of VG and Reasoning to describe VQA inference.
- Score: 19.977539219231932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Grounding (VG) in VQA refers to a model's proclivity to infer answers based on question-relevant image regions. Conceptually, VG is an axiomatic requirement of the VQA task. In practice, however, DNN-based VQA models are notorious for bypassing VG by way of shortcut (SC) learning without suffering obvious performance losses in standard benchmarks. To uncover the impact of SC learning, Out-of-Distribution (OOD) tests have been proposed that expose a lack of VG with low accuracy. These tests have since been at the center of VG research and served as the basis for various investigations into VG's impact on accuracy. However, the role of VG in VQA is still not fully understood and has not yet been properly formalized. In this work, we seek to clarify VG's role in VQA by formalizing it on a conceptual level. We propose a novel theoretical framework called "Visually Grounded Reasoning" (VGR) that uses the concepts of VG and Reasoning to describe VQA inference in ideal OOD testing. By consolidating fundamental insights into VG's role in VQA, VGR helps to reveal rampant VG-related SC exploitation in OOD testing, which explains why the relationship between VG and OOD accuracy has been difficult to define. Finally, we propose an approach to create OOD tests that properly emphasize a requirement for VG, and show how to improve performance on them.
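To make the notion of an OOD test concrete: such tests (e.g., VQA-CP-style splits) shift the answer prior between training and testing, so that a model relying on a linguistic shortcut rather than VG scores well in-distribution but poorly OOD. The following toy sketch illustrates this effect; the data and the prior-only "model" are hypothetical stand-ins, not the evaluation protocol of the paper.

```python
from collections import Counter

# Toy data: (question type, ground-truth answer) pairs.
# In the training/ID split the answer prior per question type is skewed one way;
# in the OOD split (VQA-CP-style) that prior is deliberately shifted.
train = [("how many", "2")] * 80 + [("how many", "3")] * 20
ood_test = [("how many", "3")] * 80 + [("how many", "2")] * 20

def prior_only_model(train_samples):
    """A shortcut 'model' that ignores the image and always predicts the most
    frequent training answer for each question type."""
    counts = {}
    for qtype, ans in train_samples:
        counts.setdefault(qtype, Counter())[ans] += 1
    return {qtype: c.most_common(1)[0][0] for qtype, c in counts.items()}

def accuracy(predictions, samples):
    return sum(predictions[qtype] == ans for qtype, ans in samples) / len(samples)

shortcut = prior_only_model(train)
print("ID accuracy of the shortcut:", accuracy(shortcut, train))      # 0.8 (looks fine)
print("OOD accuracy of the shortcut:", accuracy(shortcut, ood_test))  # 0.2 (exposed)
```

A visually grounded model, by contrast, would keep its accuracy under the shifted prior, which is what such tests are meant to reward.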
Related papers
- Secure Video Quality Assessment Resisting Adversarial Attacks [14.583834512620024]
Recent studies have revealed the vulnerability of existing VQA (here, Video Quality Assessment) models against adversarial attacks.
This paper makes a first attempt at investigating general adversarial defense principles, aiming to endow existing VQA models with security.
We present a novel VQA framework from the security-oriented perspective, termed SecureVQA.
arXiv Detail & Related papers (2024-10-09T13:27:06Z)
- Uncovering the Full Potential of Visual Grounding Methods in VQA [23.600816131032936]
VG-methods attempt to improve Visual Question Answering (VQA) performance by strengthening a model's reliance on question-relevant visual information.
Training and testing of VG-methods are performed with largely inaccurate data, which obstructs proper assessment of their potential benefits.
Our experiments show that these methods can be much more effective when evaluation conditions are corrected.
arXiv Detail & Related papers (2024-01-15T16:21:19Z)
- Exploring Question Decomposition for Zero-Shot VQA [99.32466439254821]
We investigate a question decomposition strategy for visual question answering.
We show that naive application of model-written decompositions can hurt performance.
We introduce a model-driven selective decomposition approach for second-guessing predictions and correcting errors.
arXiv Detail & Related papers (2023-10-25T23:23:57Z)
- Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision [24.90534567531536]
We propose an Iterative Robust Visual Grounding (IR-VG) framework with Masked Reference based Centerpoint Supervision (MRCS).
The proposed framework is evaluated on five regular VG datasets and two newly constructed robust VG datasets.
arXiv Detail & Related papers (2023-07-23T17:55:24Z)
- Measuring Faithful and Plausible Visual Grounding in VQA [23.717744098159717]
Metrics for Visual Grounding (VG) in Visual Question Answering (VQA) systems aim to measure a system's reliance on relevant parts of the image when inferring an answer to the given question.
Lack of VG has been a common problem among state-of-the-art VQA systems and can manifest in over-reliance on irrelevant image parts or a disregard for the visual modality entirely.
We propose a new VG metric that captures whether a model (a) identifies question-relevant objects in the scene, and (b) actually relies on the information contained in those objects when producing its answer.
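Purely as an illustration of such a two-part check (not the metric actually proposed in the paper), the sketch below scores (a) the overlap between a model's attended objects and annotated question-relevant objects and (b) whether removing those objects flips the model's answer; all inputs are hypothetical.

```python
def vg_score(attended_objs, relevant_objs, answer_full, answer_without_relevant):
    """Toy two-part grounding check (illustrative only).

    Part (a): does the model attend to the annotated question-relevant objects?
    Part (b): does removing those objects change the model's answer,
              i.e., did the answer rely on their information?
    """
    attended = set(attended_objs)
    relevant = set(relevant_objs)
    identifies = len(attended & relevant) / max(len(relevant), 1)   # overlap in [0, 1]
    relies = float(answer_full != answer_without_relevant)          # 1.0 if the answer flips
    return identifies, relies

# Example: the model attends to the right object and its answer changes when it is masked.
print(vg_score({"dog", "frisbee"}, {"dog"}, "brown", "unknown"))  # (1.0, 1.0)
```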
arXiv Detail & Related papers (2023-05-24T10:58:02Z)
- Improving Visual Question Answering Models through Robustness Analysis and In-Context Learning with a Chain of Basic Questions [70.70725223310401]
This work proposes a new method that uses semantically related questions, referred to as basic questions, as noise to evaluate the robustness of VQA models.
The experimental results demonstrate that the proposed evaluation method effectively analyzes the robustness of VQA models.
arXiv Detail & Related papers (2023-04-06T15:32:35Z)
- Visually Grounded VQA by Lattice-based Retrieval [24.298908211088072]
Visual Grounding (VG) in Visual Question Answering (VQA) systems describes how well a system manages to tie a question and its answer to relevant image regions.
In this work, we break with the dominant VQA modeling paradigm of classification and investigate VQA from the standpoint of an information retrieval task.
Our system operates over a weighted, directed, acyclic graph, a.k.a. "lattice", which is derived from the scene graph of a given image in conjunction with region-referring expressions extracted from the question.
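As a rough illustration of the lattice idea (an assumption-laden sketch, not the authors' construction), the snippet below encodes a tiny weighted DAG whose nodes are scene-graph regions matched to question phrases and searches for the highest-scoring path through it; all node names and weights are made up.

```python
# Toy lattice: a weighted, directed, acyclic graph whose nodes are scene-graph
# regions matched against phrases from the question, and whose edge weights are
# hypothetical match scores.
lattice = {
    "<start>": {"region:dog": 0.9, "region:cat": 0.2},
    "region:dog": {"region:frisbee": 0.8},
    "region:cat": {"region:frisbee": 0.3},
    "region:frisbee": {"<answer:catching>": 0.7},
    "<answer:catching>": {},
}

def best_path(graph, node="<start>"):
    """Return the highest-scoring path through the DAG (score = sum of edge weights)."""
    if not graph[node]:
        return [node], 0.0
    best = ([], float("-inf"))
    for nxt, weight in graph[node].items():
        path, score = best_path(graph, nxt)
        if weight + score > best[1]:
            best = ([node] + path, weight + score)
    return best

path, score = best_path(lattice)
print(path, round(score, 2))
# ['<start>', 'region:dog', 'region:frisbee', '<answer:catching>'] 2.4
```

In such a retrieval view, the answer corresponds to the terminal node of a well-scoring path rather than to a classification decision.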
arXiv Detail & Related papers (2022-11-15T12:12:08Z)
- Introspective Distillation for Robust Question Answering [70.18644911309468]
Question answering (QA) models are well-known to exploit data bias, e.g., the language prior in visual QA and the position bias in reading comprehension.
Recent debiasing methods achieve good out-of-distribution (OOD) generalizability at a considerable sacrifice of in-distribution (ID) performance.
We present a novel debiasing method called Introspective Distillation (IntroD) to make the best of both worlds for QA.
arXiv Detail & Related papers (2021-11-01T15:30:15Z)
- Counterfactual Samples Synthesizing and Training for Robust Visual Question Answering [59.20766562530209]
VQA models still tend to capture superficial linguistic correlations in the training set.
Recent VQA works introduce an auxiliary question-only model to regularize the training of targeted VQA models.
We propose a novel model-agnostic Counterfactual Samples Synthesizing and Training (CSST) strategy.
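In the general spirit of counterfactual sample synthesis (a hedged sketch of the idea, not the CSST procedure from the paper), one can mask a critical question word or critical image objects and attach supervision indicating that the original answer should no longer be predicted; the helper below is purely illustrative.

```python
def synthesize_counterfactual(question_tokens, critical_token, original_answer):
    """Toy counterfactual sample: mask the critical word and record that the
    original answer should NOT be predictable from the remaining shortcut.
    Illustrative sketch only."""
    masked = ["[MASK]" if tok == critical_token else tok for tok in question_tokens]
    return {"question": masked, "not_answer": original_answer}

print(synthesize_counterfactual(["what", "color", "is", "the", "banana"], "banana", "yellow"))
# {'question': ['what', 'color', 'is', 'the', '[MASK]'], 'not_answer': 'yellow'}
```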
arXiv Detail & Related papers (2021-10-03T14:31:46Z)
- Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules [85.98177341704675]
The problem of grounding VQA tasks has seen increased attention in the research community recently.
We propose a visual capsule module with a query-based selection mechanism of capsule features.
We show that integrating the proposed capsule module in existing VQA systems significantly improves their performance on the weakly supervised grounding task.
arXiv Detail & Related papers (2021-05-11T07:45:32Z)
- MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering [58.30291671877342]
We present MUTANT, a training paradigm that exposes the model to perceptually similar, yet semantically distinct mutations of the input.
MUTANT establishes a new state-of-the-art accuracy on VQA-CP with a 10.57% improvement.
arXiv Detail & Related papers (2020-09-18T00:22:54Z)