Measuring Faithful and Plausible Visual Grounding in VQA
- URL: http://arxiv.org/abs/2305.15015v2
- Date: Sat, 14 Oct 2023 15:20:13 GMT
- Title: Measuring Faithful and Plausible Visual Grounding in VQA
- Authors: Daniel Reich, Felix Putze, Tanja Schultz
- Abstract summary: Metrics for Visual Grounding (VG) in Visual Question Answering (VQA) systems aim to measure a system's reliance on relevant parts of the image when inferring an answer to the given question.
Lack of VG has been a common problem among state-of-the-art VQA systems and can manifest in over-reliance on irrelevant image parts or a disregard for the visual modality entirely.
We propose a new VG metric that captures if a model a) identifies question-relevant objects in the scene, and b) actually relies on the information contained in the relevant objects when producing its answer.
- Score: 23.717744098159717
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Metrics for Visual Grounding (VG) in Visual Question Answering (VQA) systems
primarily aim to measure a system's reliance on relevant parts of the image
when inferring an answer to the given question. Lack of VG has been a common
problem among state-of-the-art VQA systems and can manifest in over-reliance on
irrelevant image parts or a disregard for the visual modality entirely.
Although inference capabilities of VQA models are often illustrated with a few
qualitative examples, most systems are not quantitatively assessed for
their VG properties. We believe an easily calculated criterion for
meaningfully measuring a system's VG can help remedy this shortcoming, as well
as add another valuable dimension to model evaluations and analysis. To this
end, we propose a new VG metric that captures if a model a) identifies
question-relevant objects in the scene, and b) actually relies on the
information contained in the relevant objects when producing its answer, i.e.,
if its visual grounding is both "faithful" and "plausible". Our metric, called
"Faithful and Plausible Visual Grounding" (FPVG), is straightforward to
determine for most VQA model designs.
We give a detailed description of FPVG and evaluate several reference systems
spanning various VQA architectures. Code to support the metric calculations on
the GQA data set is available on GitHub.
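The released code covers the GQA-specific details. As a rough illustration of the idea described in the abstract, the sketch below assumes a simplified reading of the metric: a question counts as grounded if the model's answer is preserved when only the question-relevant objects are supplied and changes when only the irrelevant objects are supplied. The `Sample` container, the `predict` callable, and the aggregation into scores are hypothetical stand-ins for illustration, not the paper's reference implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Hypothetical types for illustration only (not from the FPVG repository):
#   predict(question, objects) -> answer string produced by a VQA model
#   relevant / irrelevant     -> object features judged (ir)relevant to the question
#   gt_answer                 -> ground-truth answer from the dataset annotations

@dataclass
class Sample:
    question: str
    all_objects: Sequence
    relevant: Sequence      # question-relevant object features
    irrelevant: Sequence    # remaining object features
    gt_answer: str

def fpvg_report(samples: List[Sample],
                predict: Callable[[str, Sequence], str]) -> Dict[str, float]:
    """Classify each question by comparing answers under three visual inputs."""
    grounded = correct_and_grounded = 0
    for s in samples:
        a_all = predict(s.question, s.all_objects)   # reference run, all objects
        a_rel = predict(s.question, s.relevant)      # relevant objects only
        a_irr = predict(s.question, s.irrelevant)    # irrelevant objects only
        # Simplified grounding test: the answer should survive when only the
        # relevant objects are kept, and should change when only irrelevant
        # objects are given. Exact string matching stands in for answer comparison.
        is_grounded = (a_rel == a_all) and (a_irr != a_all)
        grounded += is_grounded
        correct_and_grounded += is_grounded and (a_all == s.gt_answer)
    n = max(len(samples), 1)
    return {
        "grounded": grounded / n,                    # fraction judged grounded
        "grounded_and_correct": correct_and_grounded / n,
    }
```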
Related papers
- VQA$^2$: Visual Question Answering for Video Quality Assessment [76.81110038738699]
Video Quality Assessment originally focused on quantitative video quality scoring.
It is now evolving towards more comprehensive visual quality understanding tasks.
We introduce the first visual question answering instruction dataset that focuses entirely on video quality assessment.
We conduct extensive experiments on both video quality scoring and video quality understanding tasks.
arXiv Detail & Related papers (2024-11-06T09:39:52Z) - Uncovering the Full Potential of Visual Grounding Methods in VQA [23.600816131032936]
VG methods attempt to improve Visual Question Answering (VQA) performance by strengthening a model's reliance on question-relevant visual information.
Training and testing of VG methods are performed with largely inaccurate data, which obstructs proper assessment of their potential benefits.
Our experiments show that these methods can be much more effective when evaluation conditions are corrected.
arXiv Detail & Related papers (2024-01-15T16:21:19Z) - GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection [51.43589678946244]
This paper explores the potential of VQA-oriented GPT-4V in the popular visual Anomaly Detection (AD) task.
It is the first to conduct qualitative and quantitative evaluations on the popular MVTec AD and VisA datasets.
arXiv Detail & Related papers (2023-11-05T10:01:18Z) - UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z) - Visual Question Answering in the Medical Domain [13.673890873313354]
We present a novel contrastive learning pretraining method to mitigate the problem of small datasets for the Med-VQA task.
Our proposed model obtained an accuracy of 60% on the VQA-Med 2019 test set, giving comparable results to other state-of-the-art Med-VQA models.
arXiv Detail & Related papers (2023-09-20T06:06:10Z) - PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [56.25766322554655]
Medical Visual Question Answering (MedVQA) presents a significant opportunity to enhance diagnostic accuracy and healthcare delivery.
We propose a generative-based model for medical visual understanding by aligning visual information from a pre-trained vision encoder with a large language model.
We train the proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD, SLAKE, and Image-Clef 2019.
arXiv Detail & Related papers (2023-05-17T17:50:16Z) - Visually Grounded VQA by Lattice-based Retrieval [24.298908211088072]
Visual Grounding (VG) in Visual Question Answering (VQA) systems describes how well a system manages to tie a question and its answer to relevant image regions.
In this work, we break with the dominant VQA modeling paradigm of classification and investigate VQA from the standpoint of an information retrieval task.
Our system operates over a weighted, directed, acyclic graph, a.k.a. "lattice", which is derived from the scene graph of a given image in conjunction with region-referring expressions extracted from the question.
arXiv Detail & Related papers (2022-11-15T12:12:08Z) - What's Different between Visual Question Answering for Machine "Understanding" Versus for Accessibility? [8.373151777137792]
In visual question answering (VQA), a machine must answer a question given an associated image.
We evaluate discrepancies between machine "understanding" datasets (VQA-v2) and accessibility datasets (VizWiz) by evaluating a variety of VQA models.
Based on our findings, we discuss opportunities and challenges in VQA for accessibility and suggest directions for future work.
arXiv Detail & Related papers (2022-10-26T18:23:53Z) - Human-Adversarial Visual Question Answering [62.30715496829321]
We benchmark state-of-the-art VQA models against human-adversarial examples.
We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples.
arXiv Detail & Related papers (2021-06-04T06:25:32Z) - Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules [85.98177341704675]
The problem of grounding VQA tasks has seen increased attention in the research community recently.
We propose a visual capsule module with a query-based selection mechanism of capsule features.
We show that integrating the proposed capsule module in existing VQA systems significantly improves their performance on the weakly supervised grounding task.
arXiv Detail & Related papers (2021-05-11T07:45:32Z)