Uncovering the Full Potential of Visual Grounding Methods in VQA
- URL: http://arxiv.org/abs/2401.07803v2
- Date: Thu, 15 Feb 2024 14:18:20 GMT
- Title: Uncovering the Full Potential of Visual Grounding Methods in VQA
- Authors: Daniel Reich, Tanja Schultz
- Abstract summary: VG-methods attempt to improve Visual Question Answering (VQA) performance by strengthening a model's reliance on question-relevant visual information.
Training and testing of VG-methods are performed with largely inaccurate data, which obstructs a proper assessment of their potential benefits.
Our experiments show that these methods can be much more effective when evaluation conditions are corrected.
- Score: 23.600816131032936
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Grounding (VG) methods in Visual Question Answering (VQA) attempt to
improve VQA performance by strengthening a model's reliance on
question-relevant visual information. The presence of such relevant information
in the visual input is typically assumed in training and testing. This
assumption, however, is inherently flawed when dealing with imperfect image
representations common in large-scale VQA, where the information carried by
visual features frequently deviates from expected ground-truth contents. As a
result, training and testing of VG-methods are performed with largely inaccurate
data, which obstructs proper assessment of their potential benefits. In this
study, we demonstrate that current evaluation schemes for VG-methods are
problematic due to the flawed assumption of availability of relevant visual
information. Our experiments show that these methods can be much more effective
when evaluation conditions are corrected. Code is provided on GitHub.
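To make the corrected evaluation idea concrete, here is a minimal, hypothetical sketch (not the authors' released code): test questions are split by whether all question-relevant ground-truth objects are actually covered by the detector's regions, and accuracy is then reported separately for each subset. The Sample fields, the IoU threshold, and the helper names are illustrative assumptions.
```python
# Minimal sketch (illustrative, not the authors' code): evaluate a VQA model
# separately on questions where the question-relevant objects are, or are not,
# actually present among the detected visual features.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    question_id: str
    relevant_gt_boxes: List[List[float]]  # ground-truth boxes of question-relevant objects (x1, y1, x2, y2)
    detected_boxes: List[List[float]]     # boxes produced by the object detector
    is_correct: bool                      # whether the VQA model answered this question correctly


def box_iou(a: List[float], b: List[float]) -> float:
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def relevant_info_available(s: Sample, iou_threshold: float = 0.5) -> bool:
    """True if every question-relevant object is matched by some detected region."""
    return all(
        any(box_iou(gt, det) >= iou_threshold for det in s.detected_boxes)
        for gt in s.relevant_gt_boxes
    )


def split_accuracy(samples: List[Sample]) -> Dict[str, float]:
    """Report accuracy separately for samples with and without relevant visual info."""
    def accuracy(subset: List[Sample]) -> float:
        return sum(s.is_correct for s in subset) / len(subset) if subset else float("nan")

    with_info = [s for s in samples if relevant_info_available(s)]
    without_info = [s for s in samples if not relevant_info_available(s)]
    return {
        "relevant_info_present": accuracy(with_info),
        "relevant_info_missing": accuracy(without_info),
    }
```
Under a split of this kind, the abstract's claim would correspond to VG-methods showing a clearer benefit on the subset where the relevant information is actually present than aggregate scores suggest.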
Related papers
- Right this way: Can VLMs Guide Us to See More to Answer Questions? [11.693356269848517]
In question-answering scenarios, humans assess whether the available information is sufficient and seek additional information if necessary.
In contrast, Vision Language Models (VLMs) typically generate direct, one-shot responses without evaluating the sufficiency of the information.
This study demonstrates the potential to narrow the gap between information assessment and acquisition in VLMs, bringing their performance closer to that of humans.
arXiv Detail & Related papers (2024-11-01T06:43:54Z)
- Visual Question Answering in the Medical Domain [13.673890873313354]
We present a novel contrastive learning pretraining method to mitigate the problem of small datasets for the Med-VQA task.
Our proposed model obtained an accuracy of 60% on the VQA-Med 2019 test set, giving comparable results to other state-of-the-art Med-VQA models.
arXiv Detail & Related papers (2023-09-20T06:06:10Z)
- Measuring Faithful and Plausible Visual Grounding in VQA [23.717744098159717]
Metrics for Visual Grounding (VG) in Visual Question Answering (VQA) systems aim to measure a system's reliance on relevant parts of the image when inferring an answer to the given question.
Lack of VG has been a common problem among state-of-the-art VQA systems and can manifest in over-reliance on irrelevant image parts or a disregard for the visual modality entirely.
We propose a new VG metric that captures whether a model (a) identifies question-relevant objects in the scene and (b) actually relies on the information contained in those objects when producing its answer; a rough illustrative sketch of such a two-part check appears after this list.
arXiv Detail & Related papers (2023-05-24T10:58:02Z)
- PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [56.25766322554655]
Medical Visual Question Answering (MedVQA) presents a significant opportunity to enhance diagnostic accuracy and healthcare delivery.
We propose a generative-based model for medical visual understanding by aligning visual information from a pre-trained vision encoder with a large language model.
We train the proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD, SLAKE, and Image-Clef 2019.
arXiv Detail & Related papers (2023-05-17T17:50:16Z)
- Visually Grounded VQA by Lattice-based Retrieval [24.298908211088072]
Visual Grounding (VG) in Visual Question Answering (VQA) systems describes how well a system manages to tie a question and its answer to relevant image regions.
In this work, we break with the dominant VQA modeling paradigm of classification and investigate VQA from the standpoint of an information retrieval task.
Our system operates over a weighted, directed, acyclic graph, a.k.a. "lattice", which is derived from the scene graph of a given image in conjunction with region-referring expressions extracted from the question.
arXiv Detail & Related papers (2022-11-15T12:12:08Z)
- Consistency-preserving Visual Question Answering in Medical Imaging [2.005299372367689]
Visual Question Answering (VQA) models take an image and a natural-language question as input and infer the answer to the question.
We propose a novel loss function and corresponding training procedure that allows the inclusion of relations between questions into the training process.
Our experiments show that our method outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2022-06-27T13:38:50Z)
- REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering [75.53187719777812]
This paper revisits visual representation in knowledge-based visual question answering (VQA).
We propose a new knowledge-based VQA method REVIVE, which tries to utilize the explicit information of object regions.
We achieve new state-of-the-art performance, i.e., 58.0% accuracy, surpassing the previous state-of-the-art method by a large margin.
arXiv Detail & Related papers (2022-06-02T17:59:56Z)
- Human-Adversarial Visual Question Answering [62.30715496829321]
We benchmark state-of-the-art VQA models against human-adversarial examples.
We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples.
arXiv Detail & Related papers (2021-06-04T06:25:32Z)
- Loss re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
arXiv Detail & Related papers (2020-10-30T00:57:17Z)
- MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering [58.30291671877342]
We present MUTANT, a training paradigm that exposes the model to perceptually similar, yet semantically distinct mutations of the input.
MUTANT establishes a new state-of-the-art accuracy on VQA-CP with a 10.57% improvement.
arXiv Detail & Related papers (2020-09-18T00:22:54Z)
- In Defense of Grid Features for Visual Question Answering [65.71985794097426]
We revisit grid features for visual question answering (VQA) and find they can work surprisingly well.
We verify that this observation holds true across different VQA models and generalizes well to other tasks like image captioning.
We learn VQA models end-to-end, from pixels directly to answers, and show that strong performance is achievable without using any region annotations in pre-training.
arXiv Detail & Related papers (2020-01-10T18:59:13Z)
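As noted in the "Measuring Faithful and Plausible Visual Grounding in VQA" entry above, a grounding check of that general two-part shape could be sketched as follows. This is an illustrative assumption rather than that paper's actual metric: the region-importance scores, the top-k cut-off, and the predict callback (which re-runs the model on a subset of regions) are all hypothetical.
```python
# Illustrative sketch only: one way a two-part visual-grounding check could be structured.
from typing import Callable, List, Sequence


def identifies_relevant_objects(importance: Sequence[float],
                                relevant_idx: Sequence[int],
                                top_k: int = 5) -> bool:
    """(a) Is at least one question-relevant region among the top-k most important regions?"""
    top = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)[:top_k]
    return any(i in top for i in relevant_idx)


def relies_on_relevant_objects(predict: Callable[[List[int]], str],
                               all_idx: List[int],
                               relevant_idx: List[int]) -> bool:
    """(b) Removing the relevant regions should change the answer, while removing
    an equal number of irrelevant regions should leave it unchanged."""
    baseline = predict(all_idx)
    without_relevant = predict([i for i in all_idx if i not in relevant_idx])
    irrelevant_idx = [i for i in all_idx if i not in relevant_idx]
    drop = set(irrelevant_idx[:len(relevant_idx)])
    without_irrelevant = predict([i for i in all_idx if i not in drop])
    return without_relevant != baseline and without_irrelevant == baseline


def is_grounded(importance: Sequence[float],
                relevant_idx: List[int],
                predict: Callable[[List[int]], str],
                all_idx: List[int]) -> bool:
    """A sample counts as visually grounded only if both checks pass."""
    return (identifies_relevant_objects(importance, relevant_idx)
            and relies_on_relevant_objects(predict, all_idx, relevant_idx))
```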