Uncovering the Full Potential of Visual Grounding Methods in VQA
- URL: http://arxiv.org/abs/2401.07803v2
- Date: Thu, 15 Feb 2024 14:18:20 GMT
- Title: Uncovering the Full Potential of Visual Grounding Methods in VQA
- Authors: Daniel Reich, Tanja Schultz
- Abstract summary: VG-methods attempt to improve Visual Question Answering (VQA) performance by strengthening a model's reliance on question-relevant visual information.
Training and testing of VG-methods is performed with largely inaccurate data, which obstructs proper assessment of their potential benefits.
Our experiments show that these methods can be much more effective when evaluation conditions are corrected.
- Score: 23.600816131032936
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Grounding (VG) methods in Visual Question Answering (VQA) attempt to
improve VQA performance by strengthening a model's reliance on
question-relevant visual information. The presence of such relevant information
in the visual input is typically assumed in training and testing. This
assumption, however, is inherently flawed when dealing with imperfect image
representations common in large-scale VQA, where the information carried by
visual features frequently deviates from expected ground-truth contents. As a
result, training and testing of VG-methods is performed with largely inaccurate
data, which obstructs proper assessment of their potential benefits. In this
study, we demonstrate that current evaluation schemes for VG-methods are
problematic due to the flawed assumption of availability of relevant visual
information. Our experiments show that these methods can be much more effective
when evaluation conditions are corrected. Code is provided on GitHub.
Related papers
- Visual Question Answering in the Medical Domain [13.673890873313354]
We present a novel contrastive learning pretraining method to mitigate the problem of small datasets for the Med-VQA task.
Our proposed model obtained an accuracy of 60% on the VQA-Med 2019 test set, giving comparable results to other state-of-the-art Med-VQA models.
arXiv Detail & Related papers (2023-09-20T06:06:10Z) - Measuring Faithful and Plausible Visual Grounding in VQA [23.717744098159717]
Metrics for Visual Grounding (VG) in Visual Question Answering (VQA) systems aim to measure a system's reliance on relevant parts of the image when inferring an answer to the given question.
Lack of VG has been a common problem among state-of-the-art VQA systems and can manifest in over-reliance on irrelevant image parts or a disregard for the visual modality entirely.
We propose a new VG metric that captures if a model a) identifies question-relevant objects in the scene, and b) actually relies on the information contained in the relevant objects when producing its answer.
arXiv Detail & Related papers (2023-05-24T10:58:02Z) - Visually Grounded VQA by Lattice-based Retrieval [24.298908211088072]
Visual Grounding (VG) in Visual Question Answering (VQA) systems describes how well a system manages to tie a question and its answer to relevant image regions.
In this work, we break with the dominant VQA modeling paradigm of classification and investigate VQA from the standpoint of an information retrieval task.
Our system operates over a weighted, directed, acyclic graph, a.k.a. "lattice", which is derived from the scene graph of a given image in conjunction with region-referring expressions extracted from the question.
arXiv Detail & Related papers (2022-11-15T12:12:08Z) - CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z) - Consistency-preserving Visual Question Answering in Medical Imaging [2.005299372367689]
Visual Question Answering (VQA) models take an image and a natural-language question as input and infer the answer to the question.
We propose a novel loss function and corresponding training procedure that allows the inclusion of relations between questions into the training process.
Our experiments show that our method outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2022-06-27T13:38:50Z) - REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual
Question Answering [75.53187719777812]
This paper revisits visual representation in knowledge-based visual question answering (VQA)
We propose a new knowledge-based VQA method REVIVE, which tries to utilize the explicit information of object regions.
We achieve new state-of-the-art performance, i.e., 58.0% accuracy, surpassing previous state-of-the-art method by a large margin.
arXiv Detail & Related papers (2022-06-02T17:59:56Z) - Human-Adversarial Visual Question Answering [62.30715496829321]
We benchmark state-of-the-art VQA models against human-adversarial examples.
We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples.
arXiv Detail & Related papers (2021-06-04T06:25:32Z) - Loss re-scaling VQA: Revisiting the LanguagePrior Problem from a
Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
arXiv Detail & Related papers (2020-10-30T00:57:17Z) - MUTANT: A Training Paradigm for Out-of-Distribution Generalization in
Visual Question Answering [58.30291671877342]
We present MUTANT, a training paradigm that exposes the model to perceptually similar, yet semantically distinct mutations of the input.
MUTANT establishes a new state-of-the-art accuracy on VQA-CP with a $10.57%$ improvement.
arXiv Detail & Related papers (2020-09-18T00:22:54Z) - Visual Grounding Methods for VQA are Working for the Wrong Reasons! [24.84797949716142]
We show that the performance improvements are not a result of improved visual grounding, but a regularization effect.
We propose a simpler regularization scheme that does not require any external annotations and yet achieves near state-of-the-art performance on VQA-CPv2.
arXiv Detail & Related papers (2020-04-12T21:45:23Z) - In Defense of Grid Features for Visual Question Answering [65.71985794097426]
We revisit grid features for visual question answering (VQA) and find they can work surprisingly well.
We verify that this observation holds true across different VQA models and generalizes well to other tasks like image captioning.
We learn VQA models end-to-end, from pixels directly to answers, and show that strong performance is achievable without using any region annotations in pre-training.
arXiv Detail & Related papers (2020-01-10T18:59:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.