Visually Grounded VQA by Lattice-based Retrieval
- URL: http://arxiv.org/abs/2211.08086v1
- Date: Tue, 15 Nov 2022 12:12:08 GMT
- Title: Visually Grounded VQA by Lattice-based Retrieval
- Authors: Daniel Reich, Felix Putze, Tanja Schultz
- Abstract summary: Visual Grounding (VG) in Visual Question Answering (VQA) systems describes how well a system manages to tie a question and its answer to relevant image regions.
In this work, we break with the dominant VQA modeling paradigm of classification and investigate VQA from the standpoint of an information retrieval task.
Our system operates over a weighted, directed, acyclic graph, a.k.a. "lattice", which is derived from the scene graph of a given image in conjunction with region-referring expressions extracted from the question.
- Score: 24.298908211088072
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Grounding (VG) in Visual Question Answering (VQA) systems describes
how well a system manages to tie a question and its answer to relevant image
regions. Systems with strong VG are considered intuitively interpretable and
suggest an improved scene understanding. While VQA accuracy has seen
impressive gains over the past few years, explicit improvements to VG
performance, and its evaluation, have often taken a back seat on the road to
overall accuracy gains. One cause of this lies in the predominant choice of
learning paradigm for VQA systems: training a discriminative classifier over a
predetermined set of answer options.
In this work, we break with the dominant VQA modeling paradigm of
classification and investigate VQA from the standpoint of an information
retrieval task. As such, the developed system directly ties VG into its core
search procedure. Our system operates over a weighted, directed, acyclic graph,
a.k.a. "lattice", which is derived from the scene graph of a given image in
conjunction with region-referring expressions extracted from the question.
We give a detailed analysis of our approach and discuss its distinctive
properties and limitations. Our approach achieves the strongest VG performance
among examined systems and exhibits exceptional generalization capabilities in
a number of scenarios.
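The core idea of retrieval over a lattice can be illustrated with a small sketch. The paper provides no code here; the toy scene graph, the parsed referring expressions, the overlap-based scoring, and the fully connected layered lattice below are hypothetical stand-ins, not the authors' implementation. The sketch only demonstrates the general principle: the answer is found by searching for the best-scoring path through a weighted DAG built from scene-graph objects and question-derived region-referring expressions.

```python
# Minimal, illustrative sketch of lattice-based answer retrieval.
# The toy scene graph, the parsed referring expressions, the overlap scoring
# and the layered lattice construction are assumptions for demonstration,
# NOT the authors' implementation.
from collections import defaultdict

# Toy scene graph: object id -> descriptive tokens (class and attributes).
scene_graph = {
    "obj1": {"man", "standing"},
    "obj2": {"umbrella", "red"},
    "obj3": {"umbrella", "black"},
}

# Hypothetical region-referring expressions extracted from a question like
# "What color is the umbrella the man is holding?" (toy parse).
referring_expressions = [{"man"}, {"umbrella"}]

def match_score(expression, tokens):
    """Crude relevance score: token overlap between expression and object."""
    return float(len(expression & tokens))

# Build the lattice: a layered, weighted DAG with one layer per expression.
START, END = "<s>", "</s>"
layers = (
    [[START]]
    + [[(i, obj) for obj in scene_graph] for i in range(len(referring_expressions))]
    + [[END]]
)
edges = defaultdict(list)  # node -> list of (successor, weight)
for i in range(len(layers) - 1):
    for u in layers[i]:
        for v in layers[i + 1]:
            if v == END:
                w = 0.0
            else:
                layer_idx, obj = v
                w = match_score(referring_expressions[layer_idx], scene_graph[obj])
            edges[u].append((v, w))

# Retrieve the answer by dynamic programming: keep, for every node, the score
# and path of the best route from START (valid because the graph is a DAG).
best = {START: (0.0, [START])}
for layer in layers[1:]:
    for v in layer:
        candidates = [
            (best[u][0] + w, best[u][1] + [v])
            for u in best
            for (succ, w) in edges[u]
            if succ == v
        ]
        best[v] = max(candidates, key=lambda c: c[0])

score, path = best[END]
grounded_objects = [node[1] for node in path if isinstance(node, tuple)]
print("best path score:", score)
print("objects grounding the answer:", grounded_objects)
```

In this toy setup, the objects visited along the best path constitute the grounding for the retrieved answer, which illustrates why a retrieval formulation ties VG directly into the core search procedure rather than treating it as an afterthought of a classifier.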
Related papers
- On the Role of Visual Grounding in VQA [19.977539219231932]
"Visual Grounding" in VQA refers to a model's proclivity to infer answers based on question-relevant image regions.
DNN-based VQA models are notorious for bypassing VG by way of shortcut (SC) learning.
We propose a novel theoretical framework called "Visually Grounded Reasoning" (VGR) that uses the concepts of VG and Reasoning to describe VQA inference.
arXiv Detail & Related papers (2024-06-26T10:57:52Z)
- Uncovering the Full Potential of Visual Grounding Methods in VQA [23.600816131032936]
VG-methods attempt to improve Visual Question Answering (VQA) performance by strengthening a model's reliance on question-relevant visual information.
Training and testing of VG-methods are performed with largely inaccurate data, which obstructs a proper assessment of their potential benefits.
Our experiments show that these methods can be much more effective when evaluation conditions are corrected.
arXiv Detail & Related papers (2024-01-15T16:21:19Z)
- Measuring Faithful and Plausible Visual Grounding in VQA [23.717744098159717]
Metrics for Visual Grounding (VG) in Visual Question Answering (VQA) systems aim to measure a system's reliance on relevant parts of the image when inferring an answer to the given question.
Lack of VG has been a common problem among state-of-the-art VQA systems and can manifest in over-reliance on irrelevant image parts or a disregard for the visual modality entirely.
We propose a new VG metric that captures whether a model (a) identifies question-relevant objects in the scene, and (b) actually relies on the information contained in those objects when producing its answer.
arXiv Detail & Related papers (2023-05-24T10:58:02Z)
- REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering [75.53187719777812]
This paper revisits visual representation in knowledge-based visual question answering (VQA).
We propose a new knowledge-based VQA method REVIVE, which tries to utilize the explicit information of object regions.
We achieve new state-of-the-art performance, i.e., 58.0% accuracy, surpassing the previous state-of-the-art method by a large margin.
arXiv Detail & Related papers (2022-06-02T17:59:56Z)
- VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering [79.22069768972207]
We propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations.
Specifically, we inter-connect the scene graph and the concept graph through a super node that represents the QA context.
On two challenging VQA tasks, our method outperforms strong baseline VQA methods by 3.2% on VCR and 4.6% on GQA, suggesting its strength in performing concept-level reasoning.
arXiv Detail & Related papers (2022-05-23T17:55:34Z)
- Coarse-to-Fine Reasoning for Visual Question Answering [18.535633096397397]
We present a new reasoning framework to fill the gap between visual features and semantic clues in the Visual Question Answering (VQA) task.
Our method first extracts the features and predicates from the image and question.
We then propose a new reasoning framework that jointly learns these features and predicates in a coarse-to-fine manner.
arXiv Detail & Related papers (2021-10-06T06:29:52Z)
- Adventurer's Treasure Hunt: A Transparent System for Visually Grounded Compositional Visual Question Answering based on Scene Graphs [29.59479131119943]
"Adventurer's Treasure Hunt" (or ATH) is named after an analogy we draw between our model's search procedure for an answer and an adventurer's search for treasure.
ATH is the first GQA-trained VQA system that dynamically extracts answers by querying the visual knowledge base directly.
We report detailed results on all components and their contributions to overall VQA performance on the GQA dataset and show that ATH achieves the highest visual grounding score among all examined systems.
arXiv Detail & Related papers (2021-06-28T08:39:34Z)
- Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules [85.98177341704675]
The problem of grounding VQA tasks has seen increased attention in the research community recently.
We propose a visual capsule module with a query-based selection mechanism of capsule features.
We show that integrating the proposed capsule module in existing VQA systems significantly improves their performance on the weakly supervised grounding task.
arXiv Detail & Related papers (2021-05-11T07:45:32Z)
- Loss re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
arXiv Detail & Related papers (2020-10-30T00:57:17Z)
- C3VQG: Category Consistent Cyclic Visual Question Generation [51.339348810676896]
Visual Question Generation (VQG) is the task of generating natural questions based on an image.
In this paper, we try to exploit the different visual cues and concepts in an image to generate questions using a variational autoencoder (VAE) without ground-truth answers.
Our approach addresses two major shortcomings of existing VQG systems: (i) it minimizes the level of supervision, and (ii) it replaces generic questions with category-relevant generations.
arXiv Detail & Related papers (2020-05-15T20:25:03Z)
- In Defense of Grid Features for Visual Question Answering [65.71985794097426]
We revisit grid features for visual question answering (VQA) and find they can work surprisingly well.
We verify that this observation holds true across different VQA models and generalizes well to other tasks like image captioning.
We learn VQA models end-to-end, from pixels directly to answers, and show that strong performance is achievable without using any region annotations in pre-training.
arXiv Detail & Related papers (2020-01-10T18:59:13Z)
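For context on the last entry, the distinction between "grid features" and detector-based region features can be shown with a minimal sketch. The backbone choice, input size, and boxes below are arbitrary assumptions for demonstration and are not taken from the paper; the point is only that grid features are the spatial cells of a CNN feature map used directly as visual tokens, whereas region features are pooled from object boxes.

```python
# Illustrative sketch (not code from the paper): grid features vs. region
# features for VQA. Backbone, input size, and boxes are arbitrary choices.
# Assumes a recent torchvision (weights=None keyword).
import torch
import torchvision

# Backbone up to the last convolutional stage (output stride 32).
resnet = torchvision.models.resnet50(weights=None)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

image = torch.randn(1, 3, 448, 448)               # dummy input image
fmap = backbone(image)                            # [1, 2048, 14, 14]

# Grid features: flatten the 14x14 spatial map into 196 visual tokens.
grid_features = fmap.flatten(2).transpose(1, 2)   # [1, 196, 2048]

# Region features (detector-based alternative): pool fixed-size features from
# given boxes via RoIAlign; boxes are in image coordinates (x1, y1, x2, y2).
boxes = [torch.tensor([[30.0, 40.0, 200.0, 300.0],
                       [100.0, 50.0, 400.0, 420.0]])]
region_features = torchvision.ops.roi_align(
    fmap, boxes, output_size=(7, 7), spatial_scale=1.0 / 32
)                                                 # [2, 2048, 7, 7]

print(grid_features.shape, region_features.shape)
```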