Adventurer's Treasure Hunt: A Transparent System for Visually Grounded
Compositional Visual Question Answering based on Scene Graphs
- URL: http://arxiv.org/abs/2106.14476v1
- Date: Mon, 28 Jun 2021 08:39:34 GMT
- Title: Adventurer's Treasure Hunt: A Transparent System for Visually Grounded
Compositional Visual Question Answering based on Scene Graphs
- Authors: Daniel Reich, Felix Putze, Tanja Schultz
- Abstract summary: "Adventurer's Treasure Hunt" (or ATH) is named after an analogy we draw between our model's search procedure for an answer and an adventurer's search for treasure.
ATH is the first GQA-trained VQA system that dynamically extracts answers by querying the visual knowledge base directly.
We report detailed results on all components and their contributions to overall VQA performance on the GQA dataset and show that ATH achieves the highest visual grounding score among all examined systems.
- Score: 29.59479131119943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the expressed goal of improving system transparency and visual
grounding in the reasoning process of VQA, we present a modular system for the
task of compositional VQA based on scene graphs. Our system is called
"Adventurer's Treasure Hunt" (or ATH), named after an analogy we draw between
our model's search procedure for an answer and an adventurer's search for
treasure. We developed ATH with three characteristic features in mind:
1. By design, ATH allows us to explicitly quantify the impact of each
sub-component on overall VQA performance, as well as its performance on its
individual sub-task.
2. By modeling the search task after a treasure hunt, ATH inherently produces
an explicit, visually grounded inference path for the processed question.
3. ATH is the first GQA-trained VQA system that dynamically extracts answers
by querying the visual knowledge base directly, instead of selecting one from
a specially learned classifier's output distribution over a fixed answer
vocabulary.
We report detailed results on all components and their contributions to
overall VQA performance on the GQA dataset and show that ATH achieves the
highest visual grounding score among all examined systems.
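The answer-extraction idea in feature 3 can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes a simple dictionary-based scene graph and a hand-written three-step query plan (select, relate, query) for the question "What is the red mug on?"; ATH's actual question parsing and scene-graph components are not shown.

```python
# Minimal sketch (assumptions, not the ATH implementation): answer a compositional
# question by traversing a scene graph directly, instead of picking an answer from
# a classifier's distribution over a fixed answer vocabulary.

scene_graph = {
    "obj1": {"name": "mug", "attributes": ["red"], "relations": {"on": ["obj2"]}},
    "obj2": {"name": "table", "attributes": ["wooden"], "relations": {}},
}

def select(graph, name):
    """Find all objects whose class name matches the referred entity."""
    return [oid for oid, obj in graph.items() if obj["name"] == name]

def relate(graph, object_ids, relation):
    """Follow a relation edge from the currently grounded objects."""
    return [t for oid in object_ids
            for t in graph[oid]["relations"].get(relation, [])]

def query_name(graph, object_ids):
    """Read the answer off the graph nodes reached by the search."""
    return [graph[oid]["name"] for oid in object_ids]

# Hypothetical query plan for "What is the red mug on?":
# the sequence of visited objects is the visually grounded inference path.
step1 = select(scene_graph, "mug")          # -> ["obj1"]
step2 = relate(scene_graph, step1, "on")    # -> ["obj2"]
print(query_name(scene_graph, step2))       # -> ["table"]
```

The point of the sketch is that the returned answer string comes from the graph itself, so the traversed path (step1 to step2) can be inspected and scored for visual grounding.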
Related papers
- Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective [93.56647950778357]
Blind image quality assessment (BIQA) predicts the human perception of image quality without any reference information.
We develop a general and automated multitask learning scheme for BIQA to exploit auxiliary knowledge from other tasks.
arXiv Detail & Related papers (2023-03-27T07:58:09Z)
- Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization [119.23191388798921]
This paper deals with the problem of localizing objects in image and video datasets from visual exemplars.
We first identify grave implicit biases in current query-conditioned model design and visual query datasets.
We propose a novel transformer-based module that allows for object-proposal set context to be considered.
arXiv Detail & Related papers (2022-11-18T22:50:50Z)
- Visually Grounded VQA by Lattice-based Retrieval [24.298908211088072]
Visual Grounding (VG) in Visual Question Answering (VQA) systems describes how well a system manages to tie a question and its answer to relevant image regions.
In this work, we break with the dominant VQA modeling paradigm of classification and investigate VQA from the standpoint of an information retrieval task.
Our system operates over a weighted, directed, acyclic graph, a.k.a. "lattice", which is derived from the scene graph of a given image in conjunction with region-referring expressions extracted from the question.
arXiv Detail & Related papers (2022-11-15T12:12:08Z)
- Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task [12.74065821307626]
VQA is an ambitious task aiming to answer any image-related question.
It is hard to build such a system once and for all, since the needs of users are continuously updated.
We propose a real-data-free replay-based method tailored for CL on VQA, named Scene Graph as Prompt for Replay.
arXiv Detail & Related papers (2022-08-24T12:00:02Z)
- REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering [75.53187719777812]
This paper revisits visual representation in knowledge-based visual question answering (VQA).
We propose a new knowledge-based VQA method, REVIVE, which tries to utilize the explicit information of object regions.
We achieve new state-of-the-art performance, i.e., 58.0% accuracy, surpassing the previous state-of-the-art method by a large margin.
arXiv Detail & Related papers (2022-06-02T17:59:56Z)
- VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering [79.22069768972207]
We propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations.
Specifically, we inter-connect the scene graph and the concept graph through a super node that represents the QA context (a toy illustration of this construction appears after this list).
On two challenging VQA tasks, our method outperforms strong baseline VQA methods by 3.2% on VCR and 4.6% on GQA, suggesting its strength in performing concept-level reasoning.
arXiv Detail & Related papers (2022-05-23T17:55:34Z)
- From Easy to Hard: Learning Language-guided Curriculum for Visual Question Answering on Remote Sensing Data [27.160303686163164]
Visual question answering (VQA) for remote sensing scenes has great potential in intelligent human-computer interaction systems.
No object annotations are available in RSVQA datasets, which makes it difficult for models to exploit informative region representations.
There are questions with clearly different difficulty levels for each image in the RSVQA task.
A multi-level visual feature learning method is proposed to jointly extract language-guided holistic and regional image features.
arXiv Detail & Related papers (2022-05-06T11:37:00Z)
- Achieving Human Parity on Visual Question Answering [67.22500027651509]
The Visual Question Answering (VQA) task utilizes both visual image and language analysis to answer a textual question with respect to an image.
This paper describes our recent research on AliceMind-MMU, which obtains similar or even slightly better results than human beings on VQA.
This is achieved by systematically improving the VQA pipeline, including: (1) pre-training with comprehensive visual and textual feature representation; (2) effective cross-modal interaction with learning to attend; and (3) a novel knowledge mining framework with specialized expert modules for the complex VQA task.
arXiv Detail & Related papers (2021-11-17T04:25:11Z)
- Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules [85.98177341704675]
The problem of grounding VQA tasks has seen increased attention in the research community recently.
We propose a visual capsule module with a query-based selection mechanism of capsule features.
We show that integrating the proposed capsule module into existing VQA systems significantly improves their performance on the weakly supervised grounding task.
arXiv Detail & Related papers (2021-05-11T07:45:32Z)
- Component Analysis for Visual Question Answering Architectures [10.56011196733086]
The main goal of this paper is to provide a comprehensive analysis regarding the impact of each component in Visual Question Answering models.
Our major contribution is to identify core components for training VQA models so as to maximize their predictive performance.
arXiv Detail & Related papers (2020-02-12T17:25:50Z)
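As referenced in the VQA-GNN entry above, the super-node construction can be sketched with a toy example. This is an assumption-laden illustration, not the VQA-GNN code: the graphs, node names, and the networkx representation are invented for clarity; the actual model attaches learned node features and runs a graph neural network over the joint graph.

```python
# Toy sketch (assumption, not the VQA-GNN code): joining a scene graph and a
# concept graph through a single "super node" that represents the QA context,
# so that reasoning can flow between visual and conceptual knowledge.
import networkx as nx

# Toy scene graph: detected objects and their relations in the image.
scene = nx.Graph()
scene.add_edge("person", "surfboard", relation="holding")

# Toy concept graph: background knowledge about the involved concepts.
concepts = nx.Graph()
concepts.add_edge("surfboard", "surfing", relation="used_for")

# Unified graph: both sources plus one super node for the QA context.
unified = nx.compose(scene, concepts)
qa_node = "QA: What is the person about to do?"
unified.add_node(qa_node, kind="qa_context")
for node in list(scene.nodes) + list(concepts.nodes):
    unified.add_edge(qa_node, node, relation="context")

# A GNN would now pass messages over `unified`; here we just inspect the join.
print(sorted(unified.neighbors(qa_node)))  # ['person', 'surfboard', 'surfing']
```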