Dynamic Key-value Memory Enhanced Multi-step Graph Reasoning for
Knowledge-based Visual Question Answering
- URL: http://arxiv.org/abs/2203.02985v1
- Date: Sun, 6 Mar 2022 15:19:39 GMT
- Title: Dynamic Key-value Memory Enhanced Multi-step Graph Reasoning for
Knowledge-based Visual Question Answering
- Authors: Mingxiao Li, Marie-Francine Moens
- Abstract summary: Knowledge-based visual question answering (VQA) is a vision-language task that requires an agent to correctly answer image-related questions.
We propose a novel model named dynamic knowledge memory enhanced multi-step graph reasoning (DMMGR).
Our model achieves new state-of-the-art accuracy on the KRVQR and FVQA datasets.
- Score: 18.926582410644375
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge-based visual question answering (VQA) is a vision-language task
that requires an agent to correctly answer image-related questions using
knowledge that is not present in the given image. It is not only a more
challenging task than regular VQA but also a vital step towards building a
general VQA system. Most existing knowledge-based VQA systems process knowledge
and image information similarly and ignore the fact that the knowledge base
(KB) contains complete information about a triplet, while the extracted image
information might be incomplete because relations between two objects may be
missing or wrongly detected. In this paper, we propose a novel model named
dynamic knowledge memory enhanced multi-step graph reasoning (DMMGR), which
performs explicit and implicit reasoning over a key-value knowledge memory
module and a spatial-aware image graph, respectively. Specifically, the memory
module learns a dynamic knowledge representation and generates a
knowledge-aware question representation at each reasoning step. Then, this
representation is used to guide a graph attention operator over the
spatial-aware image graph. Our model achieves new state-of-the-art accuracy on
the KRVQR and FVQA datasets. We also conduct ablation experiments to prove the
effectiveness of each component of the proposed model.
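The per-step computation can be pictured as a key-value memory read followed by question-guided graph attention. Below is a minimal sketch of one such reasoning step, assuming simple dot-product attention and learned projections W_q and W_g; it illustrates the idea and is not the authors' implementation.

    # One DMMGR-style reasoning step (illustrative; names and shapes assumed).
    import torch
    import torch.nn.functional as F

    def reasoning_step(q, mem_keys, mem_vals, node_feats, W_q, W_g):
        # q:          (d,)    question representation at this step
        # mem_keys:   (m, d)  keys of the triplet key-value memory
        # mem_vals:   (m, d)  values of the triplet key-value memory
        # node_feats: (n, d)  spatial-aware image-graph node features
        attn = F.softmax(mem_keys @ (W_q @ q), dim=0)           # read the memory
        knowledge = attn @ mem_vals                             # retrieved triplet info
        q_k = q + knowledge                                     # knowledge-aware question
        node_attn = F.softmax(node_feats @ (W_g @ q_k), dim=0)  # guide graph attention
        context = node_attn @ node_feats                        # attended image context
        return q_k, context

Stacking several such steps gives the multi-step reasoning described above.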
Related papers
- Question-guided Knowledge Graph Re-scoring and Injection for Knowledge Graph Question Answering [27.414670144354453]
Knowledge graph question answering (KGQA) involves answering natural language questions by leveraging structured information stored in a knowledge graph.
We propose a Question-guided Knowledge Graph Re-scoring method (Q-KGR) to eliminate noisy pathways for the input question.
We also introduce Knowformer, a parameter-efficient method for injecting the re-scored knowledge graph into large language models to enhance their ability to perform factual reasoning.
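A minimal sketch of what such question-guided edge re-scoring could look like, keeping only the edges most similar to the question embedding; the similarity measure and keep ratio are assumptions, not details from the paper.

    import torch
    import torch.nn.functional as F

    def rescore_edges(question_emb, edge_embs, keep_ratio=0.5):
        # question_emb: (d,)    encoded question
        # edge_embs:    (e, d)  one embedding per knowledge-graph edge
        scores = F.cosine_similarity(edge_embs, question_emb.unsqueeze(0), dim=1)
        k = max(1, int(keep_ratio * len(scores)))
        keep = scores.topk(k).indices  # drop the noisiest pathways
        return keep, scores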
arXiv Detail & Related papers (2024-10-02T10:27:07Z)
- Multi-Clue Reasoning with Memory Augmentation for Knowledge-based Visual Question Answering [32.21000330743921]
We propose a novel framework that endows the model with the capability to answer more general questions.
Specifically, a well-defined detector is adopted to predict image-question related relation phrases.
The optimal answer is predicted by choosing the supporting fact with the highest score.
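At inference time, choosing the supporting fact with the highest score reduces to an argmax over candidate facts; a toy version, with the fused representation and fact embeddings assumed:

    import torch

    def pick_answer(fused_rep, fact_embs, fact_answers):
        # fused_rep:  (d,)    joint image-question representation
        # fact_embs:  (f, d)  one embedding per candidate supporting fact
        scores = fact_embs @ fused_rep      # dot-product score per fact
        best = int(scores.argmax())
        return fact_answers[best], float(scores[best])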
arXiv Detail & Related papers (2023-12-20T02:35:18Z)
- AVIS: Autonomous Visual Information Seeking with Large Language Model Agent [123.75169211547149]
We propose an autonomous information seeking visual question answering framework, AVIS.
Our method leverages a Large Language Model (LLM) to dynamically strategize the utilization of external tools.
AVIS achieves state-of-the-art results on knowledge-intensive visual question answering benchmarks such as Infoseek and OK-VQA.
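The dynamic tool strategy can be read as an LLM-driven decision loop; the skeleton below shows only that control flow, and every name in it (tools, llm_decide) is a placeholder rather than the AVIS API.

    def avis_loop(question, image, tools, llm_decide, max_steps=5):
        # tools: dict of name -> callable(question, image, history)
        history = []  # observations gathered so far
        for _ in range(max_steps):
            action = llm_decide(question, history)  # pick a tool or answer
            if "answer" in action:
                return action["answer"]
            result = tools[action["tool"]](question, image, history)
            history.append((action["tool"], result))
        return llm_decide(question, history).get("answer")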
arXiv Detail & Related papers (2023-06-13T20:50:22Z)
- REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering [75.53187719777812]
This paper revisits visual representation in knowledge-based visual question answering (VQA).
We propose a new knowledge-based VQA method REVIVE, which tries to utilize the explicit information of object regions.
We achieve new state-of-the-art performance, i.e., 58.0% accuracy, surpassing the previous state-of-the-art method by a large margin.
arXiv Detail & Related papers (2022-06-02T17:59:56Z)
- VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering [79.22069768972207]
We propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations.
Specifically, we inter-connect the scene graph and the concept graph through a super node that represents the QA context.
On two challenging VQA tasks, our method outperforms strong baseline VQA methods by 3.2% on VCR and 4.6% on GQA, suggesting its strength in performing concept-level reasoning.
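The super-node construction is concrete enough to sketch: add one node for the QA context and connect it to every node of both graphs so messages can flow in either direction; the graph library and attribute names below are assumptions.

    import networkx as nx

    def build_joint_graph(scene_graph, concept_graph, qa_context_feat):
        g = nx.union(scene_graph, concept_graph, rename=("img_", "kb_"))
        g.add_node("qa_context", feat=qa_context_feat)
        for node in list(g.nodes):
            if node != "qa_context":
                g.add_edge("qa_context", node)  # fusion path between modalities
        return g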
arXiv Detail & Related papers (2022-05-23T17:55:34Z)
- Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge [59.87823082513752]
This paper investigates the injection of knowledge from general-purpose knowledge bases (KBs) into vision-and-language transformers.
We empirically study the relevance of various KBs to multiple tasks and benchmarks.
The technique is model-agnostic and can expand the applicability of any vision-and-language transformer with minimal computational overhead.
arXiv Detail & Related papers (2021-01-15T08:37:55Z)
- KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA [107.7091094498848]
One of the most challenging question types in VQA is when answering the question requires outside knowledge not present in the image.
In this work we study open-domain knowledge: the setting where the knowledge required to answer a question is not given or annotated at either training or test time.
We tap into two types of knowledge representations and reasoning. First, implicit knowledge, which can be learned effectively from unsupervised language pre-training and from supervised training data with transformer-based models. Second, explicit symbolic knowledge encoded in knowledge bases.
arXiv Detail & Related papers (2020-12-20T20:13:02Z)
- Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate the question-answer pair based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z)
- Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering [27.042604046441426]
Knowledge-based Visual Question Answering (KVQA) requires external knowledge beyond the visible content to answer questions about an image.
In this paper, we depict an image by multiple knowledge graphs from the visual, semantic and factual views.
We decompose the model into a series of memory-based reasoning steps, each performed by a Graph-based Read, Update, and Control (GRUC) module.
We achieve a new state-of-the-art performance on three popular benchmark datasets, including FVQA, Visual7W-KB and OK-VQA.
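The decomposition into Read, Update, and Control steps suggests a loop of the following shape; read, update, and control here are placeholders for the paper's learned modules.

    def gruc_reasoning(question_rep, graphs, read, update, control, num_steps=3):
        # graphs: the visual, semantic, and factual views of the image
        memory = question_rep
        for t in range(num_steps):
            focus = control(question_rep, memory, t)     # decide what to seek
            evidence = [read(g, focus) for g in graphs]  # gather per-graph clues
            memory = update(memory, evidence)            # refine the memory state
        return memory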
arXiv Detail & Related papers (2020-08-31T23:25:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.