Cross-modal Knowledge Reasoning for Knowledge-based Visual Question
Answering
- URL: http://arxiv.org/abs/2009.00145v1
- Date: Mon, 31 Aug 2020 23:25:01 GMT
- Title: Cross-modal Knowledge Reasoning for Knowledge-based Visual Question
Answering
- Authors: Jing Yu, Zihao Zhu, Yujing Wang, Weifeng Zhang, Yue Hu, Jianlong Tan
- Abstract summary: Knowledge-based Visual Question Answering (KVQA) requires external knowledge beyond the visible content to answer questions about an image.
In this paper, we depict an image by multiple knowledge graphs from the visual, semantic and factual views.
We decompose the model into a series of memory-based reasoning steps, each performed by a Graph-based Read, Update, and Control (GRUC) module.
We achieve a new state-of-the-art performance on three popular benchmark datasets, including FVQA, Visual7W-KB and OK-VQA.
- Score: 27.042604046441426
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge-based Visual Question Answering (KVQA) requires external knowledge
beyond the visible content to answer questions about an image. This ability is
challenging but indispensable to achieve general VQA. One limitation of
existing KVQA solutions is that they jointly embed all kinds of information
without fine-grained selection, which introduces unexpected noise when
reasoning about the correct answer. How to capture the question-oriented and
information-complementary evidence remains a key challenge to solve the
problem. Inspired by the human cognition theory, in this paper, we depict an
image by multiple knowledge graphs from the visual, semantic and factual views.
Among these, the visual graph and semantic graph are regarded as
image-conditioned instantiations of the factual graph. On top of these new
representations, we re-formulate Knowledge-based Visual Question Answering as a
recurrent reasoning process for obtaining complementary evidence from
multimodal information. To this end, we decompose the model into a series of
memory-based reasoning steps, each performed by a Graph-based Read, Update,
and Control (GRUC) module that conducts parallel reasoning over both visual
and semantic information. By stacking the modules multiple times, our model
performs transitive reasoning and obtains question-oriented concept
representations under the constraint of different modalities. Finally, we
apply graph neural networks to infer the globally optimal answer by jointly
considering all the concepts. We achieve a new state-of-the-art performance on
three popular benchmark datasets, including FVQA, Visual7W-KB and OK-VQA, and
demonstrate the effectiveness and interpretability of our model with extensive
experiments.
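To make the recurrent reasoning procedure more concrete, below is a minimal PyTorch sketch of a single GRUC-style Read, Update, and Control step over one graph view. The module names, dimensions, and attention form are illustrative assumptions rather than the authors' exact formulation; in the paper, the memories produced by the stacked visual and semantic steps constrain the factual-graph concepts, over which a graph neural network selects the final answer.

```python
# Minimal sketch of a GRUC-style (Graph-based Read, Update, Control) reasoning
# step. All module names, dimensions, and the attention form are illustrative
# assumptions, not the paper's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GRUCStep(nn.Module):
    """One memory-based reasoning step over a single graph view (visual or semantic)."""

    def __init__(self, dim):
        super().__init__()
        self.read_attn = nn.Linear(2 * dim, 1)   # Read: attention over graph nodes
        self.update = nn.GRUCell(dim, dim)       # Update: fold evidence into memory
        self.control = nn.Linear(2 * dim, dim)   # Control: question-conditioned query

    def forward(self, node_feats, memory, question):
        # node_feats: (num_nodes, dim) node features for one modality
        # memory:     (dim,) current memory state
        # question:   (dim,) question embedding
        # Control: derive a step-specific query from the question and the memory.
        query = torch.tanh(self.control(torch.cat([question, memory], dim=-1)))
        # Read: attend over graph nodes with the query to gather evidence.
        scores = self.read_attn(
            torch.cat([node_feats, query.expand_as(node_feats)], dim=-1)
        ).squeeze(-1)
        evidence = (F.softmax(scores, dim=0).unsqueeze(-1) * node_feats).sum(dim=0)
        # Update: write the gathered evidence into the memory state.
        return self.update(evidence.unsqueeze(0), memory.unsqueeze(0)).squeeze(0)

# Stacking T steps in parallel over the visual and semantic graphs, then fusing
# the two memories, mimics the recurrent, transitive reasoning described above.
dim, T = 256, 3
visual_step, semantic_step = GRUCStep(dim), GRUCStep(dim)
visual_nodes = torch.randn(36, dim)     # e.g. region features (hypothetical sizes)
semantic_nodes = torch.randn(20, dim)   # e.g. caption/attribute concepts
question = torch.randn(dim)
mem_v = mem_s = torch.zeros(dim)
for _ in range(T):
    mem_v = visual_step(visual_nodes, mem_v, question)
    mem_s = semantic_step(semantic_nodes, mem_s, question)
fused = torch.cat([mem_v, mem_s], dim=-1)  # would condition the factual-graph GNN
print(fused.shape)  # torch.Size([512])
```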
Related papers
- A Comprehensive Survey on Visual Question Answering Datasets and Algorithms [1.941892373913038]
We meticulously analyze the current state of VQA datasets and models, while cleanly dividing them into distinct categories and then summarizing the methodologies and characteristics of each category.
We explore six main paradigms of VQA models: fusion, attention, the technique of using information from one modality to filter information from another, external knowledge base, composition or reasoning, and graph models.
arXiv Detail & Related papers (2024-11-17T18:52:06Z) - Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z) - Open-Set Knowledge-Based Visual Question Answering with Inference Paths [79.55742631375063]
The purpose of Knowledge-Based Visual Question Answering (KB-VQA) is to provide a correct answer to the question with the aid of external knowledge bases.
We propose a new retriever-ranker paradigm for KB-VQA, Graph pATH rankER (GATHER for brevity).
Specifically, it contains graph constructing, pruning, and path-level ranking, which not only retrieves accurate answers but also provides inference paths that explain the reasoning process.
arXiv Detail & Related papers (2023-10-12T09:12:50Z) - VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks
for Visual Question Answering [79.22069768972207]
We propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations.
Specifically, we inter-connect the scene graph and the concept graph through a super node that represents the QA context.
On two challenging VQA tasks, our method outperforms strong baseline VQA methods by 3.2% on VCR and 4.6% on GQA, suggesting its strength in performing concept-level reasoning.
arXiv Detail & Related papers (2022-05-23T17:55:34Z) - Dynamic Key-value Memory Enhanced Multi-step Graph Reasoning for
Knowledge-based Visual Question Answering [18.926582410644375]
Knowledge-based visual question answering (VQA) is a vision-language task that requires an agent to correctly answer image-related questions.
We propose a novel model named dynamic knowledge memory enhanced multi-step graph reasoning (DMMGR).
Our model achieves new state-of-the-art accuracy on the KRVQR and FVQA datasets.
arXiv Detail & Related papers (2022-03-06T15:19:39Z) - Knowledge-Routed Visual Question Reasoning: Challenges for Deep
Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate question-answer pairs based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z) - Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning" [49.76230210108583]
We propose a framework to isolate and evaluate the reasoning aspect of visual question answering (VQA) separately from its perception.
We also propose a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception.
On the challenging GQA dataset, this framework is used to perform in-depth, disentangled comparisons between well-known VQA models.
arXiv Detail & Related papers (2020-06-20T08:48:29Z) - Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual
Question Answering [26.21870452615222]
FVQA requires external knowledge beyond visible content to answer questions about an image.
How to capture the question-oriented and information-complementary evidence remains a key challenge to solve the problem.
We propose a modality-aware heterogeneous graph convolutional network to capture evidence from different layers that is most relevant to the given question.
arXiv Detail & Related papers (2020-06-16T11:03:37Z) - C3VQG: Category Consistent Cyclic Visual Question Generation [51.339348810676896]
Visual Question Generation (VQG) is the task of generating natural questions based on an image.
In this paper, we try to exploit the different visual cues and concepts in an image to generate questions using a variational autoencoder (VAE) without ground-truth answers.
Our approach addresses two major shortcomings of existing VQG systems: (i) minimizing the level of supervision and (ii) replacing generic questions with category-relevant generations.
arXiv Detail & Related papers (2020-05-15T20:25:03Z)