Cross-modal Knowledge Reasoning for Knowledge-based Visual Question
Answering
- URL: http://arxiv.org/abs/2009.00145v1
- Date: Mon, 31 Aug 2020 23:25:01 GMT
- Title: Cross-modal Knowledge Reasoning for Knowledge-based Visual Question
Answering
- Authors: Jing Yu, Zihao Zhu, Yujing Wang, Weifeng Zhang, Yue Hu, Jianlong Tan
- Abstract summary: Knowledge-based Visual Question Answering (KVQA) requires external knowledge beyond the visible content to answer questions about an image.
In this paper, we depict an image by multiple knowledge graphs from the visual, semantic and factual views.
We decompose the model into a series of memory-based reasoning steps, each performed by a Graph-based Read, Update, and Control (GRUC) module.
We achieve a new state-of-the-art performance on three popular benchmark datasets, including FVQA, Visual7W-KB and OK-VQA.
- Score: 27.042604046441426
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge-based Visual Question Answering (KVQA) requires external knowledge
beyond the visible content to answer questions about an image. This ability is
challenging but indispensable to achieve general VQA. One limitation of
existing KVQA solutions is that they jointly embed all kinds of information
without fine-grained selection, which introduces unexpected noise when
reasoning about the correct answer. How to capture the question-oriented and
information-complementary evidence remains a key challenge to solve the
problem. Inspired by the human cognition theory, in this paper, we depict an
image by multiple knowledge graphs from the visual, semantic and factual views.
Among these, the visual graph and semantic graph are regarded as
image-conditioned instantiations of the factual graph. On top of these new
representations, we re-formulate Knowledge-based Visual Question Answering as a
recurrent reasoning process for obtaining complementary evidence from
multimodal information. To this end, we decompose the model into a series of
memory-based reasoning steps, each performed by a Graph-based Read, Update,
and Control (GRUC) module that conducts parallel reasoning over both visual
and semantic information. By stacking the modules multiple times, our model
performs transitive reasoning and obtains question-oriented concept
representations under the constraint of different modalities. Finally, we
apply graph neural networks to infer the globally optimal answer by jointly
considering all the concepts. We achieve a new state-of-the-art performance on
three popular benchmark datasets, including FVQA, Visual7W-KB and OK-VQA, and
demonstrate the effectiveness and interpretability of our model with extensive
experiments.
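To make the recurrent reasoning procedure more concrete, below is a minimal PyTorch sketch of a single GRUC-style Read, Update, and Control step over one graph view. The module names, dimensions, and attention form are illustrative assumptions rather than the authors' exact formulation; in the paper, the memories produced by the stacked visual and semantic steps constrain the factual-graph concepts, over which a graph neural network selects the final answer.

```python
# Minimal sketch of a GRUC-style (Graph-based Read, Update, Control) reasoning
# step. All module names, dimensions, and the attention form are illustrative
# assumptions, not the paper's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GRUCStep(nn.Module):
    """One memory-based reasoning step over a single graph view (visual or semantic)."""

    def __init__(self, dim):
        super().__init__()
        self.read_attn = nn.Linear(2 * dim, 1)   # Read: attention over graph nodes
        self.update = nn.GRUCell(dim, dim)       # Update: fold evidence into memory
        self.control = nn.Linear(2 * dim, dim)   # Control: question-conditioned query

    def forward(self, node_feats, memory, question):
        # node_feats: (num_nodes, dim) node features for one modality
        # memory:     (dim,) current memory state
        # question:   (dim,) question embedding
        # Control: derive a step-specific query from the question and the memory.
        query = torch.tanh(self.control(torch.cat([question, memory], dim=-1)))
        # Read: attend over graph nodes with the query to gather evidence.
        scores = self.read_attn(
            torch.cat([node_feats, query.expand_as(node_feats)], dim=-1)
        ).squeeze(-1)
        evidence = (F.softmax(scores, dim=0).unsqueeze(-1) * node_feats).sum(dim=0)
        # Update: write the gathered evidence into the memory state.
        return self.update(evidence.unsqueeze(0), memory.unsqueeze(0)).squeeze(0)

# Stacking T steps in parallel over the visual and semantic graphs, then fusing
# the two memories, mimics the recurrent, transitive reasoning described above.
dim, T = 256, 3
visual_step, semantic_step = GRUCStep(dim), GRUCStep(dim)
visual_nodes = torch.randn(36, dim)     # e.g. region features (hypothetical sizes)
semantic_nodes = torch.randn(20, dim)   # e.g. caption/attribute concepts
question = torch.randn(dim)
mem_v = mem_s = torch.zeros(dim)
for _ in range(T):
    mem_v = visual_step(visual_nodes, mem_v, question)
    mem_s = semantic_step(semantic_nodes, mem_s, question)
fused = torch.cat([mem_v, mem_s], dim=-1)  # would condition the factual-graph GNN
print(fused.shape)  # torch.Size([512])
```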
Related papers
- A Comprehensive Survey on Visual Question Answering Datasets and Algorithms [1.941892373913038]
We meticulously analyze the current state of VQA datasets and models, while cleanly dividing them into distinct categories and then summarizing the methodologies and characteristics of each category.
We explore six main paradigms of VQA models: fusion, attention, the technique of using information from one modality to filter information from another, external knowledge base, composition or reasoning, and graph models.
arXiv Detail & Related papers (2024-11-17T18:52:06Z) - Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z) - Open-Set Knowledge-Based Visual Question Answering with Inference Paths [79.55742631375063]
The purpose of Knowledge-Based Visual Question Answering (KB-VQA) is to provide a correct answer to the question with the aid of external knowledge bases.
We propose a new retriever-ranker paradigm for KB-VQA, Graph pATH rankER (GATHER for brevity).
Specifically, it contains graph constructing, pruning, and path-level ranking, which not only retrieves accurate answers but also provides inference paths that explain the reasoning process.
arXiv Detail & Related papers (2023-10-12T09:12:50Z) - VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks
for Visual Question Answering [79.22069768972207]
We propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations.
Specifically, we inter-connect the scene graph and the concept graph through a super node that represents the QA context.
On two challenging VQA tasks, our method outperforms strong baseline VQA methods by 3.2% on VCR and 4.6% on GQA, suggesting its strength in performing concept-level reasoning.
arXiv Detail & Related papers (2022-05-23T17:55:34Z) - Dynamic Key-value Memory Enhanced Multi-step Graph Reasoning for
Knowledge-based Visual Question Answering [18.926582410644375]
Knowledge-based visual question answering (VQA) is a vision-language task that requires an agent to correctly answer image-related questions.
We propose a novel model named dynamic knowledge memory enhanced multi-step graph reasoning (DMMGR).
Our model achieves new state-of-the-art accuracy on the KRVQR and FVQA datasets.
arXiv Detail & Related papers (2022-03-06T15:19:39Z) - Knowledge-Routed Visual Question Reasoning: Challenges for Deep
Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate question-answer pairs based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z) - Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning" [49.76230210108583]
We propose a framework to isolate and evaluate the reasoning aspect of visual question answering (VQA) separately from its perception.
We also propose a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception.
On the challenging GQA dataset, this framework is used to perform in-depth, disentangled comparisons between well-known VQA models.
arXiv Detail & Related papers (2020-06-20T08:48:29Z) - Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual
Question Answering [26.21870452615222]
FVQA requires external knowledge beyond visible content to answer questions about an image.
How to capture the question-oriented and information-complementary evidence remains a key challenge to solve the problem.
We propose a modality-aware heterogeneous graph convolutional network to capture evidence from different layers that is most relevant to the given question.
arXiv Detail & Related papers (2020-06-16T11:03:37Z) - C3VQG: Category Consistent Cyclic Visual Question Generation [51.339348810676896]
Visual Question Generation (VQG) is the task of generating natural questions based on an image.
In this paper, we try to exploit the different visual cues and concepts in an image to generate questions using a variational autoencoder (VAE) without ground-truth answers.
Our approach addresses two major shortcomings of existing VQG systems: (i) minimizing the level of supervision and (ii) replacing generic questions with category-relevant generations.
arXiv Detail & Related papers (2020-05-15T20:25:03Z)