VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks
for Visual Question Answering
- URL: http://arxiv.org/abs/2205.11501v2
- Date: Fri, 15 Sep 2023 08:16:01 GMT
- Title: VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks
for Visual Question Answering
- Authors: Yanan Wang, Michihiro Yasunaga, Hongyu Ren, Shinya Wada, Jure Leskovec
- Abstract summary: We propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations.
Specifically, we inter-connect the scene graph and the concept graph through a super node that represents the QA context.
On two challenging VQA tasks, our method outperforms strong baseline VQA methods by 3.2% on VCR and 4.6% on GQA, suggesting its strength in performing concept-level reasoning.
- Score: 79.22069768972207
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual question answering (VQA) requires systems to perform concept-level
reasoning by unifying unstructured (e.g., the context in question and answer;
"QA context") and structured (e.g., knowledge graph for the QA context and
scene; "concept graph") multimodal knowledge. Existing works typically combine
a scene graph and a concept graph of the scene by connecting corresponding
visual nodes and concept nodes, then incorporate the QA context representation
to perform question answering. However, these methods only perform a
unidirectional fusion from unstructured knowledge to structured knowledge,
limiting their potential to capture joint reasoning over the heterogeneous
modalities of knowledge. To perform more expressive reasoning, we propose
VQA-GNN, a new VQA model that performs bidirectional fusion between
unstructured and structured multimodal knowledge to obtain unified knowledge
representations. Specifically, we inter-connect the scene graph and the concept
graph through a super node that represents the QA context, and introduce a new
multimodal GNN technique to perform inter-modal message passing for reasoning
that mitigates representational gaps between modalities. On two challenging VQA
tasks (VCR and GQA), our method outperforms strong baseline VQA methods by 3.2%
on VCR (Q-AR) and 4.6% on GQA, suggesting its strength in performing
concept-level reasoning. Ablation studies further demonstrate the efficacy of
the bidirectional fusion and multimodal GNN method in unifying unstructured and
structured multimodal knowledge.
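As a rough illustration of the core idea, the sketch below builds a toy joint graph in PyTorch: a QA-context super node is wired to both scene-graph (visual) and concept-graph (conceptual) nodes so that messages flow in both directions between the modalities. This is not the authors' implementation; the layer name, node counts, feature dimensions, mean-aggregation/GRU update rule, and the final answer head are all illustrative assumptions.

```python
# Minimal sketch (not the paper's code): a QA-context "super node" joins a
# scene graph and a concept graph into one joint graph, and a simple GNN
# layer passes messages across both modalities.
import torch
import torch.nn as nn

class InterModalGNNLayer(nn.Module):
    """One round of mean-aggregation message passing over the joint graph.
    The aggregation and update rules here are illustrative assumptions."""
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim)   # transform neighbour features into messages
        self.upd = nn.GRUCell(dim, dim)  # update node states with aggregated messages

    def forward(self, x, edge_index):
        src, dst = edge_index            # edges as (source, target) index rows
        messages = self.msg(x[src])
        agg = torch.zeros_like(x).index_add_(0, dst, messages)
        deg = torch.zeros(x.size(0), 1).index_add_(
            0, dst, torch.ones(dst.size(0), 1)).clamp(min=1)
        return self.upd(agg / deg, x)    # mean-aggregate, then GRU update

# Toy joint graph: 3 scene-graph nodes, 3 concept-graph nodes, 1 QA-context super node.
dim = 64
x = torch.randn(7, dim)                  # node features (visual, conceptual, QA context)
scene_edges   = [(0, 1), (1, 2)]         # intra scene-graph edges
concept_edges = [(3, 4), (4, 5)]         # intra concept-graph edges
super_node = 6
# The super node links both graphs, enabling scene <-> concept (bidirectional) fusion.
bridge_edges = [(super_node, i) for i in range(6)] + [(i, super_node) for i in range(6)]
edge_index = torch.tensor(scene_edges + concept_edges + bridge_edges,
                          dtype=torch.long).t()

layer = InterModalGNNLayer(dim)
for _ in range(3):                       # a few rounds of inter-modal message passing
    x = layer(x, edge_index)
answer_logits = nn.Linear(dim, 4)(x[super_node])  # score 4 candidate answers from the fused QA node
print(answer_logits.shape)               # torch.Size([4])
```

Because the super node both sends and receives messages, the unstructured QA context and the structured scene/concept knowledge are updated jointly rather than fused in only one direction.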
Related papers
- Improving and Diagnosing Knowledge-Based Visual Question Answering via
Entity Enhanced Knowledge Injection [14.678153928301493]
Knowledge-Based Visual Question Answering (KBVQA) is a bi-modal task that requires external world knowledge to correctly answer a text question about an associated image.
Recent text-only work has shown that injecting knowledge into pre-trained language models, specifically via entity-enhanced knowledge graph embeddings, can improve performance on downstream entity-centric tasks.
arXiv Detail & Related papers (2021-12-13T18:45:42Z)
- Coarse-to-Fine Reasoning for Visual Question Answering [18.535633096397397]
We present a new reasoning framework to fill the gap between visual features and semantic clues in the Visual Question Answering (VQA) task.
Our method first extracts the features and predicates from the image and question.
This framework then jointly learns these features and predicates in a coarse-to-fine manner.
arXiv Detail & Related papers (2021-10-06T06:29:52Z)
- Dynamic Semantic Graph Construction and Reasoning for Explainable
Multi-hop Science Question Answering [50.546622625151926]
We propose a new framework to exploit more valid facts while obtaining explainability for multi-hop QA.
Our framework contains three new ideas: (a) AMR-SG, an AMR-based semantic graph constructed from candidate fact AMRs to uncover any-hop relations among the question, the answer, and multiple facts; (b) a novel path-based fact analytics approach that exploits AMR-SG to extract active facts from a large fact pool to answer questions; and (c) fact-level relation modeling that leverages a graph convolutional network (GCN) to guide the reasoning process.
arXiv Detail & Related papers (2021-05-25T09:14:55Z)
- QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question
Answering [122.84513233992422]
We propose a new model, QA-GNN, which addresses the problem of answering questions using knowledge from pre-trained language models (LMs) and knowledge graphs (KGs).
We show its improvement over existing LM and LM+KG models, as well as its capability to perform interpretable and structured reasoning.
arXiv Detail & Related papers (2021-04-13T17:32:51Z)
- Contextualized Knowledge-aware Attentive Neural Network: Enhancing
Answer Selection with Knowledge [77.77684299758494]
We extensively investigate approaches to enhancing the answer selection model with external knowledge from a knowledge graph (KG).
First, we present a context-knowledge interaction learning framework, Knowledge-aware Neural Network (KNN), which learns QA sentence representations by tightly coupling the textual information with the external knowledge from the KG.
To handle the diversity and complexity of KG information, we then propose a Contextualized Knowledge-aware Attentive Neural Network (CKANN), which improves knowledge representation learning with structural information via a customized Graph Convolutional Network (GCN) and comprehensively learns context-based and knowledge-based sentence representations.
arXiv Detail & Related papers (2021-04-12T05:52:20Z)
- KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain
Knowledge-Based VQA [107.7091094498848]
One of the most challenging question types in VQA is when answering the question requires outside knowledge not present in the image.
In this work we study open-domain knowledge, the setting when the knowledge required to answer a question is not given/annotated, neither at training nor test time.
We tap into two types of knowledge representations and reasoning. First, implicit knowledge, which can be learned effectively from unsupervised language pre-training and supervised training data with transformer-based models. Second, symbolic knowledge encoded in knowledge bases.
arXiv Detail & Related papers (2020-12-20T20:13:02Z)
- Knowledge-Routed Visual Question Reasoning: Challenges for Deep
Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate question-answer pairs based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z)
- Cross-modal Knowledge Reasoning for Knowledge-based Visual Question
Answering [27.042604046441426]
Knowledge-based Visual Question Answering (KVQA) requires external knowledge beyond the visible content to answer questions about an image.
In this paper, we depict an image by multiple knowledge graphs from the visual, semantic and factual views.
We decompose the model into a series of memory-based reasoning steps, each performed by a Graph-based Read, Update, and Control (GRUC) module.
We achieve a new state-of-the-art performance on three popular benchmark datasets, including FVQA, Visual7W-KB and OK-VQA.
arXiv Detail & Related papers (2020-08-31T23:25:01Z)