Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual
Question Answering
- URL: http://arxiv.org/abs/2006.09073v3
- Date: Wed, 4 Nov 2020 01:36:36 GMT
- Title: Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual
Question Answering
- Authors: Zihao Zhu, Jing Yu, Yujing Wang, Yajing Sun, Yue Hu, Qi Wu
- Abstract summary: FVQA requires external knowledge beyond visible content to answer questions about an image.
How to capture the question-oriented and information-complementary evidence remains a key challenge in solving this problem.
We propose a modality-aware heterogeneous graph convolutional network to capture evidence from different layers that is most relevant to the given question.
- Score: 26.21870452615222
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fact-based Visual Question Answering (FVQA) requires external knowledge
beyond visible content to answer questions about an image, which is challenging
but indispensable to achieve general VQA. One limitation of existing FVQA
solutions is that they jointly embed all kinds of information without
fine-grained selection, which introduces unexpected noise when reasoning about the
final answer. How to capture the question-oriented and information-complementary
evidence remains a key challenge in solving this problem. In this paper, we depict
an image by a multi-modal heterogeneous
graph, which contains multiple layers of information corresponding to the
visual, semantic and factual features. On top of the multi-layer graph
representations, we propose a modality-aware heterogeneous graph convolutional
network to capture evidence from different layers that is most relevant to the
given question. Specifically, the intra-modal graph convolution selects evidence
from each modality, while the cross-modal graph convolution aggregates relevant
information across different modalities. By stacking this process
multiple times, our model performs iterative reasoning and predicts the optimal
answer by analyzing all question-oriented evidence. We achieve a new
state-of-the-art performance on the FVQA task and demonstrate the effectiveness
and interpretability of our model with extensive experiments.
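To make the reasoning pipeline above concrete, here is a minimal, illustrative PyTorch-style sketch of the described loop: a question-guided intra-modal graph convolution within each layer (visual, semantic, factual), cross-modal graph convolutions that route complementary evidence into the fact layer, and a few stacked reasoning steps before answer prediction. All module names, dimensions, the gating/attention forms, and the mean-pooled readout are illustrative assumptions, not the authors' released implementation.
```python
import torch
import torch.nn as nn


class IntraModalConv(nn.Module):
    """Question-guided graph convolution within a single modality layer."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)      # question-node relevance gate
        self.update = nn.Linear(2 * dim, dim)  # node update after aggregation

    def forward(self, h, adj, q):
        # h: (N, dim) node features, adj: (N, N) adjacency, q: (dim,) question
        q_exp = q.expand(h.size(0), -1)
        gate = torch.sigmoid(self.gate(torch.cat([h, q_exp], dim=-1)))  # (N, 1)
        msg = adj @ (gate * h)            # aggregate question-relevant neighbours
        return torch.relu(self.update(torch.cat([h, msg], dim=-1)))


class CrossModalConv(nn.Module):
    """Attention-based aggregation of complementary evidence across modalities."""

    def __init__(self, dim):
        super().__init__()
        self.att = nn.Linear(2 * dim, 1)
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, h_dst, h_src):
        # Every destination node attends over all source-modality nodes.
        n_dst, n_src = h_dst.size(0), h_src.size(0)
        pairs = torch.cat(
            [h_dst.unsqueeze(1).expand(-1, n_src, -1),
             h_src.unsqueeze(0).expand(n_dst, -1, -1)], dim=-1)
        alpha = torch.softmax(self.att(pairs).squeeze(-1), dim=-1)  # (n_dst, n_src)
        msg = alpha @ h_src
        return torch.relu(self.update(torch.cat([h_dst, msg], dim=-1)))


class MuckoStyleReasoner(nn.Module):
    """Stacked intra-/cross-modal reasoning over visual, semantic and fact layers."""

    def __init__(self, dim=128, steps=2, num_answers=500):
        super().__init__()
        self.steps = steps
        self.intra = nn.ModuleDict(
            {m: IntraModalConv(dim) for m in ("visual", "semantic", "fact")})
        # Evidence from the visual and semantic layers is routed into the fact
        # layer, where candidate answer entities live.
        self.cross_v2f = CrossModalConv(dim)
        self.cross_s2f = CrossModalConv(dim)
        self.classify = nn.Linear(dim, num_answers)

    def forward(self, graphs, q):
        # graphs: {modality: (node_features, adjacency)}, q: (dim,) question vector
        h = {m: feat for m, (feat, _) in graphs.items()}
        for _ in range(self.steps):                          # iterative reasoning
            h = {m: self.intra[m](h[m], graphs[m][1], q) for m in h}
            h["fact"] = self.cross_v2f(h["fact"], h["visual"])
            h["fact"] = self.cross_s2f(h["fact"], h["semantic"])
        return self.classify(h["fact"].mean(dim=0))          # simple readout


if __name__ == "__main__":
    dim = 128
    toy = {m: (torch.randn(n, dim), torch.eye(n))
           for m, n in (("visual", 6), ("semantic", 4), ("fact", 10))}
    model = MuckoStyleReasoner(dim=dim)
    print(model(toy, torch.randn(dim)).shape)  # torch.Size([500])
```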
Related papers
- A Comprehensive Survey on Visual Question Answering Datasets and Algorithms [1.941892373913038]
We meticulously analyze the current state of VQA datasets and models, while cleanly dividing them into distinct categories and then summarizing the methodologies and characteristics of each category.
We explore six main paradigms of VQA models: fusion, attention, the technique of using information from one modality to filter information from another, external knowledge base, composition or reasoning, and graph models.
arXiv Detail & Related papers (2024-11-17T18:52:06Z) - Multimodal Commonsense Knowledge Distillation for Visual Question Answering [12.002744625599425]
We propose a novel graph-based multimodal commonsense knowledge distillation framework that constructs a unified graph over commonsense knowledge, visual objects, and questions, and processes it with a Graph Convolutional Network (GCN) within a teacher-student framework.
The proposed framework works with any type of teacher and student model without further fine-tuning, and achieves competitive performance on the ScienceQA dataset.
arXiv Detail & Related papers (2024-11-05T01:37:16Z) - QAGCF: Graph Collaborative Filtering for Q&A Recommendation [58.21387109664593]
Question and answer (Q&A) platforms usually recommend question-answer pairs to meet users' knowledge acquisition needs.
This makes user behaviors more complex, and presents two challenges for Q&A recommendation.
We introduce Question & Answer Graph Collaborative Filtering (QAGCF), a graph neural network model that creates separate graphs for collaborative and semantic views.
arXiv Detail & Related papers (2024-06-07T10:52:37Z) - UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z) - LOIS: Looking Out of Instance Semantics for Visual Question Answering [17.076621453814926]
We propose a model framework without bounding boxes to understand the causal nexus of object semantics in images.
We implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information.
Our proposed attention model can further analyze salient image regions by focusing on important word-related questions.
arXiv Detail & Related papers (2023-07-26T12:13:00Z) - Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in
Visual Question Answering [71.6781118080461]
We propose a Graph Matching Attention (GMA) network for the Visual Question Answering (VQA) task.
First, it builds a graph for the image and also constructs a graph for the question in terms of both syntactic and embedding information.
Next, we explore the intra-modality relationships with a dual-stage graph encoder and then present a bilateral cross-modality graph matching attention to infer the relationships between the image and the question; a minimal illustrative sketch of this bilateral matching step is given after this list.
Experiments demonstrate that our network achieves state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset.
arXiv Detail & Related papers (2021-12-14T10:01:26Z) - Bridge to Answer: Structure-aware Graph Interaction Network for Video
Question Answering [56.65656211928256]
This paper presents a novel method, termed Bridge to Answer, to infer correct answers for questions about a given video.
We learn question-conditioned visual graphs by exploiting the relation between the video and the question, so that each visual node benefits from question-to-visual interactions.
Our method learns question-conditioned visual representations of appearance and motion that show strong capability for video question answering.
arXiv Detail & Related papers (2021-04-29T03:02:37Z) - Knowledge-Routed Visual Question Reasoning: Challenges for Deep
Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate the question-answer pair based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z) - Cross-modal Knowledge Reasoning for Knowledge-based Visual Question
Answering [27.042604046441426]
Knowledge-based Visual Question Answering (KVQA) requires external knowledge beyond the visible content to answer questions about an image.
In this paper, we depict an image by multiple knowledge graphs from the visual, semantic and factual views.
We decompose the model into a series of memory-based reasoning steps, each performed by a Graph-based Read, Update, and Control (GRUC) module.
We achieve a new state-of-the-art performance on three popular benchmark datasets, including FVQA, Visual7W-KB and OK-VQA.
arXiv Detail & Related papers (2020-08-31T23:25:01Z) - SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions [66.86887670416193]
We show that state-of-the-art VQA models have comparable performance in answering perception and reasoning questions, but suffer from consistency problems.
To address this shortcoming, we propose an approach called Sub-Question-aware Network Tuning (SQuINT).
We show that SQuINT improves model consistency by 5% and marginally improves performance on the Reasoning questions in VQA, while also producing better attention maps.
arXiv Detail & Related papers (2020-01-20T01:02:36Z)
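As referenced in the Bilateral Cross-Modality Graph Matching Attention entry above, the following is a minimal sketch of a bilateral cross-modality attention step between an image graph and a question graph. The bilinear affinity, residual updates, and all names and dimensions are simplifying assumptions for illustration, not the exact GMA module of the cited paper.
```python
import torch
import torch.nn as nn


class BilateralGraphMatchingAttention(nn.Module):
    """Bilateral attention between image-graph and question-graph node features."""

    def __init__(self, dim):
        super().__init__()
        self.w_affinity = nn.Linear(dim, dim, bias=False)  # bilinear affinity weight
        self.proj_v = nn.Linear(dim, dim)
        self.proj_q = nn.Linear(dim, dim)

    def forward(self, v, q):
        # v: (n_v, dim) image-graph nodes, q: (n_q, dim) question-graph nodes
        affinity = self.w_affinity(v) @ q.t()                            # (n_v, n_q)
        v_new = v + torch.softmax(affinity, dim=1) @ self.proj_q(q)      # question -> image
        q_new = q + torch.softmax(affinity.t(), dim=1) @ self.proj_v(v)  # image -> question
        return v_new, q_new


if __name__ == "__main__":
    gma = BilateralGraphMatchingAttention(dim=64)
    v_out, q_out = gma(torch.randn(8, 64), torch.randn(5, 64))
    print(v_out.shape, q_out.shape)  # torch.Size([8, 64]) torch.Size([5, 64])
```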