MuKEA: Multimodal Knowledge Extraction and Accumulation for
Knowledge-based Visual Question Answering
- URL: http://arxiv.org/abs/2203.09138v1
- Date: Thu, 17 Mar 2022 07:42:14 GMT
- Title: MuKEA: Multimodal Knowledge Extraction and Accumulation for
Knowledge-based Visual Question Answering
- Authors: Yang Ding, Jing Yu, Bang Liu, Yue Hu, Mingxin Cui, Qi Wu
- Abstract summary: We propose MuKEA to represent multimodal knowledge by an explicit triplet to correlate visual objects and fact answers with implicit relations.
By adopting a pre-training and fine-tuning learning strategy, both basic and domain-specific multimodal knowledge are progressively accumulated for answer prediction.
- Score: 23.628740943735167
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge-based visual question answering requires the ability to
associate external knowledge for open-ended cross-modal scene understanding. One
limitation of existing solutions is that they capture relevant knowledge from
text-only knowledge bases, which merely contain facts expressed by first-order
predicates or language descriptions while lacking complex but indispensable
multimodal knowledge for visual understanding. How to construct vision-relevant
and explainable multimodal knowledge for the VQA scenario has been less
studied. In this paper, we propose MuKEA to represent multimodal knowledge by
an explicit triplet to correlate visual objects and fact answers with implicit
relations. To bridge the heterogeneous gap, we propose three objective losses
to learn the triplet representations from complementary views: embedding
structure, topological relation and semantic space. By adopting a pre-training
and fine-tuning learning strategy, both basic and domain-specific multimodal
knowledge are progressively accumulated for answer prediction. We outperform
the state-of-the-art by 3.35% and 6.08% on two challenging knowledge-required
datasets, OK-VQA and KRVQA, respectively. Experimental results demonstrate the
complementary benefits of the accumulated multimodal knowledge to existing
knowledge bases and the advantages of our end-to-end framework over existing pipeline
methods. The code is available at https://github.com/AndersonStra/MuKEA.
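The three complementary objectives named in the abstract (embedding structure, topological relation, semantic space) can be pictured with a small sketch. The snippet below is a minimal, hypothetical PyTorch rendering of the triplet idea: the projection heads, the TransE-style structure term, the in-batch negative sampling, and the answer-vocabulary classifier are illustrative assumptions rather than the paper's exact formulation; the official implementation lives at the GitHub link above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TripletKnowledgeHead(nn.Module):
    """Projects a fused image-question feature into a (head, relation, tail) triplet."""
    def __init__(self, feat_dim: int, embed_dim: int, num_answers: int):
        super().__init__()
        self.head_proj = nn.Linear(feat_dim, embed_dim)   # visual-object side of the triplet
        self.rel_proj = nn.Linear(feat_dim, embed_dim)    # implicit relation
        self.tail_proj = nn.Linear(feat_dim, embed_dim)   # predicted fact-answer side
        self.answer_embed = nn.Embedding(num_answers, embed_dim)  # gold tails (answer vocabulary)

    def forward(self, fused_feat: torch.Tensor):
        return self.head_proj(fused_feat), self.rel_proj(fused_feat), self.tail_proj(fused_feat)

def triplet_losses(h, r, t, answer_ids, answer_embed, margin: float = 1.0):
    gold_tail = answer_embed(answer_ids)                              # (B, D)
    # 1) Embedding structure: head + relation should land near the gold tail (TransE-style).
    loss_struct = F.mse_loss(h + r, gold_tail)
    # 2) Topological relation: pull the predicted tail toward its gold tail,
    #    push it away from a shuffled in-batch negative.
    neg_tail = gold_tail[torch.randperm(gold_tail.size(0))]
    loss_topo = F.triplet_margin_loss(t, gold_tail, neg_tail, margin=margin)
    # 3) Semantic space: score the predicted tail against the whole answer vocabulary.
    logits = t @ answer_embed.weight.t()                              # (B, num_answers)
    loss_sem = F.cross_entropy(logits, answer_ids)
    return loss_struct + loss_topo + loss_sem

# Toy usage with random "fused" features standing in for a real vision-language encoder.
model = TripletKnowledgeHead(feat_dim=768, embed_dim=256, num_answers=3000)
feats = torch.randn(8, 768)
answers = torch.randint(0, 3000, (8,))
h, r, t = model(feats)
loss = triplet_losses(h, r, t, answers, model.answer_embed)
loss.backward()

In a setup like this the three terms would normally be weighted, and the pre-training and fine-tuning schedule described in the abstract would first fit the head on broad VQA-style data and then on the knowledge-required datasets; the equal weights and single step here are placeholders.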
Related papers
- Knowledge Condensation and Reasoning for Knowledge-based VQA [20.808840633377343]
Recent studies retrieve the knowledge passages from external knowledge bases and then use them to answer questions.
We propose two synergistic models: Knowledge Condensation model and Knowledge Reasoning model.
Our method achieves state-of-the-art performance on knowledge-based VQA datasets.
arXiv Detail & Related papers (2024-03-15T06:06:06Z)
- Multi-Clue Reasoning with Memory Augmentation for Knowledge-based Visual Question Answering [32.21000330743921]
We propose a novel framework that endows the model with the capability to answer more general questions.
Specifically, a well-defined detector is adopted to predict image-question related relation phrases.
The optimal answer is predicted by choosing the supporting fact with the highest score.
arXiv Detail & Related papers (2023-12-20T02:35:18Z)
- DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z)
- Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering [30.858737348472626]
Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question.
Recent works have resorted to using a powerful large language model (LLM) as an implicit knowledge engine to acquire the necessary knowledge for answering.
We present a conceptually simple, flexible, and general framework designed to prompt LLMs with answer heuristics for knowledge-based VQA.
arXiv Detail & Related papers (2023-03-03T13:05:15Z)
- Multimodal Dialog Systems with Dual Knowledge-enhanced Generative Pretrained Language Model [63.461030694700014]
We propose a novel dual knowledge-enhanced generative pretrained language model for multimodal task-oriented dialog systems (DKMD).
The proposed DKMD consists of three key components: dual knowledge selection, dual knowledge-enhanced context learning, and knowledge-enhanced response generation.
Experiments on a public dataset verify the superiority of the proposed DKMD over state-of-the-art competitors.
arXiv Detail & Related papers (2022-07-16T13:02:54Z)
- VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering [79.22069768972207]
We propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations.
Specifically, we inter-connect the scene graph and the concept graph through a super node that represents the QA context.
On two challenging VQA tasks, our method outperforms strong baseline VQA methods by 3.2% on VCR and 4.6% on GQA, suggesting its strength in performing concept-level reasoning.
arXiv Detail & Related papers (2022-05-23T17:55:34Z)
- Contextualized Knowledge-aware Attentive Neural Network: Enhancing Answer Selection with Knowledge [77.77684299758494]
We extensively investigate approaches to enhancing the answer selection model with external knowledge from a knowledge graph (KG).
First, we present a context-knowledge interaction learning framework, Knowledge-aware Neural Network (KNN), which learns the QA sentence representations by considering a tight interaction with the external knowledge from KG and the textual information.
To handle the diversity and complexity of KG information, we propose a Contextualized Knowledge-aware Attentive Neural Network (CKANN), which improves knowledge representation learning with structure information via a customized Graph Convolutional Network (GCN) and comprehensively learns context-based and knowledge-based sentence representations.
arXiv Detail & Related papers (2021-04-12T05:52:20Z)
- Multi-Modal Answer Validation for Knowledge-Based VQA [44.80209704315099]
We propose Multi-modal Answer Validation using External knowledge (MAVEx).
The idea is to validate a set of promising answer candidates based on answer-specific knowledge retrieval.
Our experiments with OK-VQA, a challenging knowledge-based VQA dataset, demonstrate that MAVEx achieves new state-of-the-art results.
arXiv Detail & Related papers (2021-03-23T00:49:36Z)
- KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA [107.7091094498848]
One of the most challenging question types in VQA is when answering the question requires outside knowledge not present in the image.
In this work we study open-domain knowledge, the setting in which the knowledge required to answer a question is not given or annotated at either training or test time.
We tap into two types of knowledge representations and reasoning. First, implicit knowledge, which can be learned effectively from unsupervised language pre-training and supervised training data with transformer-based models.
arXiv Detail & Related papers (2020-12-20T20:13:02Z)
- Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate the question-answer pair based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z)