A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA
- URL: http://arxiv.org/abs/2206.14989v1
- Date: Thu, 30 Jun 2022 02:35:04 GMT
- Title: A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA
- Authors: Yangyang Guo, Liqiang Nie, Yongkang Wong, Yibing Liu, Zhiyong Cheng
and Mohan Kankanhalli
- Abstract summary: This paper presents a unified end-to-end retriever-reader framework towards knowledge-based VQA.
We shed light on the multi-modal implicit knowledge from vision-language pre-training models to mine its potential in knowledge reasoning.
Our scheme not only provides guidance for knowledge retrieval, but also drops instances that are potentially error-prone for question answering.
- Score: 67.75989848202343
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge-based Visual Question Answering (VQA) expects models to rely on
external knowledge for robust answer prediction. Despite its significance, this
paper identifies several leading factors impeding the advancement of current
state-of-the-art methods. On the one hand, methods that exploit explicit
knowledge treat it as a complement to a coarsely trained VQA model. Despite
their effectiveness, these approaches often suffer from noise incorporation and
error propagation. On the other hand, the multi-modal implicit knowledge
embedded in vision-language pre-training models remains largely unexplored for
knowledge-based VQA. This work presents a unified end-to-end
retriever-reader framework towards knowledge-based VQA. In particular, we shed
light on the multi-modal implicit knowledge from vision-language pre-training
models to mine its potential in knowledge reasoning. As for the noise problem
encountered by the retrieval operation on explicit knowledge, we design a novel
scheme to create pseudo labels for effective knowledge supervision. This scheme
not only provides guidance for knowledge retrieval, but also drops instances
that are potentially error-prone for question answering. To validate
the effectiveness of the proposed method, we conduct extensive experiments on
the benchmark dataset. The experimental results reveal that our method
outperforms existing baselines by a noticeable margin. Beyond the reported
numbers, this paper further offers several insights on knowledge utilization
for future research, supported by empirical findings.
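To make the retriever-reader idea concrete, below is a minimal, self-contained sketch of such a pipeline with heuristic pseudo-label supervision. It is an illustration only: the toy overlap-based scorer, the answer-containment pseudo-label rule, and all names (score_passage, build_pseudo_labels, retrieve_and_read) are assumptions for exposition, not the paper's actual method, which builds on vision-language pre-training models.

```python
# Hypothetical sketch of a retriever-reader pipeline for knowledge-based VQA
# with pseudo-label supervision. Not the paper's implementation.
from typing import List, Optional, Tuple


def score_passage(question: str, passage: str) -> float:
    """Toy retriever score: token overlap between question and passage.
    A real retriever would score fused image-question features against
    passage embeddings from a vision-language pre-trained model."""
    q_tokens = set(question.lower().split())
    p_tokens = set(passage.lower().split())
    return len(q_tokens & p_tokens) / max(len(q_tokens), 1)


def build_pseudo_labels(passages: List[str], answer: str) -> List[int]:
    """Heuristic pseudo labels for retrieval supervision (an assumption):
    a passage is marked positive (1) if it mentions the gold answer."""
    return [1 if answer.lower() in p.lower() else 0 for p in passages]


def retrieve_and_read(
    question: str,
    passages: List[str],
    answer: str,
    top_k: int = 3,
) -> Optional[Tuple[List[str], List[int]]]:
    """Retrieve top-k passages; drop the training instance entirely when no
    passage supports the answer, since such instances are likely error-prone."""
    labels = build_pseudo_labels(passages, answer)
    if not any(labels):
        return None  # drop potentially error-prone instance
    ranked = sorted(
        zip(passages, labels),
        key=lambda pair: score_passage(question, pair[0]),
        reverse=True,
    )[:top_k]
    top_passages = [p for p, _ in ranked]
    top_labels = [l for _, l in ranked]
    # A reader model would consume (image, question, top_passages) here
    # and be trained jointly with the retriever using top_labels.
    return top_passages, top_labels


if __name__ == "__main__":
    kb = [
        "A zebra is an African equine with black and white stripes.",
        "Bananas are rich in potassium.",
    ]
    print(retrieve_and_read("What animal has black and white stripes?",
                            kb, answer="zebra"))
```

In this sketch, the pseudo labels serve two roles consistent with the abstract: they supervise which retrieved passages count as positives, and they identify instances with no supporting knowledge, which are dropped rather than propagated to the reader.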
Related papers
- Large Language Models are Limited in Out-of-Context Knowledge Reasoning [65.72847298578071]
Large Language Models (LLMs) possess extensive knowledge and strong capabilities in performing in-context reasoning.
This paper focuses on a significant aspect of out-of-context reasoning: Out-of-Context Knowledge Reasoning (OCKR), which combines multiple pieces of knowledge to infer new knowledge.
arXiv Detail & Related papers (2024-06-11T15:58:59Z) - Knowledge Condensation and Reasoning for Knowledge-based VQA [20.808840633377343]
Recent studies retrieve the knowledge passages from external knowledge bases and then use them to answer questions.
We propose two synergistic models: Knowledge Condensation model and Knowledge Reasoning model.
Our method achieves state-of-the-art performance on knowledge-based VQA datasets.
arXiv Detail & Related papers (2024-03-15T06:06:06Z) - Distinguish Before Answer: Generating Contrastive Explanation as
Knowledge for Commonsense Question Answering [61.53454387743701]
We propose CPACE, a concept-centric Prompt-bAsed Contrastive Explanation Generation model.
CPACE converts obtained symbolic knowledge into a contrastive explanation for better distinguishing the differences among given candidates.
We conduct a series of experiments on three widely-used question-answering datasets: CSQA, QASC, and OBQA.
arXiv Detail & Related papers (2023-05-14T12:12:24Z) - Ontology-enhanced Prompt-tuning for Few-shot Learning [41.51144427728086]
Few-shot learning aims to make predictions based on a limited number of samples.
Structured data such as knowledge graphs and ontology libraries have been leveraged to benefit the few-shot setting in various tasks.
arXiv Detail & Related papers (2022-01-27T05:41:36Z) - KAT: A Knowledge Augmented Transformer for Vision-and-Language [56.716531169609915]
We propose a novel model - Knowledge Augmented Transformer (KAT) - which achieves a strong state-of-the-art result on the open-domain multimodal task of OK-VQA.
Our approach integrates implicit and explicit knowledge in an end-to-end encoder-decoder architecture, while jointly reasoning over both knowledge sources during answer generation.
An additional benefit of explicit knowledge integration is seen in improved interpretability of model predictions in our analysis.
arXiv Detail & Related papers (2021-12-16T04:37:10Z) - Incremental Knowledge Based Question Answering [52.041815783025186]
We propose a new incremental KBQA learning framework that can progressively expand learning capacity as humans do.
Specifically, it comprises a margin-distilled loss and a collaborative selection method, to overcome the catastrophic forgetting problem.
The comprehensive experiments demonstrate its effectiveness and efficiency when working with the evolving knowledge base.
arXiv Detail & Related papers (2021-01-18T09:03:38Z) - KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain
Knowledge-Based VQA [107.7091094498848]
One of the most challenging question types in VQA is when answering the question requires outside knowledge not present in the image.
In this work we study open-domain knowledge, the setting when the knowledge required to answer a question is not given/annotated, neither at training nor test time.
We tap into two types of knowledge representations and reasoning. First, implicit knowledge which can be learned effectively from unsupervised language pre-training and supervised training data with transformer-based models.
arXiv Detail & Related papers (2020-12-20T20:13:02Z) - Improving Commonsense Question Answering by Graph-based Iterative
Retrieval over Multiple Knowledge Sources [26.256653692882715]
How to engage commonsense effectively in question answering systems is still under exploration.
We propose a novel question-answering method by integrating ConceptNet, Wikipedia, and the Cambridge Dictionary.
We use a pre-trained language model to encode the question, retrieved knowledge and choices, and propose an answer choice-aware attention mechanism.
arXiv Detail & Related papers (2020-11-05T08:50:43Z)