Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering
- URL: http://arxiv.org/abs/2303.01903v3
- Date: Thu, 14 Dec 2023 02:20:32 GMT
- Title: Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering
- Authors: Zhou Yu, Xuecheng Ouyang, Zhenwei Shao, Meng Wang, Jun Yu
- Abstract summary: Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question.
Recent works have resorted to using a powerful large language model (LLM) as an implicit knowledge engine to acquire the necessary knowledge for answering.
We present Prophet, a conceptually simple, flexible, and general framework designed to prompt LLMs with answer heuristics for knowledge-based VQA.
- Score: 30.858737348472626
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge-based visual question answering (VQA) requires external knowledge
beyond the image to answer the question. Early studies retrieve required
knowledge from explicit knowledge bases (KBs), which often introduces
irrelevant information to the question, hence restricting the performance of
their models. Recent works have resorted to using a powerful large language
model (LLM) as an implicit knowledge engine to acquire the necessary knowledge
for answering. Despite the encouraging results achieved by these methods, we
argue that they have not fully activated the capacity of the blind LLM as the
provided textual input is insufficient to depict the required visual
information to answer the question. In this paper, we present Prophet -- a
conceptually simple, flexible, and general framework designed to prompt LLMs
with answer heuristics for knowledge-based VQA. Specifically, we first train a
vanilla VQA model on a specific knowledge-based VQA dataset without external
knowledge. After that, we extract two types of complementary answer heuristics
from the VQA model: answer candidates and answer-aware examples. Finally, the
two types of answer heuristics are jointly encoded into a formatted prompt to
facilitate the LLM's understanding of both the image and question, thus
generating a more accurate answer. By incorporating the state-of-the-art LLM
GPT-3, Prophet significantly outperforms existing state-of-the-art methods on
four challenging knowledge-based VQA datasets. To demonstrate the generality of
our approach, we instantiate Prophet with combinations of different VQA
models (i.e., both discriminative and generative ones) and different LLMs
(i.e., both commercial and open-source ones).
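To make the pipeline concrete, the following is a minimal sketch of how the two answer heuristics could be assembled into a prompt. The AnswerHeuristics container, field names, and prompt wording are illustrative assumptions, not the paper's exact format.

```python
# Minimal sketch of Prophet-style prompt construction (illustrative;
# the container, field names, and wording are assumptions).
from dataclasses import dataclass

@dataclass
class AnswerHeuristics:
    question: str
    caption: str                          # textual stand-in for the image
    candidates: list[tuple[str, float]]   # (answer, confidence) from the VQA model
    answer: str | None = None             # known for answer-aware examples

def format_entry(h: AnswerHeuristics) -> str:
    """Render one prompt entry: context, question, scored candidates, answer."""
    cands = ", ".join(f"{a} ({c:.2f})" for a, c in h.candidates)
    return (f"Context: {h.caption}\nQuestion: {h.question}\n"
            f"Candidates: {cands}\nAnswer: {h.answer or ''}")

def build_prompt(examples: list[AnswerHeuristics],
                 test: AnswerHeuristics) -> str:
    """Prepend answer-aware examples; the LLM completes the final Answer."""
    head = "Answer each question, using the candidates as hints.\n\n"
    return head + "\n\n".join(format_entry(e) for e in examples + [test])
```

In the paper, the answer-aware examples are training samples that the VQA model places near the test sample in its latent answer space; that selection step, and the LLM call itself, are omitted from this sketch.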
Related papers
- Knowledge Acquisition Disentanglement for Knowledge-based Visual Question Answering with Large Language Models [10.526705722339775]
Knowledge-based Visual Question Answering (KVQA) requires both image and world knowledge to answer questions.
Current methods first retrieve knowledge from the image and an external knowledge base using the original complex question, then generate answers with Large Language Models (LLMs).
We propose DKA: Disentangled Knowledge Acquisition from LLM feedback, a training-free framework that disentangles knowledge acquisition to avoid confusion.
arXiv Detail & Related papers (2024-07-22T03:05:32Z)
- Prompting Large Language Models with Knowledge Graphs for Question Answering Involving Long-tail Facts [50.06633829833144]
Large Language Models (LLMs) are effective in performing various NLP tasks, but struggle to handle tasks that require extensive, real-world knowledge.
We propose a benchmark whose questions require knowledge of long-tail facts to answer.
Our experiments show that LLMs alone struggle with answering these questions, especially when the long-tail level is high or rich knowledge is required.
arXiv Detail & Related papers (2024-05-10T15:10:20Z)
- Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering [11.183845003492964]
We use Dense Passage Retrieval (DPR) to retrieve related knowledge to help the model answer questions.
However, DPR conducts retrieval in natural language space, which may not ensure comprehensive acquisition of image information.
We propose a novel framework that leverages a visual-language model to select the key knowledge retrieved by DPR and answer questions.
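A rough sketch of the retrieve-then-select pipeline this entry describes. The embed function below is a placeholder stand-in for a real dense encoder, and the VLM-based knowledge selection is reduced to simple top-k truncation.

```python
# Hedged sketch of DPR-style retrieval; `embed` is a placeholder for a
# real dense encoder, and VLM selection is reduced to top-k.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder encoder: pseudo-random unit-norm vectors per text."""
    vecs = []
    for t in texts:
        rng = np.random.default_rng(abs(hash(t)) % 2**32)
        v = rng.normal(size=128)
        vecs.append(v / np.linalg.norm(v))
    return np.stack(vecs)

def retrieve(question: str, passages: list[str], k: int = 5) -> list[str]:
    """Rank passages by dot product with the question embedding (DPR-style)."""
    q = embed([question])[0]
    scores = embed(passages) @ q
    return [passages[i] for i in np.argsort(scores)[::-1][:k]]
```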
arXiv Detail & Related papers (2024-04-22T07:44:20Z)
- Knowledge Condensation and Reasoning for Knowledge-based VQA [20.808840633377343]
Recent studies retrieve knowledge passages from external knowledge bases and then use them to answer questions.
We propose two synergistic models: Knowledge Condensation model and Knowledge Reasoning model.
Our method achieves state-of-the-art performance on knowledge-based VQA datasets.
arXiv Detail & Related papers (2024-03-15T06:06:06Z)
- Open-Set Knowledge-Based Visual Question Answering with Inference Paths [79.55742631375063]
The purpose of Knowledge-Based Visual Question Answering (KB-VQA) is to provide a correct answer to the question with the aid of external knowledge bases.
We propose a new retriever-ranker paradigm for KB-VQA, Graph pATH rankER (GATHER for brevity).
Specifically, it comprises graph construction, pruning, and path-level ranking, which not only retrieves accurate answers but also provides inference paths that explain the reasoning process.
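As a toy illustration of the path-level idea (not the paper's implementation), the sketch below enumerates bounded-length paths from a question entity over a small invented graph and ranks them with a stand-in scorer; GATHER instead learns a path-level ranker.

```python
# Toy path enumeration and ranking over an invented graph; the scorer
# (prefer shorter paths) is a placeholder for a learned ranker.
from collections import defaultdict

def paths_from(graph: dict, start: str, max_len: int):
    """Yield relation paths (lists of (relation, entity) hops) from `start`."""
    stack = [(start, [])]
    while stack:
        node, path = stack.pop()
        if path:
            yield path
        if len(path) < max_len:
            for rel, nxt in graph.get(node, []):
                stack.append((nxt, path + [(rel, nxt)]))

graph = defaultdict(list)
graph["zebra"] += [("eats", "grass"), ("lives_in", "savanna")]
graph["savanna"] += [("located_in", "Africa")]

# Rank paths with the stand-in scorer and read the answer off the endpoint.
best = min(paths_from(graph, "zebra", max_len=2), key=len)
print(best[-1][1])  # endpoint entity of the top-ranked path
```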
arXiv Detail & Related papers (2023-10-12T09:12:50Z)
- Empowering Language Models with Knowledge Graph Reasoning for Question Answering [117.79170629640525]
We propose the knOwledge REasOning empowered Language Model (OREO-LM).
OREO-LM consists of a novel Knowledge Interaction Layer that can be flexibly plugged into existing Transformer-based LMs.
We show significant performance gains, achieving state-of-the-art results in the closed-book setting.
arXiv Detail & Related papers (2022-11-15T18:26:26Z)
- GreaseLM: Graph REASoning Enhanced Language Models for Question Answering [159.9645181522436]
GreaseLM is a new model that fuses encoded representations from pretrained LMs and graph neural networks over multiple layers of modality interaction operations.
We show that GreaseLM can more reliably answer questions that require reasoning over both situational constraints and structured knowledge, even outperforming models 8x larger.
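A hedged sketch of one plausible modality-interaction step in this spirit: a text vector from the LM and a graph vector from the GNN are concatenated, mixed by an MLP, and split back into updated halves. The dimensions and wiring are assumptions, not GreaseLM's exact design.

```python
# Illustrative fusion step between an LM representation and a GNN
# representation; sizes and wiring are assumptions for the sketch.
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.mix = nn.Sequential(nn.Linear(2 * dim, 2 * dim), nn.GELU(),
                                 nn.Linear(2 * dim, 2 * dim))

    def forward(self, text: torch.Tensor, graph: torch.Tensor):
        """Mix the two modalities jointly, then split back into halves."""
        joint = self.mix(torch.cat([text, graph], dim=-1))
        d = text.shape[-1]
        return joint[..., :d], joint[..., d:]

text, graph = torch.randn(1, 768), torch.randn(1, 768)
layer = FusionLayer(768)
text, graph = layer(text, graph)  # one round of modality interaction
```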
arXiv Detail & Related papers (2022-01-21T19:00:05Z)
- KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA [107.7091094498848]
One of the most challenging question types in VQA is when answering the question requires outside knowledge not present in the image.
In this work, we study open-domain knowledge: the setting in which the knowledge required to answer a question is not given or annotated at either training or test time.
We tap into two types of knowledge representations and reasoning. First, implicit knowledge, which can be learned effectively from unsupervised language pre-training and supervised training data with transformer-based models. Second, symbolic knowledge encoded in knowledge bases.
arXiv Detail & Related papers (2020-12-20T20:13:02Z)
- Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate question-answer pairs based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
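A tiny sketch of what such a controlled program might look like, with invented triples and an invented question template: a scene-graph edge grounds the object in the image, and a KB triple supplies the queried fact.

```python
# Toy controlled-program QA generation from a scene-graph triple plus a
# KB triple; the triples and question template are invented.
scene = [("man", "holding", "umbrella")]            # scene-graph edges
kb = {"umbrella": ("used_for", "rain protection")}  # external knowledge

def make_qa(scene, kb):
    """Compose questions that route through the image into the KB."""
    for subj, rel, obj in scene:
        if obj in kb:
            kb_rel, kb_val = kb[obj]
            q = f"What is the object the {subj} is {rel} {kb_rel.replace('_', ' ')}?"
            yield q, kb_val

for q, a in make_qa(scene, kb):
    print(q, "->", a)  # What is the object the man is holding used for? -> rain protection
```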
arXiv Detail & Related papers (2020-12-14T00:33:44Z)