GeReA: Question-Aware Prompt Captions for Knowledge-based Visual
Question Answering
- URL: http://arxiv.org/abs/2402.02503v1
- Date: Sun, 4 Feb 2024 14:28:23 GMT
- Title: GeReA: Question-Aware Prompt Captions for Knowledge-based Visual
Question Answering
- Authors: Ziyu Ma, Shutao Li, Bin Sun, Jianfei Cai, Zuxiang Long, and Fuyan Ma
- Abstract summary: We argue that a multimodal large language model (MLLM) is a better implicit knowledge engine than a large language model (LLM) because of its superior visual understanding.
We propose GeReA, a generate-reason framework that prompts an MLLM such as InstructBLIP with question-relevant vision and language information to generate knowledge-relevant descriptions.
Specifically, question-relevant image regions and question-specific manual prompts are encoded by the MLLM to generate the knowledge-relevant descriptions.
- Score: 37.11794716736831
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge-based visual question answering (VQA) requires world knowledge
beyond the image for an accurate answer. Recently, instead of relying on external
knowledge bases, a large language model (LLM) such as GPT-3 has been used as an
implicit knowledge engine that jointly acquires and reasons over the necessary
knowledge by converting images into textual information (e.g., captions and
answer candidates). However, such conversion may introduce irrelevant
information, which causes the LLM to misinterpret the image and ignore visual
details crucial for accurate knowledge. We argue that a multimodal large language
model (MLLM) is a better implicit knowledge engine than an LLM because of its
superior capability of visual understanding. Despite this, how to activate the
capacity of an MLLM as an implicit knowledge engine has not yet been explored.
Therefore, we propose GeReA, a generate-reason framework that prompts an MLLM
such as InstructBLIP with question-relevant visual and linguistic information to
generate knowledge-relevant descriptions, and then reasons over those descriptions
for knowledge-based VQA. Specifically, question-relevant image regions and
question-specific manual prompts are encoded by the MLLM to generate the
knowledge-relevant descriptions, referred to as question-aware prompt captions.
After that, the question-aware prompt captions, the image-question pair, and
similar samples are fed into a multimodal reasoning model to learn a joint
knowledge-image-question representation for answer prediction. GeReA unlocks
the use of an MLLM as an implicit knowledge engine, surpassing all previous
state-of-the-art methods on the OK-VQA and A-OKVQA datasets, with test accuracies
of 66.5% and 63.3%, respectively. Our code will be released at
https://github.com/Upper9527/GeReA.
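As a rough, non-authoritative illustration of the two-stage pipeline described in the abstract, the sketch below first prompts the public Hugging Face InstructBLIP checkpoint with a question-specific prompt over question-relevant image regions, then packs the resulting captions with similar question-answer samples into an input for a downstream reasoning step. The checkpoint name, region boxes, prompt wording, and the text-only packing of the reasoning input are assumptions for illustration, not the paper's exact implementation.

```python
# Rough sketch of GeReA's generate-reason pipeline (assumptions noted below).
# Stage 1 prompts InstructBLIP with question-relevant regions and a
# question-specific prompt; stage 2 here is only a text-only stand-in for the
# paper's multimodal reasoning model.
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

MODEL_ID = "Salesforce/instructblip-flan-t5-xl"  # public checkpoint; the paper's exact setup may differ
processor = InstructBlipProcessor.from_pretrained(MODEL_ID)
model = InstructBlipForConditionalGeneration.from_pretrained(MODEL_ID)

def question_aware_prompt_captions(image, question, region_boxes):
    """Generate a knowledge-relevant description for each question-relevant region."""
    captions = []
    for box in region_boxes:              # (left, upper, right, lower) boxes from any region proposer
        crop = image.crop(box)            # question-relevant image region
        prompt = (f"Question: {question} "  # question-specific manual prompt (illustrative wording)
                  "Describe this image region with facts useful for answering the question.")
        inputs = processor(images=crop, text=prompt, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=60)
        captions.append(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
    return captions

def reasoning_input(question, captions, similar_samples):
    """Pack prompt captions and similar (question, answer) samples into one text input.

    The paper's reasoning model is multimodal and learns a joint
    knowledge-image-question representation; this is only a textual approximation.
    """
    examples = " ".join(f"Q: {q} A: {a}" for q, a in similar_samples)
    return f"{examples} Context: {' '.join(captions)} Question: {question} Answer:"

if __name__ == "__main__":
    img = Image.open("example.jpg")       # hypothetical input image
    q = "What sport can you use this object for?"
    caps = question_aware_prompt_captions(img, q, [(0, 0, img.width, img.height)])
    print(reasoning_input(q, caps, similar_samples=[]))
```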
Related papers
- Knowledge Acquisition Disentanglement for Knowledge-based Visual Question Answering with Large Language Models [10.526705722339775]
Knowledge-based Visual Question Answering (KVQA) requires both image and world knowledge to answer questions.
Current methods first retrieve knowledge from the image and an external knowledge base with the original complex question, then generate answers with Large Language Models (LLMs).
We propose DKA: Disentangled Knowledge Acquisition from LLM feedback, a training-free framework that disentangles knowledge acquisition to avoid confusion.
arXiv Detail & Related papers (2024-07-22T03:05:32Z)
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast, high-quality image-text datasets.
However, the inherent difficulty of explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- Untangle the KNOT: Interweaving Conflicting Knowledge and Reasoning Skills in Large Language Models [51.72963030032491]
Knowledge documents provided to large language models (LLMs) may conflict with the LLMs' memory due to outdated or incorrect knowledge.
We construct a new dataset, dubbed KNOT, for examining knowledge conflict resolution in the form of question answering.
arXiv Detail & Related papers (2024-04-04T16:40:11Z)
- Filling the Image Information Gap for VQA: Prompting Large Language Models to Proactively Ask Questions [15.262736501208467]
Large Language Models (LLMs) demonstrate impressive reasoning ability and the maintenance of world knowledge.
As images are invisible to LLMs, researchers convert images to text to engage LLMs in the visual question reasoning procedure.
We design a framework that enables LLMs to proactively ask relevant questions to unveil more details in the image.
arXiv Detail & Related papers (2023-11-20T08:23:39Z)
- Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering [7.888547093390469]
Large Language Models (LLMs) are capable of performing zero-shot closed-book question answering tasks.
We propose to augment the knowledge directly in the input of LLMs; a toy sketch of this idea appears after this list.
Our framework, Knowledge-Augmented language model PromptING (KAPING), requires no model training and is thus completely zero-shot.
arXiv Detail & Related papers (2023-06-07T04:15:21Z)
- Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering [30.858737348472626]
Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question.
Recent works have resorted to using a powerful large language model (LLM) as an implicit knowledge engine to acquire the necessary knowledge for answering.
We present a conceptually simple, flexible, and general framework designed to prompt the LLM with answer heuristics for knowledge-based VQA.
arXiv Detail & Related papers (2023-03-03T13:05:15Z)
- VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge [48.457788853408616]
We propose a method to generate, select, and encode external commonsense knowledge alongside visual and textual cues.
We show that VLC-BERT is capable of outperforming existing models that utilize static knowledge bases.
arXiv Detail & Related papers (2022-10-24T22:01:17Z)
- GreaseLM: Graph REASoning Enhanced Language Models for Question Answering [159.9645181522436]
GreaseLM is a new model that fuses encoded representations from pretrained LMs and graph neural networks over multiple layers of modality interaction operations.
We show that GreaseLM can more reliably answer questions that require reasoning over both situational constraints and structured knowledge, even outperforming models 8x larger.
arXiv Detail & Related papers (2022-01-21T19:00:05Z)
- KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA [107.7091094498848]
One of the most challenging question types in VQA arises when answering the question requires outside knowledge not present in the image.
In this work we study open-domain knowledge, the setting in which the knowledge required to answer a question is not given or annotated at either training or test time.
We tap into two types of knowledge representations and reasoning: first, implicit knowledge, which can be learned effectively from unsupervised language pre-training and supervised training data with transformer-based models.
arXiv Detail & Related papers (2020-12-20T20:13:02Z)
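As a toy illustration of the knowledge-augmented prompting idea summarized in the KAPING entry above, the sketch below verbalizes retrieved knowledge-graph facts and prepends them to the question before calling an LLM, with no model training. The retrieval function, prompt template, and helper names are assumptions for illustration, not the paper's exact design.

```python
# Toy sketch of knowledge-augmented prompting in the spirit of KAPING: verbalize
# retrieved facts and prepend them to the question (zero-shot, no fine-tuning).
# The retrieval step and prompt template below are illustrative assumptions.
from typing import Callable, Iterable, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object) from a knowledge graph

def verbalize(triples: Iterable[Triple]) -> str:
    """Turn knowledge-graph triples into plain sentences the LLM can read."""
    return " ".join(f"{s} {r} {o}." for s, r, o in triples)

def knowledge_augmented_prompt(question: str, triples: Iterable[Triple]) -> str:
    """Prepend the verbalized knowledge to the question."""
    return (f"Below are facts that may be relevant to the question.\n"
            f"{verbalize(triples)}\n"
            f"Question: {question}\nAnswer:")

def answer(question: str,
           retrieve: Callable[[str], Iterable[Triple]],
           llm: Callable[[str], str]) -> str:
    """retrieve() and llm() are placeholders for any retriever and any LLM."""
    return llm(knowledge_augmented_prompt(question, retrieve(question)))
```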