An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
- URL: http://arxiv.org/abs/2109.05014v1
- Date: Fri, 10 Sep 2021 17:51:06 GMT
- Title: An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
- Authors: Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng
Liu, Lijuan Wang
- Abstract summary: We propose PICa, a simple yet effective method that Prompts GPT3 via the use of Image Captions for knowledge-based VQA.
We first convert the image into captions (or tags) that GPT-3 can understand, then adapt GPT-3 to solve the VQA task in a few-shot manner.
By using only 16 examples, PICa surpasses the supervised state of the art by an absolute +8.6 points on the OK-VQA dataset.
- Score: 51.639880603821446
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge-based visual question answering (VQA) involves answering questions
that require external knowledge not present in the image. Existing methods
first retrieve knowledge from external resources, then reason over the selected
knowledge, the input image, and question for answer prediction. However, this
two-step approach could lead to mismatches that potentially limit the VQA
performance. For example, the retrieved knowledge might be noisy and irrelevant
to the question, and the re-embedded knowledge features during reasoning might
deviate from their original meanings in the knowledge base (KB). To address
this challenge, we propose PICa, a simple yet effective method that Prompts
GPT3 via the use of Image Captions, for knowledge-based VQA. Inspired by
GPT-3's power in knowledge retrieval and question answering, instead of using
structured KBs as in previous work, we treat GPT-3 as an implicit and
unstructured KB that can jointly acquire and process relevant knowledge.
Specifically, we first convert the image into captions (or tags) that GPT-3 can
understand, then adapt GPT-3 to solve the VQA task in a few-shot manner by just
providing a few in-context VQA examples. We further boost performance by
carefully investigating: (i) what text formats best describe the image content,
and (ii) how in-context examples can be better selected and used. PICa unlocks
the first use of GPT-3 for multimodal tasks. By using only 16 examples, PICa
surpasses the supervised state of the art by an absolute +8.6 points on the
OK-VQA dataset. We also benchmark PICa on VQAv2, where PICa also shows a decent
few-shot performance.
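To make the recipe described in the abstract concrete, here is a minimal sketch of assembling a caption-plus-in-context-examples prompt for a GPT-3-style completion model. The prompt wording, the example pool, and the `complete()` call are illustrative assumptions, not the authors' exact implementation (the paper additionally studies caption/tag formats and how the in-context examples are selected).

```python
# Minimal PICa-style prompting sketch (illustrative; not the authors' code).
# Assumes captions are already available from an off-the-shelf captioner, and
# that `complete(prompt)` wraps a GPT-3-style text-completion endpoint.

def build_prompt(in_context_examples, test_caption, test_question):
    """Assemble a few-shot prompt: solved VQA examples first, then the test query."""
    header = "Please answer the question according to the above context.\n===\n"
    blocks = []
    for ex in in_context_examples:  # e.g., 16 examples, as in the paper
        blocks.append(
            f"Context: {ex['caption']}\n"
            f"Question: {ex['question']}\n"
            f"Answer: {ex['answer']}\n===\n"
        )
    test_block = (
        f"Context: {test_caption}\n"
        f"Question: {test_question}\n"
        f"Answer:"
    )
    return header + "".join(blocks) + test_block


# Hypothetical usage; `examples` and `complete` are placeholders.
examples = [
    {"caption": "A red double-decker bus on a city street.",
     "question": "In which country are these buses common?",
     "answer": "england"},
]
prompt = build_prompt(examples, "A bowl of sliced mangoes on a table.",
                      "What vitamin is this fruit rich in?")
# answer = complete(prompt)   # call a GPT-3-style completion API here
print(prompt)
```

The in-context examples need not be fixed: the paper reports that selecting them by similarity to the test question/image works better than a random set.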
Related papers
- Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering [11.183845003492964]
We use Dense Passage Retrieval (DPR) to retrieve related knowledge that helps the model answer questions.
DPR retrieves in natural-language space, which may not ensure comprehensive acquisition of image information.
We propose a novel framework that leverages a visual-language model to select the key knowledge retrieved by DPR and answer questions (a generic DPR scoring sketch follows this entry).
arXiv Detail & Related papers (2024-04-22T07:44:20Z)
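The entry above relies on DPR-based knowledge retrieval. As a rough illustration only (not that paper's pipeline), the snippet below scores candidate passages against a question using the publicly available DPR encoders in Hugging Face transformers; the question and passages are made-up examples.

```python
# Generic DPR passage-scoring sketch (illustrative; not the cited paper's code).
import torch
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

q_name = "facebook/dpr-question_encoder-single-nq-base"
c_name = "facebook/dpr-ctx_encoder-single-nq-base"
q_enc, q_tok = DPRQuestionEncoder.from_pretrained(q_name), DPRQuestionEncoderTokenizer.from_pretrained(q_name)
c_enc, c_tok = DPRContextEncoder.from_pretrained(c_name), DPRContextEncoderTokenizer.from_pretrained(c_name)

question = "What do giraffes mainly eat?"          # made-up example
passages = ["Giraffes browse on leaves of acacia trees.",
            "The Eiffel Tower is located in Paris."]

with torch.no_grad():
    q_emb = q_enc(**q_tok(question, return_tensors="pt")).pooler_output                 # (1, 768)
    p_emb = c_enc(**c_tok(passages, return_tensors="pt", padding=True)).pooler_output   # (2, 768)

scores = (q_emb @ p_emb.T).squeeze(0)    # dot-product relevance, as in DPR
print(passages[int(scores.argmax())])    # highest-scoring passage
```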
- Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts [3.6064695344878093]
Visual question answering (VQA) is considered an AI-complete task, as it requires understanding, reasoning, and inference over both visual and language content.
This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline.
arXiv Detail & Related papers (2024-04-12T16:35:23Z)
- ConVQG: Contrastive Visual Question Generation with Multimodal Guidance [20.009626292937995]
We propose Contrastive Visual Question Generation (ConVQG) to generate image-grounded, text-guided, and knowledge-rich questions.
Experiments on knowledge-aware and standard VQG benchmarks demonstrate that ConVQG outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2024-02-20T09:20:30Z)
- Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts [54.072432123447854]
Visual question answering (VQA) is the task of answering questions about an image.
Answering the question requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image.
We propose a framework that uses language guidance (LG) in the form of rationales, image captions, scene graphs, etc., to answer questions more accurately.
arXiv Detail & Related papers (2023-10-31T03:54:11Z)
- A Simple Baseline for Knowledge-Based Visual Question Answering [78.00758742784532]
This paper addresses the problem of Knowledge-Based Visual Question Answering (KB-VQA).
Our main contribution in this paper is to propose a much simpler and readily reproducible pipeline.
Contrary to recent approaches, our method is training-free, does not require access to external databases or APIs, and achieves state-of-the-art accuracy on the OK-VQA and A-OK-VQA datasets.
arXiv Detail & Related papers (2023-10-20T15:08:17Z)
- Open-Set Knowledge-Based Visual Question Answering with Inference Paths [79.55742631375063]
The purpose of Knowledge-Based Visual Question Answering (KB-VQA) is to provide a correct answer to the question with the aid of external knowledge bases.
We propose a new retriever-ranker paradigm for KB-VQA, Graph pATH rankER (GATHER for brevity).
Specifically, it comprises graph construction, pruning, and path-level ranking, which not only retrieves accurate answers but also provides inference paths that explain the reasoning process.
arXiv Detail & Related papers (2023-10-12T09:12:50Z)
- PromptCap: Prompt-Guided Task-Aware Image Captioning [118.39243917422492]
We propose PromptCap, a captioning model designed to serve as a better connector between images and black-box LMs.
PromptCap takes a natural-language prompt that controls which visual entities to describe in the generated caption.
We demonstrate PromptCap's effectiveness on an existing pipeline in which GPT-3 is prompted with image captions to carry out VQA.
arXiv Detail & Related papers (2022-11-15T19:07:53Z)
- Can Open Domain Question Answering Systems Answer Visual Knowledge Questions? [7.442099405543527]
We observe that many visual questions that contain deictic referential phrases referring to entities in the image can be rewritten as "non-grounded" questions.
This allows for the reuse of existing text-based Open Domain Question Answering (QA) Systems for visual question answering.
We propose a potentially data-efficient approach that reuses existing systems for (a) image analysis, (b) question rewriting, and (c) text-based question answering to answer such visual questions (a toy rewriting sketch follows this entry).
arXiv Detail & Related papers (2022-02-09T06:47:40Z)
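The entry above outlines a three-stage pipeline: image analysis, question rewriting, and text-based QA. The toy sketch below illustrates only the rewriting step, with made-up helper names and a hard-coded detection result; it is not the cited authors' system.

```python
# Toy illustration of rewriting a grounded visual question into a
# "non-grounded" text question (hypothetical; not the cited paper's code).
import re

# Match simple deictic phrases such as "this animal" or "that object".
DEICTIC = re.compile(r"\b(this|that|these|those)\s+(animal|object|fruit|vehicle|thing)\b",
                     re.IGNORECASE)

def rewrite_question(question: str, detected_entity: str) -> str:
    """Replace the deictic phrase with the entity found by image analysis."""
    return DEICTIC.sub(f"a {detected_entity}", question)

detected = "giraffe"                                  # pretend output of an image analyzer
q = "What does this animal mainly eat?"
print(rewrite_question(q, detected))                  # -> "What does a giraffe mainly eat?"
# The rewritten question can then be sent to any text-based open-domain QA system.
```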
- Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate question-answer pairs based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z)