PromptCap: Prompt-Guided Task-Aware Image Captioning
- URL: http://arxiv.org/abs/2211.09699v4
- Date: Thu, 17 Aug 2023 21:43:24 GMT
- Title: PromptCap: Prompt-Guided Task-Aware Image Captioning
- Authors: Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A. Smith, Jiebo Luo
- Abstract summary: We propose PromptCap, a captioning model designed to serve as a better connector between images and black-box LMs.
PromptCap takes a natural-language prompt to control the visual entities to describe in the generated caption.
We demonstrate PromptCap's effectiveness on an existing pipeline in which GPT-3 is prompted with image captions to carry out VQA.
- Score: 118.39243917422492
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge-based visual question answering (VQA) involves questions that
require world knowledge beyond the image to yield the correct answer. Large
language models (LMs) like GPT-3 are particularly helpful for this task because
of their strong knowledge retrieval and reasoning capabilities. To enable LMs to
understand images, prior work uses a captioning model to convert images into
text. However, when an image is summarized in a single caption sentence, which
visual entities to describe is often underspecified. Generic image captions
often miss visual details essential for the LM to answer visual questions
correctly. To address this challenge, we propose PromptCap (Prompt-guided image
Captioning), a captioning model designed to serve as a better connector between
images and black-box LMs. Different from generic captions, PromptCap takes a
natural-language prompt to control the visual entities to describe in the
generated caption. The prompt contains a question that the caption should aid
in answering. To avoid extra annotation, PromptCap is trained by examples
synthesized with GPT-3 and existing datasets. We demonstrate PromptCap's
effectiveness on an existing pipeline in which GPT-3 is prompted with image
captions to carry out VQA. PromptCap outperforms generic captions by a large
margin and achieves state-of-the-art accuracy on knowledge-based VQA tasks
(60.4% on OK-VQA and 59.6% on A-OKVQA). Zero-shot results on WebQA show that
PromptCap generalizes well to unseen domains.
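The pipeline the abstract describes reduces to two calls: a prompt-guided captioner turns an (image, question) pair into a question-aware caption, and a black-box LM answers from that caption. Below is a minimal Python sketch of this flow; the `captioner` and `lm_complete` callables are hypothetical stand-ins, not the released PromptCap API.

```python
# Minimal sketch of the caption-then-prompt VQA pipeline from the abstract.
# `captioner` and `lm_complete` are hypothetical stand-ins for a
# prompt-guided captioning model and a black-box LM endpoint.

def answer_with_promptcap(image, question, captioner, lm_complete):
    # 1. The natural-language prompt tells the captioner which visual
    #    entities matter, so the caption covers what the question asks.
    caption = captioner(
        image=image,
        prompt=f"Please describe this image according to the question: {question}",
    )

    # 2. The caption stands in for the image in a text-only prompt.
    #    (Few-shot in-context examples would be prepended here.)
    lm_prompt = f"Context: {caption}\nQuestion: {question}\nAnswer:"

    # 3. The LM combines the caption with its world knowledge.
    return lm_complete(lm_prompt).strip()
```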
Related papers
- SADL: An Effective In-Context Learning Method for Compositional Visual QA [22.0603596548686]
Large vision-language models (LVLMs) offer a novel capability for performing in-context learning (ICL) in Visual QA.
This paper introduces SADL, a new visual-linguistic prompting framework for the task.
arXiv Detail & Related papers (2024-07-02T06:41:39Z)
- Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts [3.6064695344878093]
Visual question answering (VQA) is known as an AI-complete task, as it requires understanding, reasoning, and inference over both visual and language content.
This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline.
arXiv Detail & Related papers (2024-04-12T16:35:23Z)
- FlexCap: Generating Rich, Localized, and Flexible Captions in Images [54.796523366320486]
We introduce a versatile flexible-captioning vision-language model (VLM) capable of generating region-specific descriptions of varying lengths.
The model, FlexCap, is trained to produce length-conditioned captions for input bounding boxes.
This allows control over the information density of its output, with descriptions ranging from concise object labels to detailed captions.
arXiv Detail & Related papers (2024-03-18T17:57:02Z)
- Pro-Cap: Leveraging a Frozen Vision-Language Model for Hateful Meme Detection [17.182722268446604]
We propose a probing-based captioning approach to leverage pre-trained vision-language models (PVLMs) in a zero-shot visual question answering (VQA) manner.
Specifically, we prompt a frozen PVLM by asking hateful content-related questions and use the answers as image captions.
The good performance of models with Pro-Cap on three benchmarks validates the effectiveness and generalization of the proposed method.
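The probing idea lends itself to a short sketch: ask the frozen PVLM a fixed set of content-related questions in zero-shot VQA mode and join the answers into a caption. The question list and `vqa_model` interface below are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of probing-based captioning: each probing question elicits one
# facet of the image from a frozen PVLM in zero-shot VQA mode, and the
# answers are joined into a task-aware caption. The question list and
# `vqa_model` interface are illustrative, not the paper's exact setup.

PROBING_QUESTIONS = [
    "What is shown in the image?",
    "Is there any person in the image?",
    "What is the person in the image doing?",
    # ... further hateful-content-related probes would go here
]

def pro_cap_caption(image, vqa_model):
    answers = [vqa_model(image=image, question=q) for q in PROBING_QUESTIONS]
    return " ".join(answers)
```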
arXiv Detail & Related papers (2023-08-16T01:38:49Z)
- DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training [73.74291217502928]
We propose a simple framework, named DeCap, for zero-shot captioning.
We introduce a lightweight visual-aware language decoder.
We project the visual embedding into the CLIP text embedding space so that the projected embedding retains the information of the visual input.
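One plausible reading of this projection step is to express the image embedding as a similarity-weighted combination of a support set of CLIP text embeddings, so the result lies in the text embedding space the decoder was trained on. The sketch below illustrates that reading under stated assumptions; it is not the paper's exact procedure.

```python
# Assumed setup: T is an (N, d) support set of CLIP text embeddings and
# v is a (d,) CLIP image embedding. The softmax-weighted combination of
# support texts approximates v inside the CLIP text embedding space.
import numpy as np

def project_to_text_space(v, T, temperature=0.1):
    # cosine similarities between the image embedding and each support text
    T_n = T / np.linalg.norm(T, axis=1, keepdims=True)
    v_n = v / np.linalg.norm(v)
    sims = T_n @ v_n

    # softmax weights over the support set
    w = np.exp(sims / temperature)
    w /= w.sum()

    # weighted combination lies in the span of the text embeddings
    return w @ T
```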
arXiv Detail & Related papers (2023-03-06T11:02:47Z)
- An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA [51.639880603821446]
We propose PICa, a simple yet effective method that Prompts GPT-3 via the use of Image Captions for knowledge-based VQA.
We first convert the image into captions (or tags) that GPT-3 can understand, then adapt GPT-3 to solve the VQA task in a few-shot manner.
By using only 16 examples, PICa surpasses the supervised state of the art by an absolute +8.6 points on the OK-VQA dataset.
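A sketch of the few-shot prompt format such a caption-based pipeline might use is shown below; the template wording and example structure are assumptions, not PICa's published prompt.

```python
# Sketch of a PICa-style few-shot prompt: each in-context example pairs a
# caption (the image surrogate) with a question and gold answer; the test
# example is appended with the answer left blank for GPT-3 to complete.
# The template wording is an assumption, not the paper's exact prompt.

def build_fewshot_prompt(examples, test_caption, test_question):
    header = "Please answer the question according to the context.\n\n"
    shots = "".join(
        f"Context: {ex['caption']}\n"
        f"Question: {ex['question']}\n"
        f"Answer: {ex['answer']}\n\n"
        for ex in examples  # e.g., 16 in-context examples
    )
    test = f"Context: {test_caption}\nQuestion: {test_question}\nAnswer:"
    return header + shots + test
```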
arXiv Detail & Related papers (2021-09-10T17:51:06Z)
- Structural and Functional Decomposition for Personality Image Captioning in a Communication Game [53.74847926974122]
Personality image captioning (PIC) aims to describe an image with a natural language caption given a personality trait.
We introduce a novel formulation for PIC based on a communication game between a speaker and a listener.
arXiv Detail & Related papers (2020-11-17T10:19:27Z)
- CapWAP: Captioning with a Purpose [56.99405135645775]
We propose a new task, Captioning with a Purpose (CapWAP).
Our goal is to develop systems that can be tailored to be useful for the information needs of an intended population.
We show that it is possible to use reinforcement learning to directly optimize for the intended information need.
arXiv Detail & Related papers (2020-11-09T09:23:55Z)
- Pragmatic Issue-Sensitive Image Captioning [11.998287522410404]
We propose Issue-Sensitive Image Captioning (ISIC).
ISIC is a captioning task in which the system is given a target image and an issue: a set of images partitioned in a way that specifies what information is relevant.
We show how ISIC can complement and enrich the related task of Visual Question Answering.
arXiv Detail & Related papers (2020-04-29T20:00:53Z)