CapWAP: Captioning with a Purpose
- URL: http://arxiv.org/abs/2011.04264v1
- Date: Mon, 9 Nov 2020 09:23:55 GMT
- Title: CapWAP: Captioning with a Purpose
- Authors: Adam Fisch, Kenton Lee, Ming-Wei Chang, Jonathan H. Clark, Regina
Barzilay
- Abstract summary: We propose a new task, Captioning with a Purpose (CapWAP).
Our goal is to develop systems that can be tailored to be useful for the information needs of an intended population.
We show that it is possible to use reinforcement learning to directly optimize for the intended information need.
- Score: 56.99405135645775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The traditional image captioning task uses generic reference captions to
provide textual information about images. Different user populations, however,
will care about different visual aspects of images. In this paper, we propose a
new task, Captioning with a Purpose (CapWAP). Our goal is to develop systems
that can be tailored to be useful for the information needs of an intended
population, rather than merely provide generic information about an image. In
this task, we use question-answer (QA) pairs---a natural expression of
information need---from users, instead of reference captions, for both training
and post-inference evaluation. We show that it is possible to use reinforcement
learning to directly optimize for the intended information need, by rewarding
outputs that allow a question answering model to provide correct answers to
sampled user questions. We convert several visual question answering datasets
into CapWAP datasets, and demonstrate that under a variety of scenarios our
purposeful captioning system learns to anticipate and fulfill specific
information needs better than its generic counterparts, as measured by QA
performance on user questions from unseen images, when using the caption alone
as context.
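The reward described in the abstract can be sketched minimally as follows. This is an illustrative reconstruction, not the authors' code: the helper names (`caption_reward`, `qa_model`) are hypothetical, and the idea is simply that a caption is scored by whether a QA model, reading the caption alone as context, answers sampled user questions correctly.

```python
from typing import Callable, List, Tuple

def caption_reward(
    caption: str,
    qa_pairs: List[Tuple[str, str]],
    qa_model: Callable[[str, str], str],
) -> float:
    """Fraction of sampled user (question, answer) pairs that the QA
    model answers correctly when given only the caption as context.
    In RL training this scalar would weight the log-likelihood of the
    sampled caption, REINFORCE-style:
        loss = -reward * sum(log p(token | prefix, image))
    """
    if not qa_pairs:
        return 0.0
    correct = sum(
        qa_model(question, caption).strip().lower() == answer.strip().lower()
        for question, answer in qa_pairs
    )
    return correct / len(qa_pairs)
```

Here `qa_model` stands in for a trained reading-comprehension model; the paper's actual reward shaping and optimization details are in the arXiv source.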
Related papers
- What Makes for Good Image Captions? [50.48589893443939]
Our framework posits that good image captions should balance three key aspects: being informationally sufficient, minimally redundant, and readily comprehensible by humans.
We introduce the Pyramid of Captions (PoCa) method, which generates enriched captions by integrating local and global visual information.
arXiv Detail & Related papers (2024-05-01T12:49:57Z)
- Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts [3.6064695344878093]
Visual question answering (VQA) is known as an AI-complete task, as it requires understanding, reasoning, and inference over both visual and language content.
This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline.
arXiv Detail & Related papers (2024-04-12T16:35:23Z)
- CommVQA: Situating Visual Question Answering in Communicative Contexts [16.180130883242672]
We introduce CommVQA, a dataset consisting of images, image descriptions, and real-world communicative scenarios where the image might appear.
We show that access to contextual information is essential for solving CommVQA; the highest-performing VQA model benefits from it.
arXiv Detail & Related papers (2024-02-22T22:31:39Z)
- Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts [54.072432123447854]
Visual question answering (VQA) is the task of answering questions about an image.
Answering the question requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image.
We propose a framework that uses language guidance (LG) in the form of rationales, image captions, scene graphs, etc to answer questions more accurately.
arXiv Detail & Related papers (2023-10-31T03:54:11Z)
- Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images! [103.09776737512077]
SelTDA (Self-Taught Data Augmentation) is a strategy for finetuning large vision language models on small-scale VQA datasets.
It generates question-answer pseudolabels directly conditioned on an image, allowing us to pseudolabel unlabeled images.
We describe a series of experiments showing that our self-taught data augmentation increases robustness to adversarially searched questions.
arXiv Detail & Related papers (2023-06-06T18:00:47Z)
- PromptCap: Prompt-Guided Task-Aware Image Captioning [118.39243917422492]
We propose PromptCap, a captioning model designed to serve as a better connector between images and black-box LMs.
PromptCap takes a natural-language prompt that controls which visual entities to describe in the generated caption.
We demonstrate PromptCap's effectiveness on an existing pipeline in which GPT-3 is prompted with image captions to carry out VQA.
arXiv Detail & Related papers (2022-11-15T19:07:53Z)
- MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding [131.8797942031366]
We present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text.
Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question.
We introduce a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, to automatically augment data that can provide weak supervision for this task.
arXiv Detail & Related papers (2021-12-20T18:23:30Z)
- Understanding Guided Image Captioning Performance across Domains [22.283016988026926]
We present a method to control the concepts that an image caption should focus on, using an additional input called the guiding text.
Our human-evaluation results indicate that attempting in-the-wild guided image captioning requires access to large, unrestricted-domain training datasets.
arXiv Detail & Related papers (2020-12-04T00:05:02Z)
- Pragmatic Issue-Sensitive Image Captioning [11.998287522410404]
We propose Issue-Sensitive Image Captioning (ISIC).
In ISIC, a captioning system is given a target image and an issue: a set of images partitioned in a way that specifies what information is relevant.
We show how ISIC can complement and enrich the related task of Visual Question Answering.
arXiv Detail & Related papers (2020-04-29T20:00:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.