FlexCap: Generating Rich, Localized, and Flexible Captions in Images
- URL: http://arxiv.org/abs/2403.12026v1
- Date: Mon, 18 Mar 2024 17:57:02 GMT
- Title: FlexCap: Generating Rich, Localized, and Flexible Captions in Images
- Authors: Debidatta Dwibedi, Vidhi Jain, Jonathan Tompson, Andrew Zisserman, Yusuf Aytar
- Abstract summary: We introduce a versatile $\textit{flexible-captioning}$ vision-language model (VLM) capable of generating region-specific descriptions of varying lengths.
The model, FlexCap, is trained to produce length-conditioned captions for input bounding boxes.
This allows control over the information density of its output, with descriptions ranging from concise object labels to detailed captions.
- Score: 54.796523366320486
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a versatile $\textit{flexible-captioning}$ vision-language model (VLM) capable of generating region-specific descriptions of varying lengths. The model, FlexCap, is trained to produce length-conditioned captions for input bounding boxes, and this allows control over the information density of its output, with descriptions ranging from concise object labels to detailed captions. To achieve this, we create large-scale training datasets of image region descriptions of varying length, starting from captioned images. This flexible-captioning capability has several valuable applications. First, FlexCap demonstrates superior performance in dense captioning tasks on the Visual Genome dataset. Second, a visual question answering (VQA) system can be built by employing FlexCap to generate localized descriptions as inputs to a large language model. The resulting system achieves state-of-the-art zero-shot performance on a number of VQA datasets. We also demonstrate that a $\textit{localize-then-describe}$ approach with FlexCap can be better at open-ended object detection than a $\textit{describe-then-localize}$ approach with other VLMs. We highlight a novel characteristic of FlexCap, which is its ability to extract diverse visual information through prefix conditioning. Finally, we qualitatively demonstrate FlexCap's broad applicability in tasks such as image labeling, object attribute recognition, and visual dialog. Project webpage: https://flex-cap.github.io.
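To make the abstract's two main ideas concrete, the sketch below illustrates length-conditioned region captioning with prefix conditioning and the localize-then-describe VQA pipeline. It is a minimal sketch only, assuming hypothetical placeholders: `FlexCapModel`, `Box`, `propose_boxes`, and `ask_llm` are not a released FlexCap API.

```python
# Minimal sketch of the flexible-captioning interface described in the abstract.
# FlexCapModel, Box, propose_boxes, and ask_llm are hypothetical placeholders,
# not part of any released FlexCap package.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float


class FlexCapModel:
    """Stand-in for a length-conditioned region captioner."""

    def caption(self, image, box: Box, num_words: int, prefix: str = "") -> str:
        # The real model decodes a caption for the box, conditioned on the
        # requested length (concise label vs. detailed description) and an
        # optional text prefix (e.g. "the color of the shirt is") that steers
        # which visual information is extracted.
        raise NotImplementedError


def localize_then_describe_vqa(image, question: str, model: FlexCapModel,
                               propose_boxes: Callable, ask_llm: Callable) -> str:
    """Caption class-agnostic region proposals, then let an LLM answer."""
    boxes: List[Box] = list(propose_boxes(image))
    captions = [model.caption(image, b, num_words=8) for b in boxes]
    context = "\n".join(f"Region {i}: {c}" for i, c in enumerate(captions))
    prompt = f"Image regions:\n{context}\n\nQuestion: {question}\nAnswer:"
    return ask_llm(prompt)
```

Varying `num_words` trades concise object labels against detailed captions, which is the length conditioning the abstract refers to.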
Related papers
- FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMs [58.95386070800286]
FullAnno is a data engine that generates large-scale, high-quality, and fine-grained image annotations.
We re-annotated the COCO and Visual Genome datasets using our FullAnno system.
Experiments show that the regenerated annotation can significantly enhance the capabilities of LLaVA-v1.5 on several benchmarks.
arXiv Detail & Related papers (2024-09-20T14:33:17Z)
- MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning [108.12011636732674]
MultiCapCLIP can generate visual captions for different scenarios and languages without any labeled vision-caption pairs of downstream datasets.
Our method achieves 4.8% and 21.5% absolute improvements in terms of BLEU@4 and CIDEr metrics.
arXiv Detail & Related papers (2023-08-25T07:32:34Z)
- Transferable Decoding with Visual Entities for Zero-Shot Image Captioning [45.855652838621936]
ViECap is a transferable decoding model that generates descriptions in both seen and unseen scenarios.
ViECap incorporates entity-aware hard prompts to guide LLMs' attention toward the visual entities present in the image.
Our experiments demonstrate that ViECap sets a new state of the art in cross-domain (transferable) captioning.
arXiv Detail & Related papers (2023-07-31T09:47:06Z)
- VicTR: Video-conditioned Text Representations for Activity Recognition [73.09929391614266]
We argue that better video-VLMs can be designed by focusing more on augmenting text, rather than visual information.
We introduce Video-conditioned Text Representations (VicTR), a form of text embeddings optimized w.r.t. visual embeddings.
Our model can further make use of freely-available semantic information, in the form of visually-grounded auxiliary text.
arXiv Detail & Related papers (2023-04-05T16:30:36Z)
- PromptCap: Prompt-Guided Task-Aware Image Captioning [118.39243917422492]
We propose PromptCap, a captioning model designed to serve as a better connector between images and black-box LMs.
PromptCap takes a natural-language prompt to control the visual entities to describe in the generated caption.
We demonstrate PromptCap's effectiveness on an existing pipeline in which GPT-3 is prompted with image captions to carry out VQA.
arXiv Detail & Related papers (2022-11-15T19:07:53Z)
- CapOnImage: Context-driven Dense-Captioning on Image [13.604173177437536]
We introduce a new task called captioning on image (CapOnImage), which aims to generate dense captions at different locations of the image based on contextual information.
We propose a multi-modal pre-training model with multi-level pre-training tasks that progressively learn the correspondence between texts and image locations.
Compared with other image captioning model variants, our model achieves the best results in both captioning accuracy and diversity.
arXiv Detail & Related papers (2022-04-27T14:40:31Z)
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation [86.4572981982407]
We propose BLIP, a new vision-language framework which transfers flexibly to both vision-language understanding and generation tasks.
BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones.
BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
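As a rough illustration of this bootstrapping step (a captioner proposes synthetic captions and a filter discards noisy ones), the sketch below assumes hypothetical `captioner` and `filter_score` callables and an arbitrary threshold; it is not the released BLIP implementation.

```python
# Rough sketch of caption bootstrapping for noisy web data: keep web or
# synthetic captions only when a filter judges them to match the image.
# captioner, filter_score, and the threshold are assumptions for illustration.
def bootstrap_captions(web_pairs, captioner, filter_score, threshold=0.5):
    """Return cleaned (image, caption) pairs.

    web_pairs    : iterable of (image, noisy_web_caption)
    captioner    : image -> synthetic caption
    filter_score : (image, caption) -> image-text match score in [0, 1]
    """
    cleaned = []
    for image, web_caption in web_pairs:
        candidates = (web_caption, captioner(image))
        cleaned.extend((image, c) for c in candidates
                       if filter_score(image, c) >= threshold)
    return cleaned
```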
arXiv Detail & Related papers (2022-01-28T12:49:48Z)
- Variational Stacked Local Attention Networks for Diverse Video Captioning [2.492343817244558]
The Variational Stacked Local Attention Network (VSLAN) exploits low-rank bilinear pooling for self-attentive feature interaction.
We evaluate VSLAN on MSVD and MSR-VTT datasets in terms of syntax and diversity.
arXiv Detail & Related papers (2022-01-04T05:14:34Z)
- Question-controlled Text-aware Image Captioning [41.53906032024941]
Question-controlled Text-aware Image Captioning (Qc-TextCap) is a new challenging task.
With questions as control signals, our model generates more informative and diverse captions than the state-of-the-art text-aware captioning model.
The proposed model, GQAM, generates a personalized text-aware caption with a Multimodal Decoder.
arXiv Detail & Related papers (2021-08-04T13:34:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.