FlexCap: Describe Anything in Images in Controllable Detail
- URL: http://arxiv.org/abs/2403.12026v2
- Date: Tue, 28 Jan 2025 23:14:21 GMT
- Title: FlexCap: Describe Anything in Images in Controllable Detail
- Authors: Debidatta Dwibedi, Vidhi Jain, Jonathan Tompson, Andrew Zisserman, Yusuf Aytar
- Abstract summary: We introduce FlexCap, a vision-language model that generates region-specific descriptions of varying lengths.
Our experiments illustrate FlexCap's utility for tasks including image labeling, object attribute recognition, and visual dialog.
- Score: 54.796523366320486
- Abstract: We introduce FlexCap, a vision-language model that generates region-specific descriptions of varying lengths. FlexCap is trained to produce length-conditioned captions for input boxes, enabling control over information density, with descriptions ranging from concise object labels to detailed captions. To achieve this, we create large-scale training datasets of image region descriptions with varying lengths from captioned web images. We demonstrate FlexCap's effectiveness in several applications: first, it achieves strong performance in dense captioning tasks on the Visual Genome dataset. Second, we show how FlexCap's localized descriptions can serve as input to a large language model to create a visual question answering (VQA) system, achieving state-of-the-art zero-shot performance on multiple VQA benchmarks. Our experiments illustrate FlexCap's utility for tasks including image labeling, object attribute recognition, and visual dialog. Project webpage: https://flex-cap.github.io .
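To make the two pieces of the abstract concrete, below is a minimal Python sketch of (a) length-conditioned region captioning and (b) feeding the localized descriptions to a large language model for zero-shot VQA. The `model.generate_caption` and `llm.complete` calls are hypothetical placeholder interfaces assumed for illustration, not FlexCap's released API.

```python
# Hypothetical sketch of length-conditioned region captioning plus an
# LLM-based VQA pipeline; generate_caption and complete are placeholder
# interfaces, not the authors' released API.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels

@dataclass
class RegionCaption:
    box: Box
    length: int  # requested caption length in words (the control signal)
    text: str    # generated description

def describe_regions(model, image, boxes: List[Box], lengths: List[int]) -> List[RegionCaption]:
    """Ask the captioner for one description per box, conditioned on a target length."""
    captions = []
    for box, n_words in zip(boxes, lengths):
        text = model.generate_caption(image, box=box, target_length=n_words)
        captions.append(RegionCaption(box=box, length=n_words, text=text))
    return captions

def localized_vqa(llm, captions: List[RegionCaption], question: str) -> str:
    """Zero-shot VQA: concatenate localized descriptions into a text prompt for an LLM."""
    context = "\n".join(f"Region {c.box}: {c.text}" for c in captions)
    prompt = f"Image regions:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm.complete(prompt)
```

Calling `describe_regions` on the same box with increasing target lengths is where the controllable detail would surface in practice: short targets behave like object labels, longer ones like dense captions.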
Related papers
- Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning [77.2852342808769]
In this paper, we introduce a detailed caption benchmark, termed CompreCap, to evaluate the visual context from a directed scene graph view.
We first manually segment the image into semantically meaningful regions according to a common-object vocabulary, while also distinguishing attributes of objects within all those regions.
Then directional relation labels of these objects are annotated to compose a directed scene graph that encodes the rich compositional information of the image.
arXiv Detail & Related papers (2024-12-11T18:37:42Z)
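As a rough illustration of the annotation structure the CompreCap entry describes, a directed scene graph can be stored as attributed object nodes plus directed relation edges. The class and field names below are assumptions for clarity, not CompreCap's published schema.

```python
# Illustrative data structure for a directed scene graph over image regions;
# class and field names are assumptions, not CompreCap's release format.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ObjectNode:
    name: str                                            # common-object label, e.g. "horse"
    region_id: int                                       # id of the segmented region
    attributes: List[str] = field(default_factory=list)  # e.g. ["brown", "large"]

@dataclass
class RelationEdge:
    subject: int    # index of the source object node
    predicate: str  # directional relation label, e.g. "riding"
    obj: int        # index of the target object node

@dataclass
class DirectedSceneGraph:
    objects: List[ObjectNode]
    relations: List[RelationEdge]

graph = DirectedSceneGraph(
    objects=[ObjectNode("person", 0, ["young"]), ObjectNode("horse", 1, ["brown"])],
    relations=[RelationEdge(subject=0, predicate="riding", obj=1)],
)
```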
- FLAIR: VLM with Fine-grained Language-informed Image Representations [49.2684130383925]
FLAIR is an approach that utilizes long and detailed image descriptions to learn localized image embeddings.
Our experiments demonstrate the effectiveness of FLAIR trained on 30M image-text pairs in capturing fine-grained visual information.
arXiv Detail & Related papers (2024-12-04T18:56:04Z)
- VIXEN: Visual Text Comparison Network for Image Difference Captioning [58.16313862434814]
We present VIXEN, a technique that succinctly summarizes in text the visual differences between a pair of images.
Our proposed network linearly maps image features in a pairwise manner, constructing a soft prompt for a pretrained large language model.
arXiv Detail & Related papers (2024-02-29T12:56:18Z)
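A minimal PyTorch-style sketch of the mechanism the VIXEN summary describes: features from the two images are mapped pairwise by a linear layer into a sequence of soft-prompt embeddings for a frozen language model. Dimensions and module names are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative sketch of pairwise linear mapping of image features into a
# soft prompt for a frozen LLM; shapes and names are assumptions, not
# VIXEN's exact configuration.
import torch
import torch.nn as nn

class PairwiseSoftPrompt(nn.Module):
    def __init__(self, feat_dim: int = 1024, llm_dim: int = 2048, prompt_len: int = 8):
        super().__init__()
        self.prompt_len = prompt_len
        self.llm_dim = llm_dim
        # Linear map from concatenated per-image features to prompt embeddings.
        self.proj = nn.Linear(2 * feat_dim, prompt_len * llm_dim)

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
        # feats_a, feats_b: (batch, feat_dim) pooled features of the two images.
        pair = torch.cat([feats_a, feats_b], dim=-1)            # (batch, 2*feat_dim)
        prompt = self.proj(pair)                                # (batch, prompt_len*llm_dim)
        return prompt.view(-1, self.prompt_len, self.llm_dim)   # soft prompt tokens

soft_prompt = PairwiseSoftPrompt()(torch.randn(2, 1024), torch.randn(2, 1024))
print(soft_prompt.shape)  # torch.Size([2, 8, 2048])
```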
- PromptCap: Prompt-Guided Task-Aware Image Captioning [118.39243917422492]
We propose PromptCap, a captioning model designed to serve as a better connector between images and black-box LMs.
PromptCap takes a natural-language prompt to control which visual entities to describe in the generated caption.
We demonstrate PromptCap's effectiveness in an existing pipeline in which GPT-3 is prompted with image captions to carry out VQA.
arXiv Detail & Related papers (2022-11-15T19:07:53Z)
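A short sketch of the pipeline that summary refers to: a prompt-guided captioner produces a question-aware caption, which is then placed in a text prompt for a black-box LLM such as GPT-3. `captioner.generate` and `llm.complete` are assumed placeholder interfaces, not PromptCap's actual code.

```python
# Hypothetical sketch of prompt-guided captioning feeding a black-box LLM for
# VQA; captioner.generate and llm.complete are placeholder interfaces.
def prompt_guided_vqa(captioner, llm, image, question: str) -> str:
    # The question itself is used as the prompt, so the caption focuses on the
    # visual entities the question asks about.
    caption = captioner.generate(image, prompt=question)
    vqa_prompt = (
        f"Caption: {caption}\n"
        f"Question: {question}\n"
        f"Answer:"
    )
    return llm.complete(vqa_prompt)
```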
- CapOnImage: Context-driven Dense-Captioning on Image [13.604173177437536]
We introduce a new task called captioning on image (CapOnImage), which aims to generate dense captions at different locations of the image based on contextual information.
We propose a multi-modal pre-training model with multi-level pre-training tasks that progressively learn the correspondence between texts and image locations.
Compared with other image captioning model variants, our model achieves the best results in both captioning accuracy and diversity.
arXiv Detail & Related papers (2022-04-27T14:40:31Z)
- Question-controlled Text-aware Image Captioning [41.53906032024941]
Question-controlled Text-aware Image Captioning (Qc-TextCap) is a new challenging task.
GQAM generates a personalized text-aware caption with a Multimodal Decoder.
With questions as control signals, our model generates more informative and diverse captions than the state-of-the-art text-aware captioning model.
arXiv Detail & Related papers (2021-08-04T13:34:54Z)
- CapWAP: Captioning with a Purpose [56.99405135645775]
We propose a new task, Captioning with a Purpose (CapWAP).
Our goal is to develop systems that can be tailored to be useful for the information needs of an intended population.
We show that it is possible to use reinforcement learning to directly optimize for the intended information need.
arXiv Detail & Related papers (2020-11-09T09:23:55Z)
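The CapWAP entry above mentions using reinforcement learning to optimize captions for an intended information need. The sketch below shows one plausible instantiation (an assumption, not necessarily the paper's exact setup): a REINFORCE-style loss whose reward is whether a reader QA model can answer user questions from the generated caption. `captioner.sample` and `qa_model.answer` are hypothetical interfaces.

```python
# REINFORCE-style sketch: reward a sampled caption by whether an assumed QA
# model can answer the user's questions from it. This illustrates one way to
# "directly optimize for the intended information need"; it is not claimed to
# be CapWAP's exact training recipe.
def reinforce_caption_loss(captioner, qa_model, image, questions, answers):
    # Sample a caption and keep the total log-probability of the sampled
    # tokens (assumed to be returned as a differentiable scalar).
    caption, log_prob = captioner.sample(image)
    # Reward: fraction of questions the QA model answers correctly using only
    # the generated caption as context.
    correct = sum(
        qa_model.answer(context=caption, question=q) == a
        for q, a in zip(questions, answers)
    )
    reward = correct / max(len(questions), 1)
    # REINFORCE: increase the likelihood of captions that earn high reward.
    return -reward * log_prob
```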
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences of its use.