Describe Anything: Detailed Localized Image and Video Captioning
- URL: http://arxiv.org/abs/2504.16072v1
- Date: Tue, 22 Apr 2025 17:51:41 GMT
- Title: Describe Anything: Detailed Localized Image and Video Captioning
- Authors: Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, Yin Cui
- Abstract summary: We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). We propose a Semi-supervised learning (SSL)-based Data Pipeline (DLC-SDP) to tackle the scarcity of high-quality DLC data. DAM sets new state-of-the-art on 7 benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.
- Score: 89.37016119012068
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating detailed and accurate descriptions for specific regions in images and videos remains a fundamental challenge for vision-language models. We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). DAM preserves both local details and global context through two key innovations: a focal prompt, which ensures high-resolution encoding of targeted regions, and a localized vision backbone, which integrates precise localization with its broader context. To tackle the scarcity of high-quality DLC data, we propose a Semi-supervised learning (SSL)-based Data Pipeline (DLC-SDP). DLC-SDP starts with existing segmentation datasets and expands to unlabeled web images using SSL. We introduce DLC-Bench, a benchmark designed to evaluate DLC without relying on reference captions. DAM sets new state-of-the-art on 7 benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.
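To make the focal-prompt idea from the abstract concrete, below is a minimal illustrative sketch: the target region is cropped with extra surrounding context and encoded at high resolution alongside the full image, and the two feature sets are fused. The function names (`focal_crop`, `encode`, `describe_region_features`), the expansion factor, and the placeholder pooling encoder are assumptions for illustration only, not DAM's actual implementation.
```python
# Sketch of the focal-prompt intuition: encode the full image for global
# context and a high-resolution crop around the target region for local
# detail. Encoder, expansion factor, and fusion are illustrative assumptions.
import numpy as np

def focal_crop(image: np.ndarray, mask: np.ndarray, expand: float = 0.5) -> np.ndarray:
    """Crop a box around the masked region, enlarged by `expand` on each side."""
    ys, xs = np.nonzero(mask)
    y0, y1, x0, x1 = ys.min(), ys.max(), xs.min(), xs.max()
    h, w = y1 - y0 + 1, x1 - x0 + 1
    pad_y, pad_x = int(h * expand), int(w * expand)
    y0, y1 = max(0, y0 - pad_y), min(image.shape[0], y1 + pad_y + 1)
    x0, x1 = max(0, x0 - pad_x), min(image.shape[1], x1 + pad_x + 1)
    return image[y0:y1, x0:x1]

def encode(patch: np.ndarray) -> np.ndarray:
    """Stand-in for a vision backbone: returns a per-patch feature vector."""
    return patch.reshape(-1, patch.shape[-1]).mean(axis=0)  # placeholder pooling

def describe_region_features(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Concatenate global-context and focal (local-detail) features."""
    global_feat = encode(image)                    # broader scene context
    local_feat = encode(focal_crop(image, mask))   # high-resolution target region
    return np.concatenate([global_feat, local_feat])

# Toy usage: a random RGB image with a rectangular target region.
img = np.random.rand(256, 256, 3)
msk = np.zeros((256, 256), dtype=bool)
msk[80:140, 60:120] = True
print(describe_region_features(img, msk).shape)  # (6,) with this toy encoder
```
In the paper's framing, the fused features would condition a language model so that captions stay grounded in the specified region without losing scene-level context.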
Related papers
- URECA: Unique Region Caption Anything [29.363967361960043]
Region-level captioning aims to generate natural language descriptions for specific image regions while highlighting their distinguishing features.
We introduce URECA dataset, a large-scale dataset tailored for multi-granularity region captioning.
We present URECA, a novel captioning model designed to effectively encode multi-granularity regions.
arXiv Detail & Related papers (2025-04-07T17:59:44Z)
- The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning [89.64905703368255]
We propose a novel progressive multi-granularity textual prompting strategy for zero-shot video captioning.
Our approach constructs three distinct memory banks, encompassing noun phrases, scene graphs of noun phrases, and entire sentences.
arXiv Detail & Related papers (2025-03-31T03:00:19Z)
- Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning [77.2852342808769]
In this paper, we introduce a detailed caption benchmark, termed CompreCap, to evaluate the visual context from a directed scene graph view.
We first manually segment the image into semantically meaningful regions according to a common-object vocabulary, while also distinguishing attributes of objects within all those regions.
Directional relation labels between these objects are then annotated to compose a directed scene graph that encodes the rich compositional information of the image.
arXiv Detail & Related papers (2024-12-11T18:37:42Z)
- Grounded Video Caption Generation [74.23767687855279]
We propose a new task, dataset and model for grounded video caption generation.
This task unifies captioning and object grounding in video, where the objects in the caption are grounded in the video via temporally consistent bounding boxes.
We introduce a new grounded video caption generation model, called VideoGround, and train the model on the new automatically annotated HowToGround dataset.
arXiv Detail & Related papers (2024-11-12T06:44:24Z)
- BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions [118.35194230865451]
We introduce BLIP3-KALE, a dataset of 218 million image-text pairs.
KALE augments synthetic dense image captions with web-scale alt-text to generate factually grounded image captions.
We train vision-language models on KALE and demonstrate improvements on vision-language tasks.
arXiv Detail & Related papers (2024-11-12T00:52:52Z)
- Do More Details Always Introduce More Hallucinations in LVLM-based Image Captioning? [29.237078890377514]
Large Vision-Language Models (LVLMs) excel in integrating visual and linguistic contexts to produce detailed content.
Generating descriptions with LVLMs, however, often faces the challenge of object hallucination (OH), where the output text misrepresents actual objects in the input image.
This paper proposes a novel decoding strategy, Differentiated Beam Decoding (DBD), along with a reliable new set of evaluation metrics.
arXiv Detail & Related papers (2024-06-18T14:33:56Z)
- CapOnImage: Context-driven Dense-Captioning on Image [13.604173177437536]
We introduce a new task called captioning on image (CapOnImage), which aims to generate dense captions at different locations of the image based on contextual information.
We propose a multi-modal pre-training model with multi-level pre-training tasks that progressively learn the correspondence between texts and image locations.
Compared with other image captioning model variants, our model achieves the best results in both captioning accuracy and diversity.
arXiv Detail & Related papers (2022-04-27T14:40:31Z)
- QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z)