DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World
- URL: http://arxiv.org/abs/2506.24102v1
- Date: Mon, 30 Jun 2025 17:51:25 GMT
- Title: DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World
- Authors: Xiangtai Li, Tao Zhang, Yanwei Li, Haobo Yuan, Shihao Chen, Yikang Zhou, Jiahao Meng, Yueyi Sun, Shilin Xu, Lu Qi, Tianheng Cheng, Yi Lin, Zilong Huang, Wenhao Huang, Jiashi Feng, Guang Shi
- Abstract summary: We present DenseWorld-1M, the first massive, detailed, dense grounded caption dataset in the real world. We design a three-stage labeling pipeline, containing open-world perception, detailed object caption generation, and dense caption merging. To accelerate the labeling process and improve caption quality, we present two VLM models: the Detailed Region Caption model and the Spatial Caption Merging model.
- Score: 68.39362698871503
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) demonstrate a complex understanding of scenes, benefiting from large-scale and high-quality datasets. Most existing caption datasets lack grounded locations and relations for visual entities. The few grounded caption datasets that exist suffer from missing detailed descriptions, missing relations, and a lack of dense object descriptions on high-resolution images. To fill this gap for the community, we present DenseWorld-1M, the first massive, detailed, dense grounded caption dataset in the real world. We design a three-stage labeling pipeline, containing open-world perception, detailed object caption generation, and dense caption merging. The first stage obtains entity-level masks and labels. The second stage generates object-level, detailed captions with the guidance of masks and labels from the first stage. The final stage merges object captions and masks into spatial and relational dense captions. To accelerate the labeling process and improve caption quality, we present two VLM models: the Detailed Region Caption model and the Spatial Caption Merging model. Extensive experiments on various settings, including vision-language understanding, visual grounding, and region caption generation, demonstrate the effectiveness of our DenseWorld-1M dataset and labeling models.
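The three-stage pipeline described above can be sketched as a simple data flow. This is a minimal illustrative sketch only: every function and name below (`open_world_perception`, `detailed_region_caption`, `spatial_caption_merging`, the `Entity` record) is a hypothetical placeholder standing in for the paper's models, not the authors' actual code or API.

```python
from dataclasses import dataclass

@dataclass
class Entity:
    """One visual entity from Stage 1: a segmentation mask plus a class label."""
    mask: str          # placeholder for a real segmentation mask
    label: str
    caption: str = ""  # filled in by Stage 2

def open_world_perception(image):
    # Stage 1 (hypothetical stub): produce entity-level masks and labels.
    return [Entity(mask="mask_0", label="dog"),
            Entity(mask="mask_1", label="frisbee")]

def detailed_region_caption(image, entity):
    # Stage 2 (hypothetical stub): the Detailed Region Caption model would
    # generate an object-level caption guided by the Stage-1 mask and label.
    return f"A {entity.label} located at {entity.mask}."

def spatial_caption_merging(image, entities):
    # Stage 3 (hypothetical stub): the Spatial Caption Merging model would
    # merge per-object captions and masks into one dense, relational caption.
    return " ".join(e.caption for e in entities)

def label_image(image):
    entities = open_world_perception(image)                 # Stage 1
    for e in entities:
        e.caption = detailed_region_caption(image, e)       # Stage 2
    return spatial_caption_merging(image, entities)         # Stage 3

print(label_image("example.jpg"))
```

The key structural point the sketch captures is the strict ordering: Stage 2 consumes Stage 1's masks and labels as guidance, and Stage 3 operates on the full set of per-object captions rather than on the raw image alone.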
Related papers
- Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos [53.723410664944566]
We present Perceive Anything Model (PAM), a framework for comprehensive region-level visual understanding in images and videos. Our approach extends the powerful segmentation model SAM 2 by integrating Large Language Models (LLMs), enabling simultaneous object segmentation. A key component, Semantic Perceiver, is introduced to efficiently transform SAM 2's rich visual features into multi-modal tokens.
arXiv Detail & Related papers (2025-06-05T17:51:39Z) - Describe Anything: Detailed Localized Image and Video Captioning [89.37016119012068]
We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). We propose a Semi-supervised learning (SSL)-based Data Pipeline (DLC-SDP) to tackle the scarcity of high-quality DLC data. DAM sets new state-of-the-art on 7 benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.
arXiv Detail & Related papers (2025-04-22T17:51:41Z) - URECA: Unique Region Caption Anything [29.363967361960043]
Region-level captioning aims to generate natural language descriptions for specific image regions while highlighting their distinguishing features. We introduce the URECA dataset, a large-scale dataset tailored for multi-granularity region captioning. We also present URECA, a novel captioning model designed to effectively encode multi-granularity regions.
arXiv Detail & Related papers (2025-04-07T17:59:44Z) - LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models [44.578308186225826]
Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. We show that co-training an open-vocabulary detector with a large language model that generates image-level detailed captions for each image can further improve performance.
arXiv Detail & Related papers (2025-01-31T08:27:31Z) - Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning [77.2852342808769]
In this paper, we introduce a detailed caption benchmark, termed CompreCap, to evaluate visual context from a directed scene graph view. We first manually segment the image into semantically meaningful regions according to a common-object vocabulary, while also distinguishing attributes of objects within all those regions. Directional relation labels of these objects are then annotated to compose a directed scene graph that encodes the rich compositional information of the image.
arXiv Detail & Related papers (2024-12-11T18:37:42Z) - BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions [118.35194230865451]
We introduce BLIP3-KALE, a dataset of 218 million image-text pairs.
KALE augments synthetic dense image captions with web-scale alt-text to generate factually grounded image captions.
We train vision-language models on KALE and demonstrate improvements on vision-language tasks.
arXiv Detail & Related papers (2024-11-12T00:52:52Z) - Semantic Alignment for Multimodal Large Language Models [72.10272479476161]
We introduce Semantic Alignment for Multi-modal large language models (SAM).
By involving the bidirectional semantic guidance between different images in the visual-token extraction process, SAM aims to enhance the preservation of linking information for coherent analysis.
arXiv Detail & Related papers (2024-08-23T06:48:46Z) - CapOnImage: Context-driven Dense-Captioning on Image [13.604173177437536]
We introduce a new task called captioning on image (CapOnImage), which aims to generate dense captions at different locations of the image based on contextual information.
We propose a multi-modal pre-training model with multi-level pre-training tasks that progressively learn the correspondence between texts and image locations.
Compared with other image captioning model variants, our model achieves the best results in both captioning accuracy and diversity aspects.
arXiv Detail & Related papers (2022-04-27T14:40:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.