URECA: Unique Region Caption Anything
- URL: http://arxiv.org/abs/2504.05305v1
- Date: Mon, 07 Apr 2025 17:59:44 GMT
- Title: URECA: Unique Region Caption Anything
- Authors: Sangbeom Lim, Junwan Kim, Heeji Yoon, Jaewoo Jung, Seungryong Kim
- Abstract summary: Region-level captioning aims to generate natural language descriptions for specific image regions while highlighting their distinguishing features. We introduce the URECA dataset, a large-scale dataset tailored for multi-granularity region captioning. We also present URECA, a novel captioning model designed to effectively encode multi-granularity regions.
- Score: 29.363967361960043
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Region-level captioning aims to generate natural language descriptions for specific image regions while highlighting their distinguishing features. However, existing methods struggle to produce unique captions across multiple granularities, limiting their real-world applicability. To address the need for detailed region-level understanding, we introduce the URECA dataset, a large-scale dataset tailored for multi-granularity region captioning. Unlike prior datasets that focus primarily on salient objects, the URECA dataset ensures a unique and consistent mapping between regions and captions by incorporating a diverse set of objects, parts, and background elements. Central to this is a stage-wise data curation pipeline, where each stage incrementally refines region selection and caption generation. By leveraging Multimodal Large Language Models (MLLMs) at each stage, our pipeline produces distinctive and contextually grounded captions with improved accuracy and semantic diversity. Building upon this dataset, we present URECA, a novel captioning model designed to effectively encode multi-granularity regions. URECA maintains essential spatial properties such as position and shape through simple yet impactful modifications to existing MLLMs, enabling fine-grained and semantically rich region descriptions. Our approach introduces dynamic mask modeling and a high-resolution mask encoder to enhance caption uniqueness. Experiments show that URECA achieves state-of-the-art performance on the URECA dataset and generalizes well to existing region-level captioning benchmarks.
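To make the mask-encoding idea concrete, here is a minimal sketch of how a high-resolution mask encoder could preserve position and shape: the binary region mask is downsampled by strided convolutions into mask tokens that are concatenated with the image tokens fed to the MLLM. This is an illustrative reading of the abstract; the module names, sizes, and token counts are assumptions, not the authors' released code.

```python
# Hedged sketch of a high-resolution mask encoder (PyTorch); all names
# and dimensions are assumptions, not URECA's actual implementation.
import torch
import torch.nn as nn

class HighResMaskEncoder(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # Strided convs downsample the full-resolution binary mask while
        # keeping its spatial layout (position and shape) in the grid.
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=4, stride=4), nn.GELU(),
            nn.Conv2d(32, 128, kernel_size=4, stride=4), nn.GELU(),
            nn.Conv2d(128, dim, kernel_size=2, stride=2),
        )

    def forward(self, mask: torch.Tensor) -> torch.Tensor:
        # mask: (B, 1, H, W) binary region mask at image resolution.
        feats = self.net(mask)                   # (B, dim, H/32, W/32)
        return feats.flatten(2).transpose(1, 2)  # (B, N, dim) mask tokens

# Mask tokens are concatenated with the image tokens so the language
# model is conditioned on the exact region being described.
image_tokens = torch.randn(1, 576, 256)          # e.g. from a ViT backbone
mask_tokens = HighResMaskEncoder()(torch.zeros(1, 1, 1024, 1024))
mllm_input = torch.cat([image_tokens, mask_tokens], dim=1)
```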
Related papers
- Describe Anything: Detailed Localized Image and Video Captioning [89.37016119012068]
We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC).
We propose a Semi-supervised learning (SSL)-based Data Pipeline (DLC-SDP) to tackle the scarcity of high-quality DLC data.
DAM sets new state-of-the-art on 7 benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.
arXiv Detail & Related papers (2025-04-22T17:51:41Z)
- Towards Visual Text Grounding of Multimodal Large Language Model [88.0588924255417]
We introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking text-rich image grounding.
Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark.
A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images.
arXiv Detail & Related papers (2025-04-07T12:01:59Z)
- A dual contrastive framework [7.358205057611624]
Region-level visual understanding presents significant challenges for large-scale vision-language models.
We propose AlignCap, a framework designed to enhance region-level understanding through fine-grained alignment of latent spaces.
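The summary above does not specify the objective; as a hedged sketch, fine-grained alignment of region and caption latents is often realized with a symmetric contrastive (InfoNCE) loss like the one below. This is a generic formulation, not necessarily AlignCap's exact loss.

```python
# Generic symmetric InfoNCE over region/text embeddings; an assumed
# form of the alignment objective, not AlignCap's published loss.
import torch
import torch.nn.functional as F

def contrastive_align(region_emb, text_emb, temperature: float = 0.07):
    # region_emb, text_emb: (B, D); row i of each is a matched pair.
    r = F.normalize(region_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = r @ t.T / temperature           # (B, B) similarity matrix
    targets = torch.arange(r.size(0), device=r.device)
    # Pull matched region-text pairs together, push mismatches apart.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```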
arXiv Detail & Related papers (2024-12-13T18:45:18Z)
- FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity [68.15983300711355]
FineCAPTION is a novel VLM that can recognize arbitrary masks as referential inputs and process high-resolution images for compositional image captioning at different levels.
We also introduce COMPOSITIONCAP, a new dataset for multi-grained region compositional image captioning that establishes the task of compositional attribute-aware regional image captioning.
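One simple way to treat an arbitrary mask as a referential input, shown here as a hedged sketch rather than FINECAPTION's actual mask-aware encoder, is mask-weighted average pooling over an image feature map:

```python
# Hedged sketch: mask-weighted pooling turns an arbitrary binary mask
# into a region embedding; not FINECAPTION's exact architecture.
import torch
import torch.nn.functional as F

def mask_pool(feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # feat: (B, C, h, w) image features; mask: (B, 1, H, W) binary mask.
    m = F.interpolate(mask.float(), size=feat.shape[-2:], mode="bilinear")
    # Average features inside the region only (clamp avoids divide-by-0).
    return (feat * m).sum(dim=(2, 3)) / m.sum(dim=(2, 3)).clamp(min=1e-6)
```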
arXiv Detail & Related papers (2024-11-23T02:20:32Z)
- EAGLE: Towards Efficient Arbitrary Referring Visual Prompts Comprehension for Multimodal Large Language Models [80.00303150568696]
We propose a novel Multimodal Large Language Model (MLLM) that enables comprehension of arbitrary referring visual prompts with less training effort than existing approaches.
Our approach embeds referring visual prompts as spatial concepts conveying specific spatial areas comprehensible to the MLLM.
We also propose a Geometry-Agnostic Learning paradigm (GAL) to further disentangle the MLLM's region-level comprehension from the specific formats of referring visual prompts.
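A minimal sketch of the format-agnostic idea (the rendering helper here is hypothetical): heterogeneous prompts such as boxes, points, and masks are rasterized onto one binary spatial map, so downstream region comprehension never depends on the prompt's geometry.

```python
# Hypothetical rasterization of heterogeneous referring prompts into a
# single binary spatial map, making region input format-agnostic.
import numpy as np

def render_prompt(prompt: dict, h: int, w: int) -> np.ndarray:
    canvas = np.zeros((h, w), dtype=np.float32)
    if prompt["type"] == "box":              # (x1, y1, x2, y2) in pixels
        x1, y1, x2, y2 = prompt["coords"]
        canvas[y1:y2, x1:x2] = 1.0
    elif prompt["type"] == "point":          # small disc around (x, y)
        x, y = prompt["coords"]
        ys, xs = np.ogrid[:h, :w]
        canvas[(xs - x) ** 2 + (ys - y) ** 2 <= 5 ** 2] = 1.0
    elif prompt["type"] == "mask":           # binary (h, w) array
        canvas = prompt["coords"].astype(np.float32)
    return canvas  # same representation regardless of prompt geometry
```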
arXiv Detail & Related papers (2024-09-25T08:22:00Z)
- Zero-shot Text-guided Infinite Image Synthesis with LLM guidance [2.531998650341267]
There is a lack of text-image paired datasets with high-resolution and contextual diversity.
Expanding images based on text requires global coherence and rich local context understanding.
We propose a novel approach utilizing Large Language Models (LLMs) for both global coherence and local context understanding.
arXiv Detail & Related papers (2024-07-17T15:10:01Z)
- RegionGPT: Towards Region Understanding Vision Language Model [88.42271128373191]
RegionGPT (RGPT for short) is a novel framework designed for complex region-level captioning and understanding.
We develop an automated region caption data generation pipeline, enriching the training set with detailed region-level captions.
We demonstrate that a universal RGPT model can be effectively applied across a range of region-level tasks, significantly enhancing performance.
arXiv Detail & Related papers (2024-03-04T18:58:08Z)
- CLIM: Contrastive Language-Image Mosaic for Region Representation [58.05870131126816]
Contrastive Language-Image Mosaic (CLIM) is a novel approach for aligning region and text representations.
CLIM consistently improves different open-vocabulary object detection methods.
It can effectively enhance the region representation of vision-language models.
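The mosaic idea can be pictured as in the sketch below: several images are tiled onto one canvas, and each tile's bounding box serves as a pseudo region paired with that image's caption for region-text contrastive training. This is a schematic reading of the abstract, not the released code.

```python
# Schematic 2x2 mosaic: each source image becomes a pseudo region whose
# bounding box is paired with its own caption for region-text contrast.
import torch

def make_mosaic(images: torch.Tensor):
    # images: (4, C, H, W) -> mosaic (C, 2H, 2W) plus four region boxes.
    _, c, h, w = images.shape
    mosaic = torch.zeros(c, 2 * h, 2 * w)
    boxes = []
    for i, img in enumerate(images):
        row, col = divmod(i, 2)
        mosaic[:, row * h:(row + 1) * h, col * w:(col + 1) * w] = img
        boxes.append((col * w, row * h, (col + 1) * w, (row + 1) * h))
    return mosaic, boxes  # boxes act as ground-truth pseudo regions
```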
arXiv Detail & Related papers (2023-12-18T17:39:47Z)
- Self-supervised Scene Text Segmentation with Object-centric Layered Representations Augmented by Text Regions [22.090074821554754]
We propose a self-supervised scene text segmentation algorithm that uses layered decoupling of object-centric representations to segment images into text and background.
On several public scene text datasets, our method outperforms the state-of-the-art unsupervised segmentation algorithms.
arXiv Detail & Related papers (2023-08-25T05:00:05Z)
- Diverse Image Captioning with Context-Object Split Latent Spaces [22.95979735707003]
We introduce a novel factorization of the latent space, termed context-object split, to model diversity in contextual descriptions across images and texts.
Our framework not only enables diverse captioning through context-based pseudo supervision, but also extends this to images with novel objects that lack paired captions in the training data.
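As a hedged illustration of the factorization (module names and sizes are assumptions), the latent is simply partitioned into an object factor and a context factor, so a novel object's identity can be recombined with a familiar context at caption time:

```python
# Illustrative context-object split: the latent is partitioned into an
# object factor and a context factor; not the paper's exact model.
import torch
import torch.nn as nn

class ContextObjectSplit(nn.Module):
    def __init__(self, feat_dim: int = 512, z_obj: int = 64, z_ctx: int = 64):
        super().__init__()
        self.to_obj = nn.Linear(feat_dim, z_obj)  # object identity factor
        self.to_ctx = nn.Linear(feat_dim, z_ctx)  # context description factor

    def forward(self, feat: torch.Tensor):
        # feat: (B, feat_dim) image feature; returns the two factors.
        # Captions for novel objects can pair a new object factor with a
        # familiar context factor learned via context-based supervision.
        return self.to_obj(feat), self.to_ctx(feat)
```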
arXiv Detail & Related papers (2020-11-02T13:33:20Z)