Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning
- URL: http://arxiv.org/abs/2501.13893v1
- Date: Thu, 23 Jan 2025 18:08:57 GMT
- Title: Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning
- Authors: Zuyao You, Junke Wang, Lingyu Kong, Bo He, Zuxuan Wu
- Abstract summary: We present Pix2Cap-COCO, the first panoptic pixel-level caption dataset designed to advance fine-grained visual understanding.
The automated annotation pipeline yields 167,254 detailed captions, with an average of 22.94 words per caption.
We also introduce a novel task, panoptic segmentation-captioning, which challenges models to recognize instances in an image and provide detailed descriptions for each simultaneously.
- Score: 36.33160773256632
- License:
- Abstract: We present Pix2Cap-COCO, the first panoptic pixel-level caption dataset designed to advance fine-grained visual understanding. To achieve this, we carefully design an automated annotation pipeline that prompts GPT-4V to generate pixel-aligned, instance-specific captions for individual objects within images, enabling models to learn more granular relationships between objects and their contexts. This approach results in 167,254 detailed captions, with an average of 22.94 words per caption. Building on Pix2Cap-COCO, we introduce a novel task, panoptic segmentation-captioning, which challenges models to recognize instances in an image and provide detailed descriptions for each simultaneously. To benchmark this task, we design a robust baseline based on X-Decoder. The experimental results demonstrate that Pix2Cap-COCO is a particularly challenging dataset, as it requires models to excel in both fine-grained visual understanding and detailed language generation. Furthermore, we leverage Pix2Cap-COCO for Supervised Fine-Tuning (SFT) on large multimodal models (LMMs) to enhance their performance. For example, training with Pix2Cap-COCO significantly improves the performance of GPT4RoI, yielding gains of +1.4% CIDEr, +0.4% ROUGE, and +0.5% SPICE on the Visual Genome dataset, and strengthens its region-understanding ability on ViP-BENCH, with an overall improvement of +5.1%, including notable increases in recognition accuracy (+11.2%) and language generation quality (+22.2%).
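The annotation recipe (isolate each panoptic instance, then prompt GPT-4V for an instance-specific description) can be sketched roughly as follows. The mask-highlighting strategy, the prompt wording, and the query_gpt4v helper are illustrative assumptions, not the authors' released pipeline.

```python
# Hypothetical sketch of a pixel-level caption annotation pipeline.
# query_gpt4v() is a stand-in for whatever GPT-4V client the authors
# used; the prompt and the dimming trick are our assumptions.
import base64
import io

import numpy as np
from PIL import Image

def highlight_instance(image: Image.Image, mask: np.ndarray) -> Image.Image:
    """Dim everything outside the instance mask so the model focuses on it."""
    arr = np.asarray(image).astype(np.float32)
    dimmed = arr * 0.3
    out = np.where(mask[..., None], arr, dimmed).astype(np.uint8)
    return Image.fromarray(out)

def to_base64(image: Image.Image) -> str:
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()

def caption_instances(image, panoptic_masks, categories, query_gpt4v):
    """Generate one detailed, pixel-aligned caption per panoptic instance."""
    captions = {}
    for inst_id, mask in panoptic_masks.items():
        focused = highlight_instance(image, mask)
        prompt = (
            f"Describe the highlighted {categories[inst_id]} in detail: "
            "its appearance, state, and relationship to surrounding objects."
        )
        captions[inst_id] = query_gpt4v(to_base64(focused), prompt)
    return captions
```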
Related papers
- COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation [38.09277249986138]
COCONut-PanCap dataset incorporates fine-grained, region-level captions grounded in panoptic segmentation masks.
COCONut-PanCap supports improved training of vision-language models for image understanding and generative models for text-to-image tasks.
arXiv Detail & Related papers (2025-02-04T18:59:46Z)
- TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens [9.453667770656644]
We present TextHawk2, a bilingual LVLM featuring efficient fine-grained perception and demonstrating cutting-edge performance across general-purpose, OCR, and grounding tasks with 16 times fewer image tokens.
We enhance the visual encoder through LVLM co-training, unlocking its potential for previously unseen tasks like Chinese OCR and grounding.
We assess TextHawk2 across multiple benchmarks, where it consistently delivers superior performance and outperforms closed-source models of similar scale.
arXiv Detail & Related papers (2024-10-07T17:58:35Z)
- HoloHisto: End-to-end Gigapixel WSI Segmentation with 4K Resolution Sequential Tokenization [21.1691961979094]
In digital pathology, deep learning-based image segmentation has traditionally involved a two-stage process.
We propose the holistic histopathology (HoloHisto) segmentation method to achieve end-to-end segmentation on gigapixel WSIs.
Under the HoloHisto platform, we unveil a random 4K sampler that delivers 31 and 10 times more pixels than standard 2D and 3D patches, respectively.
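The pixel-count claim is easy to sanity-check. Assuming a 3840x2160 4K tile, 512x512 2D patches, and 96^3 3D patches (common defaults; the blurb does not state the exact sizes), the ratios come out near the quoted 31x and 10x:

```python
# Sanity check of the "31x / 10x more pixels" claim, assuming a
# 3840x2160 4K tile, 512x512 2D patches, and 96^3 3D patches
# (typical defaults; the exact sizes are our assumption).
four_k = 3840 * 2160      # 8,294,400 pixels per 4K tile
patch_2d = 512 * 512      # 262,144 pixels per 2D patch
patch_3d = 96 * 96 * 96   # 884,736 voxels per 3D patch

print(four_k / patch_2d)  # ~31.6x more pixels than a 2D patch
print(four_k / patch_3d)  # ~9.4x more than a 3D patch
```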
arXiv Detail & Related papers (2024-07-03T17:49:31Z)
- RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning [69.23782518456932]
We propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA).
We bridge video and text using four key models: a general video-text retrieval model XCLIP, a general image-text matching model CLIP, a text alignment model AnglE, and a text generation model GPT-2.
Since all four models (GPT-2, XCLIP, CLIP, and AnglE) remain frozen, we propose using learnable tokens as a communication medium among them.
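A minimal PyTorch sketch of the general pattern: a small set of soft tokens is optimized at test time while every model stays frozen. The loss callables below stand in for RETTA's retrieval and matching objectives and are not the paper's exact formulation.

```python
# Test-time adaptation via learnable soft tokens; all models frozen.
# frozen_scorers are placeholder callables for the paper's objectives.
import torch

def adapt_tokens(frozen_scorers, num_tokens=8, dim=768, steps=16, lr=1e-2):
    """Optimize soft tokens against frozen scoring models.

    frozen_scorers: callables mapping the soft tokens to a scalar loss,
    e.g. video-text retrieval or image-text matching scores.
    """
    tokens = torch.nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
    opt = torch.optim.AdamW([tokens], lr=lr)
    for _ in range(steps):
        loss = sum(score(tokens) for score in frozen_scorers)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return tokens.detach()  # fed to the frozen GPT-2 as a soft prompt
```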
arXiv Detail & Related papers (2024-05-11T16:22:00Z)
- PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation [110.10627872744254]
We introduce PixArt-Sigma, a Diffusion Transformer model capable of directly generating images at 4K resolution.
PixArt-Sigma offers images of markedly higher fidelity and improved alignment with text prompts than its predecessor, PixArt-alpha.
arXiv Detail & Related papers (2024-03-07T17:41:37Z)
- xT: Nested Tokenization for Larger Context in Large Images [79.37673340393475]
xT is a framework for vision transformers which aggregates global context with local details.
We are able to increase accuracy by up to 8.6% on challenging classification tasks.
arXiv Detail & Related papers (2024-03-04T10:29:58Z)
- CoFiI2P: Coarse-to-Fine Correspondences for Image-to-Point Cloud Registration [9.57539651520755]
CoFiI2P is a novel I2P registration network that extracts correspondences in a coarse-to-fine manner.
In the coarse matching phase, a novel I2P transformer module is employed to capture both homogeneous and heterogeneous global information.
In the fine matching module, point/pixel pairs are established with the guidance of super-point/super-pixel correspondences.
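A toy numpy sketch of the coarse-to-fine idea: super-point features are matched to super-pixel features first, and point/pixel matching is then restricted to each coarse pair. The cosine-similarity choice and the ownership arrays are illustrative assumptions; feature extraction (CoFiI2P's I2P transformer) is out of scope here.

```python
# Coarse-to-fine correspondence sketch with placeholder features.
import numpy as np

def cosine_sim(a, b):
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def coarse_to_fine(sp_feat, spx_feat, pt_feat, px_feat, pt_owner, px_owner):
    # Coarse phase: each super-point picks its best super-pixel.
    coarse = cosine_sim(sp_feat, spx_feat).argmax(axis=1)
    pairs = []
    for sp, spx in enumerate(coarse):
        pts = np.where(pt_owner == sp)[0]    # points in this super-point
        pxs = np.where(px_owner == spx)[0]   # pixels in this super-pixel
        if len(pts) == 0 or len(pxs) == 0:
            continue
        # Fine phase: match individual points to pixels inside the pair.
        fine = cosine_sim(pt_feat[pts], px_feat[pxs]).argmax(axis=1)
        pairs += [(p, pxs[i]) for p, i in zip(pts, fine)]
    return pairs
```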
arXiv Detail & Related papers (2023-09-26T04:32:38Z)
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
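Schematically, generation reduces to sampling a sequence of discrete image tokens conditioned on text and decoding them back to pixels. In the sketch below, seq2seq and vqgan_decoder are placeholder callables, not the released Parti API, and greedy decoding stands in for the paper's actual sampling.

```python
# Schematic Parti-style loop: text in, discrete image tokens out,
# then a ViT-VQGAN-style decoder maps tokens back to pixels.
import torch

@torch.no_grad()
def generate_image(text_ids, seq2seq, vqgan_decoder,
                   bos_id=0, num_image_tokens=1024):
    """Autoregressively sample image tokens conditioned on text."""
    tokens = [bos_id]                        # start-of-image token
    for _ in range(num_image_tokens):
        prev = torch.tensor([tokens], dtype=torch.long)
        logits = seq2seq(text_ids, prev)     # assumed shape [1, t, vocab]
        tokens.append(logits[0, -1].argmax().item())  # greedy step
    grid = torch.tensor(tokens[1:]).view(1, 32, 32)   # drop BOS; 32x32 grid
    return vqgan_decoder(grid)               # discrete tokens -> RGB image
```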
arXiv Detail & Related papers (2022-06-22T01:11:29Z)
- Improved Bengali Image Captioning via deep convolutional neural network based encoder-decoder model [0.8793721044482612]
This paper presents an end-to-end image captioning system utilizing a multimodal architecture.
Our approach's language encoder captures the fine-grained information in the caption, and, combined with the image features, it generates accurate and diversified captions.
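A minimal PyTorch captioner in this spirit: a CNN feature vector initializes an RNN language decoder. The layer sizes, the GRU choice, and the tanh-projection fusion are illustrative assumptions, not this paper's exact architecture.

```python
# Minimal multimodal encoder-decoder captioner: image features
# initialize the hidden state of a GRU language decoder.
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, img_dim=2048, emb_dim=256, hid_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hid_dim)   # image branch
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, img_feat, caption_ids):
        # img_feat: [B, img_dim] from a CNN; caption_ids: [B, T].
        h0 = torch.tanh(self.img_proj(img_feat)).unsqueeze(0)  # init state
        emb = self.embed(caption_ids)
        hidden, _ = self.gru(emb, h0)
        return self.out(hidden)  # next-token logits per position
```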
arXiv Detail & Related papers (2021-02-14T16:44:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.