PixCLIP: Achieving Fine-grained Visual Language Understanding via Any-granularity Pixel-Text Alignment Learning
- URL: http://arxiv.org/abs/2511.04601v1
- Date: Thu, 06 Nov 2025 17:54:12 GMT
- Title: PixCLIP: Achieving Fine-grained Visual Language Understanding via Any-granularity Pixel-Text Alignment Learning
- Authors: Yicheng Xiao, Yu Chen, Haoxuan Ma, Jiale Hong, Caorui Li, Lingxiang Wu, Haiyun Guo, Jinqiao Wang
- Abstract summary: We propose PixCLIP, a novel framework designed to concurrently accommodate visual prompt inputs and process lengthy textual descriptions. We replace CLIP's original text encoder with an LLM and propose a three-branch pixel-text alignment learning framework. Experiments demonstrate that PixCLIP achieves breakthroughs in pixel-level interaction and handling of long-form texts, reaching state-of-the-art performance.
- Score: 31.386303698437214
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While the Contrastive Language-Image Pretraining (CLIP) model has achieved remarkable success in a variety of downstream vision-language understanding tasks, enhancing its capability for fine-grained image-text alignment remains an active research focus. To this end, most existing works adopt the strategy of explicitly increasing the granularity of visual information processing, e.g., incorporating visual prompts to guide the model to focus on specific local regions within the image. Meanwhile, research on Multimodal Large Language Models (MLLMs) has demonstrated that training with long and detailed textual descriptions can effectively improve the model's fine-grained vision-language alignment. However, the inherent token length limitation of CLIP's text encoder fundamentally restricts its ability to process the more granular textual information embedded in long text sequences. To synergistically leverage the advantages of enhancing both visual and textual content processing granularity, we propose PixCLIP, a novel framework designed to concurrently accommodate visual prompt inputs and process lengthy textual descriptions. Specifically, we first establish an automated annotation pipeline capable of generating pixel-level localized, long-form textual descriptions for images. Utilizing this pipeline, we construct LongGRIT, a high-quality dataset comprising nearly 1.5 million samples. Secondly, we replace CLIP's original text encoder with an LLM and propose a three-branch pixel-text alignment learning framework, facilitating fine-grained alignment between image regions and corresponding textual descriptions at arbitrary granularity. Experiments demonstrate that PixCLIP achieves breakthroughs in pixel-level interaction and handling of long-form texts, reaching state-of-the-art performance.
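To make the alignment objective concrete, the following is a minimal, hypothetical sketch of what a three-branch pixel-text contrastive loss of the kind described above might look like. The branch definitions (image-caption, region-caption, and a cross branch), the mask-weighted pooling scheme, and all tensor shapes are assumptions made for illustration; they are not taken from the paper's implementation, and the LLM text encoder is assumed to already produce one pooled embedding per long caption.

```python
# Hypothetical sketch of a three-branch pixel-text contrastive alignment loss.
# Branch choices, pooling, and shapes are illustrative assumptions, not PixCLIP's code.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over matched rows of two embedding batches."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def masked_pool(patch_tokens, mask):
    """Pool ViT patch tokens inside a binary region mask (pixel-level visual prompt).
    patch_tokens: (B, N, D); mask: (B, N) with 1s on patches covered by the region."""
    weights = mask / mask.sum(dim=1, keepdim=True).clamp(min=1)
    return (weights.unsqueeze(-1) * patch_tokens).sum(dim=1)

def three_branch_alignment_loss(patch_tokens, region_mask,
                                global_text_emb, region_text_emb):
    """Assumed three branches:
      1) whole image   <-> long global caption
      2) masked region <-> region-level long description
      3) masked region <-> global caption (keeps region features grounded
         in full-image context)."""
    image_emb = patch_tokens.mean(dim=1)                 # global pooling
    region_emb = masked_pool(patch_tokens, region_mask)  # region pooling
    loss_global = info_nce(image_emb, global_text_emb)
    loss_region = info_nce(region_emb, region_text_emb)
    loss_cross = info_nce(region_emb, global_text_emb)
    return loss_global + loss_region + loss_cross

# Toy usage with random tensors standing in for ViT patch tokens and
# LLM-produced caption embeddings.
B, N, D = 4, 196, 512
patch_tokens = torch.randn(B, N, D)
region_mask = (torch.rand(B, N) > 0.7).float()
global_text_emb = torch.randn(B, D)
region_text_emb = torch.randn(B, D)
print(three_branch_alignment_loss(patch_tokens, region_mask,
                                  global_text_emb, region_text_emb))
```

Mask-weighted pooling of patch tokens is one simple way to turn a pixel-level visual prompt into a region embedding; the actual PixCLIP design may use a different mechanism.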
Related papers
- MulCLIP: A Multi-level Alignment Framework for Enhancing Fine-grained Long-context CLIP [4.6096940605642915]
MulCLIP is an end-to-end framework that bridges natural long-text structures with image components. It preserves global contrastive alignment between images and both summary and long captions. It extends positional embeddings for longer text sequences.
arXiv Detail & Related papers (2025-12-08T03:23:41Z) - FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs [0.351124620232225]
FineLIP enhances cross-modal text-image mapping by incorporating Fine-grained alignment with Longer text inputs. FineLIP first extends the positional embeddings to handle longer text (see the sketch after this list), followed by the dynamic aggregation of local image and text tokens. We validate our model on datasets with long, detailed captions across two tasks: zero-shot cross-modal retrieval and text-to-image generation.
arXiv Detail & Related papers (2025-04-02T17:19:59Z) - GOAL: Global-local Object Alignment Learning [7.9061560322289335]
Vision-language models like CLIP have shown impressive capabilities in aligning images and text. They often struggle with lengthy and detailed text descriptions because of their training focus on short and concise captions. We present GOAL, a novel fine-tuning method that enhances CLIP's ability to handle lengthy text.
arXiv Detail & Related papers (2025-03-22T14:27:32Z) - FLAIR: VLM with Fine-grained Language-informed Image Representations [49.2684130383925]
FLAIR is an approach that utilizes long and detailed image descriptions to learn localized image embeddings. Our experiments demonstrate the effectiveness of FLAIR trained on 30M image-text pairs in capturing fine-grained visual information.
arXiv Detail & Related papers (2024-12-04T18:56:04Z) - Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z) - From Text to Pixel: Advancing Long-Context Understanding in MLLMs [70.78454154014989]
We introduce SEEKER, a multimodal large language model designed to tackle this issue.
SEEKER aims to optimize the compact encoding of long text by compressing the text sequence into the visual pixel space via images.
Our experiments on six long-context multimodal tasks demonstrate that SEEKER can leverage fewer image tokens to convey the same amount of textual information compared with the OCR-based approach.
arXiv Detail & Related papers (2024-05-23T06:17:23Z) - An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation [21.154973705998945]
Existing methods leverage the text encoder of the CLIP model to represent input prompts.
Large Language Models (LLMs) offer multilingual input, accommodate longer context, and achieve superior text representation.
We propose a lightweight adapter that enables fast training of the text-to-image model using the textual representations from LLMs.
arXiv Detail & Related papers (2024-05-21T16:35:02Z) - TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding [91.30065932213758]
Large Multimodal Models (LMMs) have sparked a surge in research aimed at harnessing their remarkable reasoning abilities.
We propose TextCoT, a novel Chain-of-Thought framework for text-rich image understanding.
Our method is free of extra training, offering immediate plug-and-play functionality.
arXiv Detail & Related papers (2024-04-15T13:54:35Z) - Vision-Aware Text Features in Referring Image Segmentation: From Object Understanding to Context Understanding [26.768147543628096]
We propose a novel framework that emphasizes object and context comprehension inspired by human cognitive processes.
Our method achieves significant performance improvements on three benchmark datasets.
arXiv Detail & Related papers (2024-04-12T16:38:48Z) - Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
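Both MulCLIP and FineLIP above report extending CLIP's positional embeddings so the text encoder can accept longer captions. The snippet below is a generic, hypothetical sketch of one common way such an extension can be done, by linearly interpolating the learned positional-embedding table to a new context length. The exact procedure used in those papers may differ, and the 77- and 248-token lengths are illustrative assumptions.

```python
# Hypothetical sketch: stretch a CLIP-style text positional-embedding table
# to a longer context length via 1D linear interpolation.
import torch
import torch.nn.functional as F

def extend_positional_embeddings(pos_emb: torch.Tensor, new_len: int) -> torch.Tensor:
    """Resize a (old_len, dim) positional-embedding table to (new_len, dim).

    Interpolation keeps the learned positional pattern roughly intact, so the
    extended table can be fine-tuned on long captions instead of re-learned from scratch.
    """
    old_len, dim = pos_emb.shape
    # F.interpolate expects (batch, channels, length)
    resized = F.interpolate(
        pos_emb.t().unsqueeze(0),   # (1, dim, old_len)
        size=new_len,
        mode="linear",
        align_corners=True,
    )
    return resized.squeeze(0).t()   # (new_len, dim)

# Toy usage: stretch a 77-token table to 248 tokens (lengths are illustrative).
clip_pos_emb = torch.randn(77, 512)
long_pos_emb = extend_positional_embeddings(clip_pos_emb, 248)
print(long_pos_emb.shape)  # torch.Size([248, 512])
```

After resizing, the table would typically be fine-tuned on long-caption data rather than used as-is.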
This list is automatically generated from the titles and abstracts of the papers on this site.