MulCLIP: A Multi-level Alignment Framework for Enhancing Fine-grained Long-context CLIP
- URL: http://arxiv.org/abs/2512.07128v1
- Date: Mon, 08 Dec 2025 03:23:41 GMT
- Title: MulCLIP: A Multi-level Alignment Framework for Enhancing Fine-grained Long-context CLIP
- Authors: Chau Truong, Hieu Ta Quang, Dung D. Le
- Abstract summary: MulCLIP is an end-to-end framework that bridges natural long-text structures with image components. It preserves global contrastive alignment between images and both summary and long captions. It extends positional embeddings for longer text sequences.
- Score: 4.6096940605642915
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models like CLIP show an impressive ability to align images and text, but their training on short, concise captions makes them struggle with lengthy, detailed descriptions. Recent advances mitigate this challenge by leveraging region-proposal information to map visual regions to corresponding sentences from lengthy captions, yet incur notable deployment costs. We introduce MulCLIP, a novel end-to-end multi-level alignment framework that bridges natural long-text structures with image components. MulCLIP first preserves global contrastive alignment between images and both summary and long captions, while extending positional embeddings for longer text sequences. To further enhance fine-grained understanding, we propose two novel strategies: (1) a token reconstruction alignment over locally calibrated features to strengthen semantic connections between words and image patches, and (2) a subcaption-aggregated patch alignment that automatically extracts and aggregates context-rich patches for each subcaption. Experimental results across diverse benchmarks demonstrate that our method consistently improves downstream performance, and ablation studies confirm that its multi-scale alignment is the key factor behind its stronger fine-grained capability compared with region-proposal-assisted approaches, making it particularly suitable for diverse real-world applications.
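Since the abstract only sketches the method, the snippet below is a minimal, hypothetical PyTorch sketch of two of the reusable ideas it names: interpolating CLIP's text positional embeddings so the encoder can read captions longer than its original limit, and a multi-level loss that combines global image-caption contrastive alignment (for both summary and long captions) with a subcaption-to-patch alignment. All function names, tensor shapes, and the exact form of each loss term are assumptions for illustration, not the authors' implementation, and the token reconstruction alignment component is omitted.

```python
# Illustrative sketch only; not the MulCLIP reference code.
import torch
import torch.nn.functional as F


def extend_positional_embeddings(pos_emb: torch.Tensor, new_len: int) -> torch.Tensor:
    """Stretch a (old_len, dim) positional-embedding table to new_len by linear
    interpolation, one common way to let CLIP's text encoder read sequences
    longer than its original 77-token limit (assumed here, not the paper's exact scheme)."""
    old_len, dim = pos_emb.shape
    # Interpolate along the sequence axis: (1, dim, old_len) -> (1, dim, new_len).
    stretched = F.interpolate(
        pos_emb.T.unsqueeze(0), size=new_len, mode="linear", align_corners=True
    )
    return stretched.squeeze(0).T  # (new_len, dim)


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over matched rows of a and b, both shaped (B, dim)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.T / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))


def subcaption_patch_alignment(patch_feats: torch.Tensor,
                               subcap_feats: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """One plausible reading of 'subcaption-aggregated patch alignment':
    for each subcaption, softly pool the patches it attends to most, then pull
    the pooled patch feature toward that subcaption."""
    # patch_feats: (B, P, dim), subcap_feats: (B, S, dim)
    p = F.normalize(patch_feats, dim=-1)
    s = F.normalize(subcap_feats, dim=-1)
    attn = torch.softmax(torch.einsum("bsd,bpd->bsp", s, p) / temperature, dim=-1)
    pooled = torch.einsum("bsp,bpd->bsd", attn, p)      # (B, S, dim)
    sim = F.cosine_similarity(pooled, s, dim=-1)        # (B, S)
    return (1.0 - sim).mean()


if __name__ == "__main__":
    B, P, S, dim = 4, 49, 5, 512
    img_global = torch.randn(B, dim)      # global image features
    summary_txt = torch.randn(B, dim)     # summary-caption features
    long_txt = torch.randn(B, dim)        # long-caption features
    patches = torch.randn(B, P, dim)      # image patch tokens
    subcaps = torch.randn(B, S, dim)      # subcaption features

    pos = torch.randn(77, dim)
    pos_long = extend_positional_embeddings(pos, new_len=248)  # e.g. 248 tokens

    loss = (info_nce(img_global, summary_txt)
            + info_nce(img_global, long_txt)
            + subcaption_patch_alignment(patches, subcaps))
    print(pos_long.shape, float(loss))
```

In practice the image, caption, and subcaption features would come from CLIP's encoders rather than random tensors; the sketch only shows the assumed shapes and how global and fine-grained loss terms could be combined.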
Related papers
- Boosting Point-supervised Temporal Action Localization via Text Refinement and Alignment [66.80402022104074]
We propose a Text Refinement and Alignment (TRA) framework that effectively utilizes textual features from visual descriptions to complement the visual features, as they are semantically rich. This is achieved by designing two new modules for the original point-supervised framework: a Point-based Text Refinement module (PTR) and a Point-based Multimodal Alignment module (PMA).
arXiv Detail & Related papers (2026-02-01T14:35:46Z)
- PixCLIP: Achieving Fine-grained Visual Language Understanding via Any-granularity Pixel-Text Alignment Learning [31.386303698437214]
We propose PixCLIP, a novel framework designed to concurrently accommodate visual prompt inputs and process lengthy textual descriptions. We replace CLIP's original text encoder with the LLM and propose a three-branch pixel-text alignment learning framework. Experiments demonstrate that PixCLIP showcases breakthroughs in pixel-level interaction and handling long-form texts, achieving state-of-the-art performance.
arXiv Detail & Related papers (2025-11-06T17:54:12Z)
- Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection [65.29550320117526]
We propose a novel framework named FineGrainedAD to improve anomaly localization performance. Experiments demonstrate that the proposed FineGrainedAD achieves superior overall performance in few-shot settings.
arXiv Detail & Related papers (2025-10-30T13:09:00Z)
- Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models [57.357091028792325]
Vision-language models (VLMs) pre-trained on web-scale data exhibit promising zero-shot generalization but often suffer from semantic misalignment. We propose a novel constrained prompt enhancement (CPE) method to improve visual-textual alignment. Our approach consists of two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS).
arXiv Detail & Related papers (2025-08-24T15:45:22Z)
- FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs [0.351124620232225]
FineLIP enhances cross-modal text-image mapping by incorporating Fine-grained alignment with Longer text inputs. FineLIP first extends the positional embeddings to handle longer text, followed by the dynamic aggregation of local image and text tokens. We validate our model on datasets with long, detailed captions across two tasks: zero-shot cross-modal retrieval and text-to-image generation.
arXiv Detail & Related papers (2025-04-02T17:19:59Z)
- GOAL: Global-local Object Alignment Learning [7.9061560322289335]
Vision-language models like CLIP have shown impressive capabilities in aligning images and text. They often struggle with lengthy and detailed text descriptions because of their training focus on short and concise captions. We present GOAL, a novel fine-tuning method that enhances CLIP's ability to handle lengthy text.
arXiv Detail & Related papers (2025-03-22T14:27:32Z)
- Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training [30.071860810401933]
This paper advances contrastive language-image pre-training (CLIP) into a novel holistic paradigm. We use image-to-text captioning to generate multi-texts for each image, from multiple perspectives, granularities, and hierarchies. Our holistic CLIP significantly outperforms existing CLIP across tasks including image-text retrieval, open-vocabulary classification, and dense visual tasks.
arXiv Detail & Related papers (2024-11-30T11:27:58Z)
- Contrastive Localized Language-Image Pre-Training [60.4967533101887]
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations. We propose Contrastive Localized Language-Image Pre-training (CLOC) by complementing CLIP with region-text contrastive loss and modules. CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks.
arXiv Detail & Related papers (2024-10-03T17:56:09Z)
- UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding [90.74967596080982]
This paper extends Contrastive Language-Image Pre-training (CLIP) with multi-granularity alignment.
We develop a Unified Multi-Granularity learning framework, termed UMG-CLIP, which simultaneously empowers the model with versatile perception abilities.
With parameter efficient tuning, UMG-CLIP surpasses current widely used CLIP variants and achieves state-of-the-art performance on diverse image understanding benchmarks.
arXiv Detail & Related papers (2024-01-12T06:35:09Z)
- SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment [11.556516260190737]
Multimodal alignment between language and vision is the fundamental topic in current vision-language model research.
Building on Contrastive Captioners (CoCa), which integrates Contrastive Language-Image Pretraining (CLIP) and Image Captioning (IC) into a unified framework, this paper symmetrizes the multimodal alignment with attentive masking.
arXiv Detail & Related papers (2024-01-04T08:42:36Z)
- Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation [80.48979302400868]
We focus on open vocabulary instance segmentation to expand a segmentation model to classify and segment instance-level novel categories.
Previous approaches have relied on massive caption datasets and complex pipelines to establish one-to-one mappings between image regions and captions in nouns.
We devise a joint Caption Grounding and Generation (CGG) framework, which incorporates a novel grounding loss that focuses only on matching object nouns to improve learning efficiency.
arXiv Detail & Related papers (2023-01-02T18:52:12Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)