Related papers: Refining CLIP's Spatial Awareness: A Visual-Centric Perspective

Refining CLIP's Spatial Awareness: A Visual-Centric Perspective

URL: http://arxiv.org/abs/2504.02328v1
Date: Thu, 03 Apr 2025 07:04:56 GMT
Title: Refining CLIP's Spatial Awareness: A Visual-Centric Perspective
Authors: Congpei Qiu, Yanhao Wu, Wei Ke, Xiuxiu Bai, Tong Zhang,
Abstract summary: Contrastive Language-Image Pre-training excels in global alignment with language but exhibits limited sensitivity to spatial information.<n>Recent approaches have introduced Region-Language Alignment to enhance CLIP's performance in dense multimodal tasks.<n>We propose the Spatial Correlation Distillation (SCD) framework, which preserves CLIP's inherent spatial structure and mitigates the above degradation.
Score: 10.936397225984107
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Contrastive Language-Image Pre-training (CLIP) excels in global alignment with language but exhibits limited sensitivity to spatial information, leading to strong performance in zero-shot classification tasks but underperformance in tasks requiring precise spatial understanding. Recent approaches have introduced Region-Language Alignment (RLA) to enhance CLIP's performance in dense multimodal tasks by aligning regional visual representations with corresponding text inputs. However, we find that CLIP ViTs fine-tuned with RLA suffer from notable loss in spatial awareness, which is crucial for dense prediction tasks. To address this, we propose the Spatial Correlation Distillation (SCD) framework, which preserves CLIP's inherent spatial structure and mitigates the above degradation. To further enhance spatial correlations, we introduce a lightweight Refiner that extracts refined correlations directly from CLIP before feeding them into SCD, based on an intriguing finding that CLIP naturally captures high-quality dense features. Together, these components form a robust distillation framework that enables CLIP ViTs to integrate both visual-language and visual-centric improvements, achieving state-of-the-art results across various open-vocabulary dense prediction benchmarks.

Related papers

SuperCLIP: CLIP with Simple Classification Supervision [88.86549733903314]
Contrastive Language-Image Pretraining achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space.<n>Recent findings show that CLIP-like models still underutilize fine-grained semantic signals in text.<n>We propose SuperCLIP, a framework that augments contrastive learning with classification-based supervision.
arXiv Detail & Related papers (2025-12-16T15:11:53Z)
Structure-Aware Feature Rectification with Region Adjacency Graphs for Training-Free Open-Vocabulary Semantic Segmentation [22.409969687852506]
We propose a structure-aware feature rectification approach that incorporates instance-specific priors derived directly from the image.<n>Our method effectively suppresses segmentation noise, improves region-level consistency, and achieves strong performance on multiple open-vocabulary segmentation benchmarks.
arXiv Detail & Related papers (2025-12-08T10:00:36Z)
HarmoCLIP: Harmonizing Global and Regional Representations in Contrastive Vision-Language Models [63.87966115136411]
HarmoCLIP is a novel framework designed to harmonize global and region representations within Contrastive Language-Image Pre-training.<n>To strengthen the representation capability at the local level, our method introduces a novel Region-Language Alignment supervision strategy.
arXiv Detail & Related papers (2025-11-27T16:24:53Z)
Target Refocusing via Attention Redistribution for Open-Vocabulary Semantic Segmentation: An Explainability Perspective [47.99651635870674]
We propose a training-free approach that emulates human distraction-refocusing behavior to redirect attention from distraction tokens back to target regions.<n>Our method achieves SOTA performance on eight benchmarks while maintaining high inference efficiency.
arXiv Detail & Related papers (2025-11-20T09:16:33Z)
Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception [71.26728044621458]
DeCLIP is a novel framework that enhances CLIP by decoupling the self-attention module to obtain content'' and context'' features respectively.<n>It consistently achieves state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.
arXiv Detail & Related papers (2025-08-15T06:43:51Z)
Partial CLIP is Enough: Chimera-Seg for Zero-shot Semantic Segmentation [55.486872677160015]
We propose Chimera-Seg, which integrates a segmentation backbone as the body and a CLIP-based semantic head as the head.<n>Specifically, Chimera-Seg comprises a trainable segmentation model and a CLIP Semantic Head (CSH), which maps dense features into the CLIP-aligned space.<n>We also propose Selective Global Distillation (SGD), which distills knowledge from dense features exhibiting high similarity to the CLIP CLS token.
arXiv Detail & Related papers (2025-06-27T09:26:50Z)
ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction [3.7365850182404845]
Any-to-Any Self-Distillation (ATAS) is a novel approach that simultaneously enhances semantic coherence and fine-grained alignment.<n>ATAS achieves substantial performance gains on open-vocabulary object detection and semantic segmentation benchmarks.
arXiv Detail & Related papers (2025-06-10T10:40:10Z)
DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception [21.87721909270275]
DeCLIP is a novel framework that enhances CLIP with content'' and context'' features respectively.<n>It significantly outperforms existing methods across multiple open-vocabulary dense prediction tasks.
arXiv Detail & Related papers (2025-05-07T13:46:34Z)
Exploring CLIP's Dense Knowledge for Weakly Supervised Semantic Segmentation [19.26516470653798]
Weakly Supervised Semantic (WSSS) with image-level labels aims to achieve pixel-level predictions using Class Maps (CAMs) Recent methods primarily focus on image-text alignment for CAM generation, while CLIP's potential in patch-text alignment remains unexplored. We propose ExCEL to explore CLIP's dense knowledge via a novel patchtext alignment paradigm for WSSS.
arXiv Detail & Related papers (2025-03-26T02:00:49Z)
Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation [19.749490092520006]
Self-Calibrated CLIP (SC-CLIP) is a training-free method that calibrates CLIP to produce finer representations.<n>SC-CLIP boosts the performance of vanilla CLIP ViT-L/14 by 6.8 times.
arXiv Detail & Related papers (2024-11-24T15:14:05Z)
ResCLIP: Residual Attention for Training-free Dense Vision-language Inference [27.551367463011008]
Cross-correlation of self-attention in CLIP's non-final layers also exhibits localization properties. We propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block. The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference.
arXiv Detail & Related papers (2024-11-24T14:14:14Z)
Contrastive Localized Language-Image Pre-Training [60.4967533101887]
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations. We propose Contrastive Localized Language-Image Pre-training (CLOC) by complementing CLIP with region-text contrastive loss and modules. CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks.
arXiv Detail & Related papers (2024-10-03T17:56:09Z)
ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference [32.852004564832455]
We re-investigate the architecture of CLIP, and identify residual connections as the primary source of noise that degrades segmentation quality. We propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation.
arXiv Detail & Related papers (2024-07-17T09:52:20Z)
Open-Vocabulary Segmentation with Semantic-Assisted Calibration [68.41025728960176]
We study open-vocabulary segmentation (OVS) through calibrating in-vocabulary and domain-biased embedding space with contextual prior of CLIP. We present a Semantic-assisted CAlibration Network (SCAN) to achieve state-of-the-art performance on open-vocabulary segmentation benchmarks.
arXiv Detail & Related papers (2023-12-07T07:00:09Z)
Enhancing Few-shot CLIP with Semantic-Aware Fine-Tuning [61.902254546858465]
Methods based on Contrastive Language-Image Pre-training have exhibited promising performance in few-shot adaptation tasks. We propose fine-tuning the parameters of the attention pooling layer during the training process to encourage the model to focus on task-specific semantics.
arXiv Detail & Related papers (2023-11-08T05:18:57Z)
Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition [77.93678598476149]
We establish a novel Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR) By cascading the CLIP image encoder with the reversed CLIP text encoder, a symmetrical structure is built with an image-to-text feature flow. Extensive experiments demonstrate the effectiveness of CLIP-OCR with 93.8% average accuracy on six popular STR benchmarks.
arXiv Detail & Related papers (2023-10-08T04:00:20Z)
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction [67.43527289422978]
We propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs. We achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks.
arXiv Detail & Related papers (2023-10-02T17:58:52Z)
Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP) We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.