Helping CLIP See Both the Forest and the Trees: A Decomposition and Description Approach
- URL: http://arxiv.org/abs/2507.03458v1
- Date: Fri, 04 Jul 2025 10:24:26 GMT
- Title: Helping CLIP See Both the Forest and the Trees: A Decomposition and Description Approach
- Authors: Leyan Xue, Zongbo Han, Guangyu Wang, Qinghua Hu, Mingyue Cheng, Changqing Zhang
- Abstract summary: Vision-Language Models (VLMs) like CLIP achieve cross-modal alignment through contrastive learning. Traditional prompt engineering relies on coarse-grained category labels, neglecting fine-grained local semantics. We propose a plug-and-play solution that enables CLIP to process localized visual descriptors.
- Score: 43.419607730361996
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Models (VLMs) like CLIP achieve cross-modal semantic alignment through contrastive learning, exhibiting robust zero-shot generalization. Traditional prompt engineering, however, predominantly relies on coarse-grained category labels, neglecting fine-grained local semantics. Existing approaches assume that VLMs inherently recognize localized visual details and attempt to enhance classification by augmenting text prompts with attribute descriptors generated by large language models. However, our systematic experiments reveal critical limitations: CLIP's strong bias toward global image patterns hinders its ability to process localized visual descriptors. To address this fundamental constraint, we propose a simple, effective, and plug-and-play solution that enables CLIP to "See Both the Forest and the Trees." Specifically, we employ stochastic multi-crop augmentation to activate CLIP's latent capacity for localized feature analysis. By cropping only partial regions, the approach effectively constrains the model's receptive field and recalibrates its attention mechanism, thereby mitigating its inherent bias. We evaluate the proposed method under zero-shot, few-shot, and test-time adaptation settings, and extensive experiments demonstrate that the proposed Decomposition and Description (D&D) approach achieves promising performance.
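The core recipe described in the abstract can be sketched in a few lines: rather than scoring one global image embedding against class prompts, CLIP scores several random partial crops, so each forward pass sees only a local region, and the per-view predictions are aggregated. The following is a minimal illustrative sketch, assuming the OpenAI `clip` package and torchvision; the image path, crop scale, number of crops, and hand-written attribute-style prompts are placeholder assumptions, not the authors' released configuration.

```python
# Minimal sketch of multi-crop zero-shot classification with CLIP, assuming the
# OpenAI `clip` package (github.com/openai/CLIP) and torchvision. The image path,
# prompts, crop scale, and crop count are illustrative placeholders.
import torch
import clip
from PIL import Image
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Stochastic multi-crop: each sampled view covers only a partial region of the
# image, which constrains the effective receptive field to local content.
random_crop = transforms.RandomResizedCrop(224, scale=(0.2, 0.5))

# Class prompts augmented with attribute-style local descriptors (hand-written
# here; in the descriptor-augmentation setting these would come from an LLM).
prompts = [
    "a photo of a cardinal, a bird with a red crest and a short orange beak",
    "a photo of a blue jay, a bird with blue wings and a white chest",
]
text_tokens = clip.tokenize(prompts).to(device)

image = Image.open("bird.jpg").convert("RGB")  # hypothetical input image
num_crops = 8

with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # Encode the full image plus several random local crops.
    views = [preprocess(image)]
    views += [preprocess(random_crop(image)) for _ in range(num_crops)]
    image_features = model.encode_image(torch.stack(views).to(device))
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    # Per-view class probabilities, then a simple average over all views.
    logits = 100.0 * image_features @ text_features.T  # (num_views, num_classes)
    probs = logits.softmax(dim=-1).mean(dim=0)

for prompt, p in zip(prompts, probs):
    print(f"{p.item():.3f}  {prompt}")
```

The fixed descriptor strings above only stand in for the LLM-generated attribute descriptions discussed in the abstract; the part specific to the proposed approach is that each random crop, not just the full image, is compared against those localized descriptions before aggregation.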
Related papers
- Weakly-Supervised Image Forgery Localization via Vision-Language Collaborative Reasoning Framework [16.961220047066792]
ViLaCo is a vision-language collaborative reasoning framework that introduces auxiliary semantic supervision distilled from pre-trained vision-language models. ViLaCo substantially outperforms existing WSIFL methods, achieving state-of-the-art performance in both detection and localization accuracy.
arXiv Detail & Related papers (2025-08-02T12:14:29Z)
- Vocabulary-free Fine-grained Visual Recognition via Enriched Contextually Grounded Vision-Language Model [52.01031460230826]
Traditional approaches rely heavily on fixed vocabularies and closed-set classification paradigms. Recent research has demonstrated that combining large language models with vision-language models (VLMs) makes open-set recognition possible. We propose our training-free method, Enriched-FineR, which demonstrates state-of-the-art results in fine-grained visual recognition.
arXiv Detail & Related papers (2025-07-30T20:06:01Z)
- Dual Guidance Semi-Supervised Action Detection [71.45023660211145]
We present a semi-supervised approach for spatial-temporal action localization. We introduce a dual guidance network to select better pseudo-bounding boxes. Our framework achieves superior results compared to extended image-based semi-supervised baselines.
arXiv Detail & Related papers (2025-07-28T18:08:36Z)
- ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction [3.7365850182404845]
Any-to-Any Self-Distillation (ATAS) is a novel approach that simultaneously enhances semantic coherence and fine-grained alignment. ATAS achieves substantial performance gains on open-vocabulary object detection and semantic segmentation benchmarks.
arXiv Detail & Related papers (2025-06-10T10:40:10Z)
- DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception [21.87721909270275]
DeCLIP is a novel framework that enhances CLIP with decoupled "content" and "context" features. It significantly outperforms existing methods across multiple open-vocabulary dense prediction tasks.
arXiv Detail & Related papers (2025-05-07T13:46:34Z)
- ResCLIP: Residual Attention for Training-free Dense Vision-language Inference [27.551367463011008]
Cross-correlation of self-attention in CLIP's non-final layers also exhibits localization properties.
We propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block.
The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference.
arXiv Detail & Related papers (2024-11-24T14:14:14Z)
- Bootstrap Fine-Grained Vision-Language Alignment for Unified Zero-Shot Anomaly Localization [63.61093388441298]
Contrastive Language-Image Pre-training models have shown promising performance on zero-shot visual recognition tasks.
In this work, we propose AnoCLIP for zero-shot anomaly localization.
arXiv Detail & Related papers (2023-08-30T10:35:36Z)
- A Closer Look at the Explainability of Contrastive Language-Image Pre-training [16.10032166963232]
Contrastive language-image pre-training (CLIP) is a powerful vision-language model that has shown great benefits for various tasks.
We have identified some issues with its explainability, which undermine its credibility and limit the capacity for related tasks.
We propose the CLIP Surgery for reliable CAM, a method that allows surgery-like modifications to the inference architecture and features.
arXiv Detail & Related papers (2023-04-12T07:16:55Z)
- Global Knowledge Calibration for Fast Open-Vocabulary Segmentation [124.74256749281625]
We introduce a text diversification strategy that generates a set of synonyms for each training category.
We also employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP.
Our proposed model achieves robust generalization performance across various datasets.
arXiv Detail & Related papers (2023-03-16T09:51:41Z)
- Learning Common Rationale to Improve Self-Supervised Representation for Fine-Grained Visual Recognition Problems [61.11799513362704]
We propose learning an additional screening mechanism to identify discriminative clues commonly seen across instances and classes.
We show that a common rationale detector can be learned by simply exploiting the GradCAM induced from the SSL objective.
arXiv Detail & Related papers (2023-03-03T02:07:40Z)