microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification
- URL: http://arxiv.org/abs/2510.02270v1
- Date: Thu, 02 Oct 2025 17:47:39 GMT
- Title: microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification
- Authors: Sathira Silva, Eman Ali, Chetan Arora, Muhammad Haris Khan
- Abstract summary: Unsupervised adaptation of CLIP-based vision-language models (VLMs) for fine-grained image classification requires sensitivity to microscopic local cues. We propose $\textbf{microCLIP}$, a self-training framework that jointly refines CLIP's visual and textual representations using fine-grained cues.
- Score: 22.795156284628053
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unsupervised adaptation of CLIP-based vision-language models (VLMs) for fine-grained image classification requires sensitivity to microscopic local cues. While CLIP exhibits strong zero-shot transfer, its reliance on coarse global features restricts its performance on fine-grained classification tasks. Prior efforts inject fine-grained knowledge by aligning large language model (LLM) descriptions with the CLIP $\texttt{[CLS]}$ token; however, this approach overlooks spatial precision. We propose $\textbf{microCLIP}$, a self-training framework that jointly refines CLIP's visual and textual representations using fine-grained cues. At its core is Saliency-Oriented Attention Pooling (SOAP) within a lightweight TokenFusion module, which builds a saliency-guided $\texttt{[FG]}$ token from patch embeddings and fuses it with the global $\texttt{[CLS]}$ token for coarse-fine alignment. To stabilize adaptation, we introduce a two-headed LLM-derived classifier: a frozen classifier that, via multi-view alignment, provides a stable text-based prior for pseudo-labeling, and a learnable classifier initialized from LLM descriptions and fine-tuned with TokenFusion. We further develop Dynamic Knowledge Aggregation, which convexly combines fixed LLM/CLIP priors with TokenFusion's evolving logits to iteratively refine pseudo-labels. Together, these components uncover latent fine-grained signals in CLIP, yielding a consistent $2.90\%$ average accuracy gain across 13 fine-grained benchmarks while requiring only light adaptation. Our code is available at https://github.com/sathiiii/microCLIP.
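For intuition, here is a minimal PyTorch-style sketch of the three mechanisms the abstract names: saliency-guided pooling of patch embeddings into a fine-grained $\texttt{[FG]}$ token, convex coarse-fine fusion with the global $\texttt{[CLS]}$ token, and pseudo-labels obtained by convexly mixing a frozen text-prior classifier's logits with the adapted model's evolving logits. This is not the authors' implementation (see the linked repository for the official code); the names `SaliencyAttentionPool`, `fuse_coarse_fine`, `aggregate_pseudo_labels`, `alpha`, and `lam` are hypothetical, and the details only approximate the abstract's description.

```python
# Illustrative sketch only; the official microCLIP code is at the repository
# linked in the abstract. All module/function names below are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SaliencyAttentionPool(nn.Module):
    """Pools patch embeddings into a single fine-grained [FG] token,
    weighting patches by a learned saliency score (rough analogue of SOAP)."""

    def __init__(self, dim: int):
        super().__init__()
        self.saliency = nn.Linear(dim, 1)  # per-patch saliency logit
        self.proj = nn.Linear(dim, dim)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, D) patch embeddings from the CLIP visual encoder
        weights = self.saliency(patch_tokens).softmax(dim=1)  # (B, N, 1)
        fg_token = (weights * patch_tokens).sum(dim=1)        # (B, D)
        return self.proj(fg_token)


def fuse_coarse_fine(cls_token: torch.Tensor, fg_token: torch.Tensor,
                     alpha: float = 0.5) -> torch.Tensor:
    """Convexly combines the global [CLS] token with the saliency-guided
    [FG] token to obtain a coarse-fine image representation."""
    fused = alpha * cls_token + (1.0 - alpha) * fg_token
    return F.normalize(fused, dim=-1)


def aggregate_pseudo_labels(prior_logits: torch.Tensor,
                            evolving_logits: torch.Tensor,
                            lam: float = 0.7) -> torch.Tensor:
    """Pseudo-labels from a convex mix of a fixed LLM/CLIP prior with the
    adapted model's evolving logits (in the spirit of Dynamic Knowledge
    Aggregation)."""
    probs = lam * prior_logits.softmax(-1) + (1.0 - lam) * evolving_logits.softmax(-1)
    return probs.argmax(dim=-1)


if __name__ == "__main__":
    B, N, D, C = 4, 196, 512, 200               # batch, patches, embed dim, classes
    patches = torch.randn(B, N, D)               # stand-in patch embeddings
    cls_tok = F.normalize(torch.randn(B, D), dim=-1)
    pool = SaliencyAttentionPool(D)
    fused = fuse_coarse_fine(cls_tok, pool(patches))
    text_classifier = F.normalize(torch.randn(C, D), dim=-1)  # frozen text prior
    prior_logits = cls_tok @ text_classifier.t()               # (B, C)
    evolving_logits = fused @ text_classifier.t()              # (B, C)
    print(aggregate_pseudo_labels(prior_logits, evolving_logits).shape)  # (B,)
```

The fixed weight `lam` stands in for the Dynamic Knowledge Aggregation coefficient; in the paper's framing the fixed LLM/CLIP priors are combined with TokenFusion's evolving logits to iteratively refine pseudo-labels, so in practice this weight would be scheduled or tuned rather than left constant.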
Related papers
- SuperCLIP: CLIP with Simple Classification Supervision [88.86549733903314]
Contrastive Language-Image Pretraining achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. Recent findings show that CLIP-like models still underutilize fine-grained semantic signals in text. We propose SuperCLIP, a framework that augments contrastive learning with classification-based supervision.
arXiv Detail & Related papers (2025-12-16T15:11:53Z) - $β$-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment [53.42377319350806]
$β$-CLIP is a multi-granular text-conditioned contrastive learning framework. $β$-CAL addresses the semantic overlap inherent in this hierarchy. $β$-CLIP establishes a robust, adaptive baseline for dense vision-language correspondence.
arXiv Detail & Related papers (2025-12-14T13:03:20Z) - Bridge Feature Matching and Cross-Modal Alignment with Mutual-filtering for Zero-shot Anomaly Detection [25.349261412750586]
This study introduces FiSeCLIP for ZSAD with training-free CLIP, combining feature matching with cross-modal alignment. Our approach exhibits superior performance for both anomaly classification and segmentation on anomaly detection benchmarks.
arXiv Detail & Related papers (2025-07-15T05:42:17Z) - Partial CLIP is Enough: Chimera-Seg for Zero-shot Semantic Segmentation [55.486872677160015]
We propose Chimera-Seg, which integrates a segmentation backbone as the body and a CLIP-based semantic head as the head. Specifically, Chimera-Seg comprises a trainable segmentation model and a CLIP Semantic Head (CSH), which maps dense features into the CLIP-aligned space. We also propose Selective Global Distillation (SGD), which distills knowledge from dense features exhibiting high similarity to the CLIP CLS token.
arXiv Detail & Related papers (2025-06-27T09:26:50Z) - CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image Collections [22.32157080294386]
We propose a label-free prompt-tuning method to enhance CLIP-based image classification performance using unlabeled images. Our framework, NoLA (No Labels Attached), achieves an average absolute gain of 3.6% over the state-of-the-art LaFTer.
arXiv Detail & Related papers (2024-11-28T19:48:54Z) - Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation [19.749490092520006]
Self-Calibrated CLIP (SC-CLIP) is a training-free method that calibrates CLIP to produce finer representations. SC-CLIP boosts the performance of vanilla CLIP ViT-L/14 by 6.8 times.
arXiv Detail & Related papers (2024-11-24T15:14:05Z) - African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification [53.89380284760555]
$\texttt{FOCI}$ (Fine-grained Object ClassIfication) is a difficult multiple-choice benchmark for fine-grained object classification.
$\texttt{FOCI}$ complements five popular classification datasets with four domain-specific subsets from ImageNet-21k.
arXiv Detail & Related papers (2024-06-20T16:59:39Z) - Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation [72.47110803885235]
We introduce a novel framework named Cascade-CLIP for zero-shot semantic segmentation.
Our framework achieves superior zero-shot performance on segmentation benchmarks like COCO-Stuff, Pascal-VOC, and Pascal-Context.
arXiv Detail & Related papers (2024-06-02T08:32:51Z) - Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens [57.37893387775829]
We introduce a fast and balanced clustering method, named Semantic Equitable Clustering (SEC). SEC clusters tokens based on their global semantic relevance in an efficient, straightforward manner. We propose a versatile vision backbone, SECViT, to serve as a vision-language connector.
arXiv Detail & Related papers (2024-05-22T04:49:00Z) - CLIP Is Also a Good Teacher: A New Learning Framework for Inductive Zero-shot Semantic Segmentation [6.181169909576527]
Generalized Zero-shot Semantic Segmentation aims to segment both seen and unseen categories under supervision from the seen categories only.
Existing methods adopt large-scale Vision-Language Models (VLMs), which obtain outstanding zero-shot performance.
We propose CLIP-ZSS (Zero-shot Semantic Segmentation), a training framework that enables any image encoder designed for closed-set segmentation to be applied to zero-shot and open-vocabulary tasks.
arXiv Detail & Related papers (2023-10-03T09:33:47Z) - Improving Zero-Shot Generalization for CLIP with Synthesized Prompts [135.4317555866831]
Most existing methods require labeled data for all classes, which may not hold in real-world applications.
We propose a plug-and-play generative approach called SyntHesIzed Prompts (SHIP) to improve existing fine-tuning methods.
arXiv Detail & Related papers (2023-07-14T15:15:45Z) - [CLS] Token is All You Need for Zero-Shot Semantic Segmentation [60.06653755695356]
We propose an embarrassingly simple yet highly effective zero-shot semantic segmentation (ZS3) method, based on the pre-trained vision-language model CLIP.
Specifically, we use the [CLS] token output from the text branch, as an auxiliary semantic prompt, to replace the navigation [CLS] token in shallow layers of the ViT-based visual encoder.
Our proposed ZS3 method achieves SOTA performance and is even comparable with few-shot semantic segmentation methods.
arXiv Detail & Related papers (2023-04-13T01:35:07Z)