Decoupling Template Bias in CLIP: Harnessing Empty Prompts for Enhanced Few-Shot Learning
- URL: http://arxiv.org/abs/2512.08606v2
- Date: Wed, 10 Dec 2025 03:56:26 GMT
- Authors: Zhenyu Zhang, Guangyao Chen, Yixiong Zou, Zhimeng Huang, Yuhua Li,
- Abstract summary: The Contrastive Language-Image Pre-Training (CLIP) model excels in few-shot learning by aligning visual and textual representations. Our study shows that template-sample similarity (TSS), defined as the resemblance between a text template and an image sample, introduces bias. This bias leads the model to rely on template proximity rather than true sample-to-category alignment. We present a framework that uses empty prompts, textual inputs that convey the idea of "emptiness" without category information.
- Score: 28.14553408545859
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Contrastive Language-Image Pre-Training (CLIP) model excels in few-shot learning by aligning visual and textual representations. Our study shows that template-sample similarity (TSS), defined as the resemblance between a text template and an image sample, introduces bias. This bias leads the model to rely on template proximity rather than true sample-to-category alignment, reducing both accuracy and robustness in classification. We present a framework that uses empty prompts, textual inputs that convey the idea of "emptiness" without category information. These prompts capture unbiased template features and offset TSS bias. The framework employs two stages. During pre-training, empty prompts reveal and reduce template-induced bias within the CLIP encoder. During few-shot fine-tuning, a bias calibration loss enforces correct alignment between images and their categories, ensuring the model focuses on relevant visual cues. Experiments across multiple benchmarks demonstrate that our template correction method significantly reduces performance fluctuations caused by TSS, yielding higher classification accuracy and stronger robustness. The repository of this project is available at https://github.com/zhenyuZ-HUST/Decoupling-Template-Bias-in-CLIP.
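The abstract defines TSS as the resemblance between a text template and an image sample, and says that empty prompts capture the template's contribution. The toy sketch below illustrates that idea only; it is not the paper's two-stage framework or its bias calibration loss. It assumes cosine similarity as the resemblance measure and, as a swapped-in stand-in for the correction step, projects the empty-prompt direction out of each class text embedding. All function names are hypothetical, and the vectors stand in for CLIP embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def template_sample_similarity(image_feat, empty_prompt_feat):
    """Assumed TSS measure: cosine similarity between an image embedding
    and the embedding of an empty (category-free) prompt."""
    return cosine(image_feat, empty_prompt_feat)


def remove_template_component(class_text_feat, empty_prompt_feat):
    """Illustrative correction (an assumption, not the authors' method):
    subtract the projection of the class text embedding onto the
    empty-prompt direction, keeping only the orthogonal residual."""
    e = normalize(empty_prompt_feat)
    proj = sum(a * b for a, b in zip(class_text_feat, e))
    return [a - proj * b for a, b in zip(class_text_feat, e)]


def debiased_scores(image_feat, class_text_feats, empty_prompt_feat):
    """Score an image against each class after removing the template component."""
    return [
        cosine(image_feat, remove_template_component(t, empty_prompt_feat))
        for t in class_text_feats
    ]


# Toy data: the first axis plays the role of the shared template direction.
image = [1.0, 0.2, 0.1]
empty_prompt = [1.0, 0.0, 0.0]
class_texts = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]

tss = template_sample_similarity(image, empty_prompt)
scores = debiased_scores(image, class_texts, empty_prompt)
print("TSS:", tss)
print("Debiased class scores:", scores)
```

In this toy setup the image is dominated by the template direction, so its raw similarity to every class prompt is high; after the projection, the scores depend only on the class-specific components.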
Related papers
- Explicit Uncertainty Modeling for Active CLIP Adaptation with Dual Prompt Tuning [51.99383151474742]
We propose a robust uncertainty modeling framework for active CLIP adaptation based on dual-prompt tuning. We show that our method consistently outperforms existing active learning methods under the same annotation budget.
arXiv Detail & Related papers (2026-02-04T09:01:55Z)
- Class-Aware Prototype Learning with Negative Contrast for Test-Time Adaptation of Vision-Language Models [48.61795272482598]
Vision-Language Models (VLMs) demonstrate impressive zero-shot generalization through large-scale image-text pretraining. But their performance can drop once the deployment distribution diverges from the training distribution. Test-Time Adaptation (TTA) methods update models using unlabeled target data. We propose Class-Aware Prototype Learning with Negative Contrast (CPL-NC), a lightweight TTA framework.
arXiv Detail & Related papers (2025-10-22T17:38:35Z)
- Post-pre-training for Modality Alignment in Vision-Language Foundation Models [12.110530026601968]
This paper presents CLIP-Refine, a post-pre-training method for CLIP models at a phase between pre-training and fine-tuning. It aims to align the feature space with 1 epoch training on small image-text datasets without zero-shot performance degradations.
arXiv Detail & Related papers (2025-04-17T07:46:19Z)
- Extract Free Dense Misalignment from CLIP [7.0247398611254175]
This work proposes a novel approach, dubbed CLIP4DM, for detecting dense misalignments from pre-trained CLIP. We revamp the gradient-based attribution computation method, enabling negative gradient of individual text tokens to indicate misalignment. Our method demonstrates state-of-the-art performance among zero-shot models and competitive performance with fine-tuned models.
arXiv Detail & Related papers (2024-12-24T12:51:05Z)
- Removing Distributional Discrepancies in Captions Improves Image-Text Alignment [76.31530836622694]
We introduce a model designed to improve the prediction of image-text alignment.
Our approach focuses on generating high-quality training datasets for the alignment task.
We also demonstrate the applicability of our model by ranking the images generated by text-to-image models based on text alignment.
arXiv Detail & Related papers (2024-10-01T17:50:17Z)
- Mind Your Format: Towards Consistent Evaluation of In-Context Learning Improvements [10.687101698324897]
Large language models demonstrate a remarkable capability for learning to solve new tasks from a few examples.
The prompt template, or the way the input examples are formatted to obtain the prompt, is an important yet often overlooked aspect of in-context learning.
We show that a poor choice of the template can reduce the performance of the strongest models and inference methods to a random guess level.
arXiv Detail & Related papers (2024-01-12T18:58:26Z)
- Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback [20.78162037954646]
We introduce a decompositional approach towards evaluation and improvement of text-to-image alignment.
Human user studies indicate that the proposed approach surpasses previous state-of-the-art by 8.7% in overall text-to-image alignment accuracy.
arXiv Detail & Related papers (2023-07-10T17:54:57Z)
- ProTeCt: Prompt Tuning for Taxonomic Open Set Classification [59.59442518849203]
Few-shot adaptation methods do not fare well in the taxonomic open set (TOS) setting.
We propose a prompt tuning technique that calibrates the hierarchical consistency of model predictions.
A new Prompt Tuning for Hierarchical Consistency (ProTeCt) technique is then proposed to calibrate classification across label set granularities.
arXiv Detail & Related papers (2023-06-04T02:55:25Z)
- Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models [76.410400238974]
We propose TTA with feedback to rectify the model output and prevent the model from becoming blindly confident.
A CLIP model is adopted as the reward model during TTA and provides feedback for the VLM.
The proposed reinforcement learning with CLIP feedback (RLCF) framework is highly flexible and universal.
arXiv Detail & Related papers (2023-05-29T11:03:59Z)
- No Token Left Behind: Explainability-Aided Image Classification and Generation [79.4957965474334]
We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input.
Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
arXiv Detail & Related papers (2022-04-11T07:16:39Z)
- One-bit Supervision for Image Classification [121.87598671087494]
One-bit supervision is a novel setting of learning from incomplete annotations.
We propose a multi-stage training paradigm which incorporates negative label suppression into an off-the-shelf semi-supervised learning algorithm.
arXiv Detail & Related papers (2020-09-14T03:06:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.