Explicit Uncertainty Modeling for Active CLIP Adaptation with Dual Prompt Tuning
- URL: http://arxiv.org/abs/2602.04340v1
- Date: Wed, 04 Feb 2026 09:01:55 GMT
- Title: Explicit Uncertainty Modeling for Active CLIP Adaptation with Dual Prompt Tuning
- Authors: Qian-Wei Wang, Yaguang Song, Shu-Tao Xia
- Abstract summary: We propose a robust uncertainty modeling framework for active CLIP adaptation based on dual-prompt tuning. We show that our method consistently outperforms existing active learning methods under the same annotation budget.
- Score: 51.99383151474742
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Pre-trained vision-language models such as CLIP exhibit strong transferability, yet adapting them to downstream image classification tasks under limited annotation budgets remains challenging. In active learning settings, the model must select the most informative samples for annotation from a large pool of unlabeled data. Existing approaches typically estimate uncertainty via entropy-based criteria or representation clustering, without explicitly modeling uncertainty from the model's perspective. In this work, we propose a robust uncertainty modeling framework for active CLIP adaptation based on dual-prompt tuning. We introduce two learnable prompts in the textual branch of CLIP. The positive prompt enhances the discriminability of task-specific textual embeddings corresponding to lightweight-tuned visual embeddings, improving classification reliability. Meanwhile, the negative prompt is trained in a reversed manner to explicitly model the probability that the predicted label is correct, providing a principled uncertainty signal for guiding active sample selection. Extensive experiments across different fine-tuning paradigms demonstrate that our method consistently outperforms existing active learning methods under the same annotation budget.
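The abstract gives no implementation details, so the following is a minimal sketch of how a dual-prompt uncertainty score could drive active selection. The embeddings are random stand-ins for CLIP features; the temperature `tau`, the scoring rule, and all function names are assumptions, not the authors' method.

```python
import torch
import torch.nn.functional as F

def uncertainty_scores(img_emb, pos_txt_emb, neg_txt_emb, tau=0.01):
    """Score unlabeled samples; higher = more informative to annotate."""
    img_emb = F.normalize(img_emb, dim=-1)
    pos = F.normalize(pos_txt_emb, dim=-1)  # positive-prompt class embeddings
    neg = F.normalize(neg_txt_emb, dim=-1)  # negative-prompt class embeddings
    p_cls = (img_emb @ pos.t() / tau).softmax(dim=-1)  # class posterior
    conf, pred = p_cls.max(dim=-1)                     # predicted label
    # Assumed scoring rule: the negative prompt's probability mass on the
    # predicted class acts as a learned "this prediction is unreliable" signal.
    p_unrel = (img_emb @ neg.t() / tau).softmax(dim=-1).gather(1, pred[:, None]).squeeze(1)
    return p_unrel - conf  # high score: low confidence, high unreliability

# Toy usage: query the 8 most uncertain samples from a pool of 1000.
pool = torch.randn(1000, 512)
scores = uncertainty_scores(pool, torch.randn(10, 512), torch.randn(10, 512))
query_idx = scores.topk(8).indices  # indices to send to the annotator
```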
Related papers
- Fine-tuning Pre-trained Vision-Language Models in a Human-Annotation-Free Manner [46.140724013144194]
Large-scale vision-language models (VLMs) such as CLIP exhibit strong zero-shot generalization, but adapting them to downstream tasks typically requires costly labeled data. Existing unsupervised self-training methods rely on pseudo-labeling, yet often suffer from unreliable confidence filtering, confirmation bias, and underutilization of low-confidence samples. We propose Collaborative Fine-Tuning (CoFT), an unsupervised adaptation framework that leverages unlabeled data through a dual-model, cross-modal collaboration mechanism.
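The summary leaves CoFT's collaboration mechanism unspecified; below is a hedged sketch of one plausible ingredient, a dual-model agreement filter for pseudo-labels. The threshold and agreement rule are guesses for illustration, not the paper's design.

```python
import torch

def agree_pseudo_labels(logits_a, logits_b, thresh=0.8):
    """Keep pseudo-labels only where two models agree with high confidence."""
    pa, ya = logits_a.softmax(-1).max(-1)
    pb, yb = logits_b.softmax(-1).max(-1)
    keep = (ya == yb) & (torch.minimum(pa, pb) > thresh)
    return ya, keep  # candidate labels and a boolean retention mask

labels, mask = agree_pseudo_labels(torch.randn(64, 10), torch.randn(64, 10))
```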
arXiv Detail & Related papers (2026-02-04T09:00:12Z) - BayesAdapter: enhanced uncertainty estimation in CLIP few-shot adaptation [30.435971066422706]
We show that the discriminative performance of state-of-the-art CLIP adapters does not always correlate with their uncertainty estimation capabilities. We introduce BayesAdapter, which leverages Bayesian inference to estimate a full probability distribution instead of a single point. Our approach obtains high-quality uncertainty estimates in the predictions, standing out in calibration and selective classification.
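As a concrete illustration of the distribution-over-point idea, here is a sketch of a linear adapter with a Gaussian variational posterior over its weights, where Monte Carlo draws yield a predictive distribution and an entropy-based uncertainty. None of this is BayesAdapter's actual code.

```python
import torch
import torch.nn as nn

class BayesLinearAdapter(nn.Module):
    def __init__(self, dim, n_cls):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(n_cls, dim))
        self.log_sigma = nn.Parameter(torch.full((n_cls, dim), -3.0))

    def forward(self, feats, n_samples=16):
        # Reparameterization trick: W = mu + sigma * eps, one logit set per draw.
        eps = torch.randn(n_samples, *self.mu.shape)
        W = self.mu + self.log_sigma.exp() * eps       # (S, C, D)
        return torch.einsum("nd,scd->snc", feats, W)   # (S, N, C)

adapter = BayesLinearAdapter(dim=512, n_cls=10)
probs = adapter(torch.randn(4, 512)).softmax(-1)       # (16, 4, 10)
mean_p = probs.mean(0)                                 # predictive mean
entropy = -(mean_p * mean_p.log()).sum(-1)             # uncertainty signal
```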
arXiv Detail & Related papers (2024-12-12T20:48:06Z) - Unconditional Truthfulness: Learning Unconditional Uncertainty of Large Language Models [104.55763564037831]
We train a regression model that leverages attention maps, probabilities on the current generation step, and recurrently computed uncertainty scores from previously generated tokens. Our evaluation shows that the proposed method is highly effective for selective generation, achieving substantial improvements over rival unsupervised and supervised approaches.
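A minimal sketch of that supervised recipe follows; the feature choices (token probability, attention entropy, previous score) are illustrative guesses, not the paper's feature set.

```python
import torch
import torch.nn as nn

class UncertaintyRegressor(nn.Module):
    """Maps per-step generation features to a scalar uncertainty score."""
    def __init__(self, n_feats=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_feats, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, token_prob, attn_entropy, prev_score):
        x = torch.stack([token_prob, attn_entropy, prev_score], dim=-1)
        return self.net(x).squeeze(-1)  # feed back as prev_score next step

reg = UncertaintyRegressor()
step0 = reg(torch.rand(8), torch.rand(8), torch.zeros(8))  # batch of 8 tokens
```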
arXiv Detail & Related papers (2024-08-20T09:42:26Z) - Adversarial Robustification via Text-to-Image Diffusion Models [56.37291240867549]
Adversarial robustness has conventionally been regarded as a challenging property to encode into neural networks.
We develop a scalable and model-agnostic solution to achieve adversarial robustness without using any data.
arXiv Detail & Related papers (2024-07-26T10:49:14Z) - BaFTA: Backprop-Free Test-Time Adaptation For Zero-Shot Vision-Language Models [20.88680592729709]
We propose BaFTA, a novel backpropagation-free algorithm for test-time adaptation of vision-language models.
BaFTA directly estimates class centroids using online clustering within a projected embedding space.
We demonstrate that BaFTA consistently outperforms state-of-the-art test-time adaptation methods in both effectiveness and efficiency.
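A minimal sketch of backprop-free adaptation via online clustering, under assumptions: centroids are initialized from text embeddings (random stand-ins here) and refined with an incremental-mean update from confidently pseudo-labeled test embeddings. The confidence threshold and the projection step are guessed or omitted.

```python
import torch
import torch.nn.functional as F

def update_centroids(centroids, counts, emb, tau=0.01, thresh=0.6):
    """One gradient-free update from a batch of test embeddings."""
    emb = F.normalize(emb, dim=-1)
    probs = (emb @ F.normalize(centroids, dim=-1).t() / tau).softmax(-1)
    conf, pred = probs.max(-1)
    for x, c, ok in zip(emb, pred, conf > thresh):
        if ok:  # incremental mean: no backprop anywhere
            counts[c] += 1
            centroids[c] += (x - centroids[c]) / counts[c]
    return centroids, counts

centroids, counts = torch.randn(10, 512), torch.ones(10)  # 10 classes
centroids, counts = update_centroids(centroids, counts, torch.randn(32, 512))
```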
arXiv Detail & Related papers (2024-06-17T08:16:24Z) - Spurious Feature Eraser: Stabilizing Test-Time Adaptation for Vision-Language Foundation Model [86.9619638550683]
Vision-language foundation models have exhibited remarkable success across a multitude of downstream tasks due to their scalability on extensive image-text paired data. However, these models display significant limitations when applied to downstream tasks, such as fine-grained image classification, as a result of "decision shortcuts".
arXiv Detail & Related papers (2024-03-01T09:01:53Z) - Self-regulating Prompts: Foundational Model Adaptation without Forgetting [112.66832145320434]
We introduce a self-regularization framework for prompting called PromptSRC.
PromptSRC guides the prompts to optimize for both task-specific and task-agnostic general representations.
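One way to read "task-specific plus task-agnostic" is a regularized objective that anchors prompted features to the frozen zero-shot features. The sketch below assumes that reading; the loss form and weight `lam` are illustrative, not PromptSRC's exact formulation.

```python
import torch
import torch.nn.functional as F

def self_reg_loss(prompted_feat, frozen_feat, logits, labels, lam=1.0):
    task = F.cross_entropy(logits, labels)  # fit the downstream task
    # Anchor term: keep prompted features close to frozen CLIP features,
    # discouraging the prompts from forgetting general knowledge.
    anchor = 1 - F.cosine_similarity(prompted_feat, frozen_feat, dim=-1).mean()
    return task + lam * anchor

loss = self_reg_loss(torch.randn(8, 512, requires_grad=True),
                     torch.randn(8, 512), torch.randn(8, 10),
                     torch.randint(0, 10, (8,)))
```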
arXiv Detail & Related papers (2023-07-13T17:59:35Z) - CLIPood: Generalizing CLIP to Out-of-Distributions [73.86353105017076]
Contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, but further adaptation of CLIP to downstream tasks undesirably degrades out-of-distribution (OOD) performance.
We propose CLIPood, a fine-tuning method that can adapt CLIP models to OOD situations where both domain shifts and open classes may occur on unseen test data.
Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
arXiv Detail & Related papers (2023-02-02T04:27:54Z) - Post-hoc Uncertainty Learning using a Dirichlet Meta-Model [28.522673618527417]
We propose a novel Bayesian meta-model to augment pre-trained models with better uncertainty quantification abilities.
Our proposed method requires no additional training data and is flexible enough to quantify different uncertainties.
We demonstrate the flexibility and superior empirical performance of our meta-model approach across a range of applications.
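A minimal sketch of a Dirichlet-style meta-model, under assumptions: a small head maps frozen features to concentration parameters, and total evidence separates confident from uncertain inputs. This is illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DirichletMetaModel(nn.Module):
    """Post-hoc head: frozen features -> Dirichlet concentrations alpha."""
    def __init__(self, dim, n_cls):
        super().__init__()
        self.head = nn.Linear(dim, n_cls)

    def forward(self, feats):
        alpha = self.head(feats).exp() + 1.0         # concentrations > 1
        probs = alpha / alpha.sum(-1, keepdim=True)  # Dirichlet mean
        uncert = alpha.shape[-1] / alpha.sum(-1)     # K / total evidence
        return probs, uncert

meta = DirichletMetaModel(512, 10)
probs, uncert = meta(torch.randn(4, 512))  # predictions + uncertainty
```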
arXiv Detail & Related papers (2022-12-14T17:34:11Z) - Constraining Representations Yields Models That Know What They Don't Know [2.729898906885749]
A well-known failure mode of neural networks is that they may confidently return erroneous predictions.
This work presents a novel direction to address these issues in a broad, general manner.
We assign to each class a unique, fixed, randomly generated binary vector, hereafter called a class code. We train the model so that its cross-depth activation patterns predict the appropriate class code for the input sample's class.
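The class-code idea lends itself to a compact sketch: fixed random codes, bitwise predictions, and distance-based classification with rejection. The distance threshold and shapes here are assumptions for illustration.

```python
import torch

torch.manual_seed(0)
n_cls, code_len = 10, 64
codes = torch.randint(0, 2, (n_cls, code_len)).float()  # fixed class codes

def classify_with_rejection(pred_bits, max_dist=12.0):
    """pred_bits: (N, code_len) in [0, 1], e.g. sigmoid outputs."""
    dist = torch.cdist(pred_bits, codes, p=1)  # soft Hamming distance
    best, label = dist.min(dim=-1)
    label[best > max_dist] = -1  # no code is close enough: abstain
    return label

labels = classify_with_rejection(torch.rand(5, code_len))
```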
arXiv Detail & Related papers (2022-08-30T18:28:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.