O-TPT: Orthogonality Constraints for Calibrating Test-time Prompt Tuning in Vision-Language Models
- URL: http://arxiv.org/abs/2503.12096v1
- Date: Sat, 15 Mar 2025 11:45:54 GMT
- Title: O-TPT: Orthogonality Constraints for Calibrating Test-time Prompt Tuning in Vision-Language Models
- Authors: Ashshak Sharifdeen, Muhammad Akhtar Munir, Sanoojan Baliah, Salman Khan, Muhammad Haris Khan,
- Abstract summary: Test-time prompt tuning for vision-language models (VLMs) is getting attention because of their ability to learn with unlabeled data without fine-tuning.<n>The resulting models tend to demonstrate poor calibration, which casts doubts on the reliability and trustworthiness of these models.<n>We propose a new approach, called O-TPT, that introduces orthogonality constraints on the textual features corresponding to the learnable prompts.
- Score: 17.56932003351322
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Test-time prompt tuning for vision-language models (VLMs) is getting attention because of their ability to learn with unlabeled data without fine-tuning. Although test-time prompt tuning methods for VLMs can boost accuracy, the resulting models tend to demonstrate poor calibration, which casts doubts on the reliability and trustworthiness of these models. Notably, more attention needs to be devoted to calibrating the test-time prompt tuning in vision-language models. To this end, we propose a new approach, called O-TPT that introduces orthogonality constraints on the textual features corresponding to the learnable prompts for calibrating test-time prompt tuning in VLMs. Towards introducing orthogonality constraints, we make the following contributions. First, we uncover new insights behind the suboptimal calibration performance of existing methods relying on textual feature dispersion. Second, we show that imposing a simple orthogonalization of textual features is a more effective approach towards obtaining textual dispersion. We conduct extensive experiments on various datasets with different backbones and baselines. The results indicate that our method consistently outperforms the prior state of the art in significantly reducing the overall average calibration error. Also, our method surpasses the zero-shot calibration performance on fine-grained classification tasks.
Related papers
- Noise is an Efficient Learner for Zero-Shot Vision-Language Models [9.519280130389935]
Test-Time Noise Tuning is a novel method for handling unpredictable shifts in the visual space.<n>We introduce a novel approach for inter-view representation alignment by explicitly enforcing coherence in embedding.<n>These improvements lay a strong foundation for adaptive out-of-distribution handling.
arXiv Detail & Related papers (2025-02-09T20:25:42Z) - Feature Clipping for Uncertainty Calibration [24.465567005078135]
Modern deep neural networks (DNNs) often suffer from overconfidence, leading to miscalibration.
We propose a novel post-hoc calibration method called feature clipping (FC) to address this issue.
FC involves clipping feature values to a specified threshold, effectively increasing entropy in high calibration error samples.
arXiv Detail & Related papers (2024-10-16T06:44:35Z) - BaFTA: Backprop-Free Test-Time Adaptation For Zero-Shot Vision-Language Models [20.88680592729709]
We propose a novel backpropagation-free algorithm BaFTA for test-time adaptation of vision-language models.
BaFTA directly estimates class centroids using online clustering within a projected embedding space.
We demonstrate that BaFTA consistently outperforms state-of-the-art test-time adaptation methods in both effectiveness and efficiency.
arXiv Detail & Related papers (2024-06-17T08:16:24Z) - C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion [54.81141583427542]
In deep learning, test-time adaptation has gained attention as a method for model fine-tuning without the need for labeled data.
This paper explores calibration during test-time prompt tuning by leveraging the inherent properties of CLIP.
We present a novel method, Calibrated Test-time Prompt Tuning (C-TPT), for optimizing prompts during test-time with enhanced calibration.
arXiv Detail & Related papers (2024-03-21T04:08:29Z) - Calibrating Large Language Models with Sample Consistency [76.23956851098598]
We explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency.
Results show that consistency-based calibration methods outperform existing post-hoc approaches.
We offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs.
arXiv Detail & Related papers (2024-02-21T16:15:20Z) - Open-Vocabulary Calibration for Fine-tuned CLIP [44.82453633696438]
The confidence calibration problem in fine-tuned vision-language models (VLMs) could greatly reduce reliability when deploying such models in the real world.
This paper bridges the gap by systematically investigating the confidence calibration problem in the context of prompt learning.
We present a simple and effective approach called Distance-Aware (DAC), which is based on scaling the temperature using as guidance the distance between predicted text labels and base classes.
arXiv Detail & Related papers (2024-02-07T08:42:48Z) - Towards Calibrated Robust Fine-Tuning of Vision-Language Models [97.19901765814431]
This work proposes a robust fine-tuning method that improves both OOD accuracy and confidence calibration simultaneously in vision language models.
We show that both OOD classification and OOD calibration errors have a shared upper bound consisting of two terms of ID data.
Based on this insight, we design a novel framework that conducts fine-tuning with a constrained multimodal contrastive loss enforcing a larger smallest singular value.
arXiv Detail & Related papers (2023-11-03T05:41:25Z) - On Calibrating Semantic Segmentation Models: Analyses and An Algorithm [51.85289816613351]
We study the problem of semantic segmentation calibration.
Model capacity, crop size, multi-scale testing, and prediction correctness have impact on calibration.
We propose a simple, unifying, and effective approach, namely selective scaling.
arXiv Detail & Related papers (2022-12-22T22:05:16Z) - CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models [101.5066760592534]
We present Cross-modal Prompt Tuning (CPT), a novel paradigm for tuning Vision-Language Models (VL-PTMs)
CPT reformulates visual grounding into a fill-in-the-blank problem with color-based co-referential markers in image and text, maximally mitigating the gap.
Comprehensive experimental results show that prompt tuned VL-PTMs outperform their fine-tuned counterparts by a large margin.
arXiv Detail & Related papers (2021-09-24T08:07:29Z) - Uncertainty Quantification and Deep Ensembles [79.4957965474334]
We show that deep-ensembles do not necessarily lead to improved calibration properties.
We show that standard ensembling methods, when used in conjunction with modern techniques such as mixup regularization, can lead to less calibrated models.
This text examines the interplay between three of the most simple and commonly used approaches to leverage deep learning when data is scarce.
arXiv Detail & Related papers (2020-07-17T07:32:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.