Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models
- URL: http://arxiv.org/abs/2305.18010v2
- Date: Wed, 21 Feb 2024 06:25:33 GMT
- Title: Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models
- Authors: Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yi Yang
- Abstract summary: We propose TTA with feedback to rectify the model output and prevent the model from becoming blindly confident.
A CLIP model is adopted as the reward model during TTA and provides feedback for the VLM.
The proposed reinforcement learning with CLIP feedback (RLCF) framework is highly flexible and universal.
- Score: 76.410400238974
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: One fascinating aspect of pre-trained vision-language models (VLMs) learning
under language supervision is their impressive zero-shot generalization
capability. However, this ability is hindered by distribution shifts between
the training and testing data. Previous test-time adaptation (TTA) methods for
VLMs in zero-shot classification rely on minimizing the entropy of model
outputs, tending to be stuck in incorrect model predictions. In this work, we
propose TTA with feedback to rectify the model output and prevent the model
from becoming blindly confident. Specifically, a CLIP model is adopted as the
reward model during TTA and provides feedback for the VLM. Given a single test
sample, the VLM is forced to maximize the CLIP reward between the input and
sampled results from the VLM output distribution. The proposed
reinforcement learning with CLIP feedback (RLCF) framework is highly
flexible and universal. Beyond the classification task, with task-specific
sampling strategies and a proper reward baseline choice, RLCF can be easily
extended to not only discrimination tasks like retrieval but also
generalization tasks like image captioning, improving the zero-shot
generalization capacity of VLMs. According to the characteristics of these VL
tasks, we build different fully TTA pipelines with RLCF to improve the
zero-shot generalization ability of various VLMs. Extensive experiments along
with promising empirical results demonstrate the effectiveness of RLCF. The
code is available at https://github.com/mzhaoshuai/RLCF.
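To make the mechanism concrete, the following is a minimal, hedged sketch of one RLCF-style test-time adaptation step: sample predictions from the VLM's output distribution, score them with a frozen reward model, subtract a baseline, and take a REINFORCE step on the adapted parameters. This is an illustration under stated assumptions, not the authors' released implementation: the two linear modules stand in for the adapted VLM and the frozen CLIP reward model, the mean-over-samples baseline is only one possible choice, and all names and hyperparameters are placeholders.

```python
# Hedged sketch of one RLCF-style TTA step (stand-in modules, illustrative names).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_classes, feat_dim, k_samples = 10, 512, 3

# Stand-in for the VLM being adapted; in practice only a small set of parameters
# (e.g. prompts or normalization layers) might be updated, which may differ here.
policy = torch.nn.Linear(feat_dim, num_classes)
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-3)

# Stand-in for the frozen CLIP reward model: maps an image feature to a
# similarity score for every class prompt.
reward_head = torch.nn.Linear(feat_dim, num_classes)
for p in reward_head.parameters():
    p.requires_grad_(False)

def clip_reward(image_feat: torch.Tensor, class_ids: torch.Tensor) -> torch.Tensor:
    """CLIP-style image-text similarity for the sampled classes (no gradient)."""
    with torch.no_grad():
        sims = reward_head(image_feat)   # [num_classes]
        return sims[class_ids]           # [k_samples]

image_feat = torch.randn(feat_dim)       # a single test sample (feature placeholder)

# --- one test-time adaptation step on this single sample ---
logits = policy(image_feat)
log_probs = F.log_softmax(logits, dim=-1)
samples = torch.multinomial(log_probs.exp(), k_samples, replacement=True)

rewards = clip_reward(image_feat, samples)
baseline = rewards.mean()                # simple mean baseline; other choices are possible

# REINFORCE: raise the log-probability of samples that CLIP scores above the baseline.
loss = -((rewards - baseline) * log_probs[samples]).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"adaptation step done, loss = {loss.item():.4f}")
```

In the paper's pipelines the reward comes from a frozen CLIP model scoring image-text similarity between the input and the sampled outputs; the task-specific sampling strategy and reward baseline are what make the same recipe applicable to classification, retrieval, and captioning.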
Related papers
- Efficient and Long-Tailed Generalization for Pre-trained Vision-Language Model [43.738677778740325]
We propose a novel framework, termed Candle, to achieve efficient and long-tailed generalization.
Candle achieves state-of-the-art performance over extensive experiments on 11 diverse datasets.
arXiv Detail & Related papers (2024-06-18T14:07:13Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label-smoothing value during training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
- Aligning Large Language Models via Fine-grained Supervision [20.35000061196631]
Pre-trained large-scale language models (LLMs) excel at producing coherent articles, yet their outputs may be untruthful, toxic, or fail to align with user expectations.
Current approaches focus on using reinforcement learning with human feedback to improve model alignment.
We propose a method to enhance LLM alignment through fine-grained token-level supervision.
arXiv Detail & Related papers (2024-06-04T20:21:45Z)
- Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models [55.5610165938949]
Fine-tuning vision-language models (VLMs) has gained increasing popularity due to its practical value.
This paper explores the collaborative potential of leveraging much weaker VLMs to enhance the generalization of a robust single model.
We introduce three customized ensemble strategies, each tailored to one specific scenario.
The proposed ensemble strategies are evaluated on zero-shot, base-to-new, and cross-dataset generalization, achieving new state-of-the-art performance.
arXiv Detail & Related papers (2023-11-28T05:17:25Z)
- Robust Fine-Tuning of Vision-Language Models for Domain Generalization [6.7181844004432385]
Foundation models have impressive zero-shot inference capabilities and robustness under distribution shifts.
We present a new recipe for few-shot fine-tuning of the popular vision-language foundation model CLIP.
Our experimentation demonstrates that, while zero-shot CLIP fails to match the performance of trained vision models on more complex benchmarks, few-shot CLIP fine-tuning outperforms its vision-only counterparts.
arXiv Detail & Related papers (2023-11-03T20:50:40Z)
- CLIPood: Generalizing CLIP to Out-of-Distributions [73.86353105017076]
Contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, but the further adaptation of CLIP on downstream tasks undesirably degrades OOD performances.
We propose CLIPood, a fine-tuning method that can adapt CLIP models to OOD situations where both domain shifts and open classes may occur on unseen test data.
Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
arXiv Detail & Related papers (2023-02-02T04:27:54Z)
- Task Residual for Tuning Vision-Language Models [69.22958802711017]
We propose a new efficient tuning approach for vision-language models (VLMs) named Task Residual Tuning (TaskRes).
TaskRes explicitly decouples the prior knowledge of the pre-trained models and new knowledge regarding a target task.
The proposed TaskRes is simple yet effective, significantly outperforming previous methods on 11 benchmark datasets.
arXiv Detail & Related papers (2022-11-18T15:09:03Z)
- Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models [107.05966685291067]
We propose test-time prompt tuning (TPT) to learn adaptive prompts on the fly with a single test sample.
TPT improves the zero-shot top-1 accuracy of CLIP by 3.6% on average.
In evaluating cross-dataset generalization with unseen categories, TPT performs on par with the state-of-the-art approaches that use additional training data.
arXiv Detail & Related papers (2022-09-15T17:55:11Z)
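For contrast, the TPT entry just above represents the entropy-minimization style of test-time adaptation that the main abstract argues can become blindly confident on incorrect predictions. Below is a minimal, hedged sketch of that idea for a single test sample; the learnable "prompt" is reduced to a single offset vector on stand-in text features rather than real token-level context vectors, and all names, shapes, and hyperparameters are illustrative.

```python
# Hedged sketch of entropy-minimization test-time prompt tuning (TPT-style),
# with placeholder features instead of real CLIP encoders.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_classes, feat_dim = 10, 512

# Frozen features from a CLIP-like model (placeholders for real encoder outputs).
image_feat = F.normalize(torch.randn(feat_dim), dim=-1)                      # one test image
class_text_feats = F.normalize(torch.randn(num_classes, feat_dim), dim=-1)   # class embeddings

# The only adapted parameter: a toy "prompt" offset applied to the text features.
prompt_offset = torch.zeros(feat_dim, requires_grad=True)
optimizer = torch.optim.AdamW([prompt_offset], lr=5e-3)

for _ in range(10):  # a few adaptation steps on the single test sample
    text_feats = F.normalize(class_text_feats + prompt_offset, dim=-1)
    logits = 100.0 * image_feat @ text_feats.t()                             # CLIP-style cosine logits
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()                  # prediction entropy
    optimizer.zero_grad()
    entropy.backward()                                                       # minimize entropy w.r.t. the prompt
    optimizer.step()

print("predicted class:", logits.argmax().item())
```

Because this objective only sharpens whatever the model already predicts, it has no external signal to correct a wrong prediction; RLCF replaces it with feedback from a frozen CLIP reward model, as sketched after the abstract above.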
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.