Bidirectional Prototype-Reward co-Evolution for Test-Time Adaptation of Vision-Language Models
- URL: http://arxiv.org/abs/2503.09394v1
- Date: Wed, 12 Mar 2025 13:40:33 GMT
- Title: Bidirectional Prototype-Reward co-Evolution for Test-Time Adaptation of Vision-Language Models
- Authors: Xiaozhen Qiao, Peng Huang, Jiakang Yuan, Xianda Guo, Bowen Ye, Zhe Sun, Xuelong Li
- Abstract summary: Bidirectional Prototype-Reward co-Evolution (BPRE) is a novel TTA framework for Vision-Language Models (VLMs). BPRE integrates feature quality assessment with prototype evolution through a synergistic feedback loop, and consistently achieves superior average performance compared to state-of-the-art methods.
- Score: 39.238426311239564
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Test-time adaptation (TTA) is crucial for maintaining the performance of Vision-Language Models (VLMs) under real-world distribution shifts, particularly when the source data or target labels are inaccessible. Existing TTA methods rely on CLIP's output probability distribution for feature evaluation, which can introduce biases under domain shifts: text priors or incorrect textual associations may cause features to be misclassified. To address these limitations, we propose Bidirectional Prototype-Reward co-Evolution (BPRE), a novel TTA framework for VLMs that integrates feature quality assessment with prototype evolution through a synergistic feedback loop. BPRE first employs a Multi-Dimensional Quality-Aware Reward Module to evaluate feature quality and precisely guide prototype refinement. The continuous refinement of prototypes through Prototype-Reward Interactive Evolution in turn yields more robust Multi-Dimensional Quality-Aware Reward Scores. Through this bidirectional interaction, the precision of rewards and the evolution of prototypes mutually reinforce each other, forming a self-evolving cycle. Extensive experiments across 15 diverse recognition datasets, covering natural distribution shifts and cross-dataset generalization, show that BPRE consistently achieves superior average performance compared to state-of-the-art methods across model architectures such as ResNet-50 and ViT-B/16. By emphasizing comprehensive feature evaluation and bidirectional knowledge refinement, BPRE advances the generalization capabilities of VLMs and offers a new perspective on TTA.
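The abstract describes the loop only at a high level; below is a minimal, self-contained Python sketch of one plausible reading, in which a scalar reward gates an EMA prototype update and the refined prototypes feed the next reward computation. The concrete reward terms (a confidence-weighted mix of text and prototype similarity) and all names and hyperparameters are illustrative assumptions, not the paper's exact formulation.
```python
import numpy as np

def l2norm(x, axis=-1):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def quality_reward(feat, text_emb, protos, temp=0.01):
    """Toy multi-dimensional reward: text alignment, prototype
    alignment, and prediction confidence (assumed components)."""
    text_sim = text_emb @ feat                   # (C,) image-text similarity
    proto_sim = protos @ feat                    # (C,) image-prototype similarity
    logits = (text_sim + proto_sim) / (2 * temp)
    p = np.exp(logits - logits.max()); p /= p.sum()
    cls = int(p.argmax())
    reward = max(0.0, p.max() * 0.5 * (text_sim[cls] + proto_sim[cls]))
    return cls, reward

def adapt(stream, text_emb, lr=0.2):
    """Bidirectional loop: the reward gates the prototype update, and
    the refined prototypes sharpen the next reward computation."""
    protos = l2norm(text_emb.copy())             # init prototypes from text embeddings
    preds = []
    for feat in stream:
        feat = l2norm(feat)
        cls, r = quality_reward(feat, text_emb, protos)
        protos[cls] = l2norm((1 - lr * r) * protos[cls] + lr * r * feat)
        preds.append(cls)
    return preds, protos
```
Here `text_emb` would be class text embeddings of shape `(num_classes, d)` (e.g., from CLIP's text encoder) and `stream` an iterable of image features; reward precision and prototype quality then co-evolve over the test stream.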
Related papers
- DPU: Dynamic Prototype Updating for Multimodal Out-of-Distribution Detection [10.834698906236405]
Out-of-distribution (OOD) detection is essential for ensuring the robustness of machine learning models.
Recent advances in multimodal models have demonstrated the potential of leveraging multiple modalities to enhance detection performance.
We propose Dynamic Prototype Updating (DPU), a novel plug-and-play framework for multimodal OOD detection.
arXiv Detail & Related papers (2024-11-12T22:43:16Z)
- PCoTTA: Continual Test-Time Adaptation for Multi-Task Point Cloud Understanding [40.42904797189929]
We present PCoTTA, an innovative framework for Continual Test-Time Adaptation (CoTTA) in multi-task point cloud understanding.
Our PCoTTA involves three key components: automatic prototype mixture (APM), Gaussian Splatted feature shifting (GSFS), and contrastive prototype repulsion (CPR).
CPR pulls the nearest learnable prototype toward the test feature while pushing it away from the other prototypes, keeping the prototypes distinguishable during adaptation (see the sketch after this entry).
arXiv Detail & Related papers (2024-11-01T14:41:36Z)
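The exact CPR objective is not spelled out in the summary above; a minimal sketch under the assumption of an InfoNCE-style loss, with the nearest prototype as the positive and the remaining prototypes as negatives:
```python
import numpy as np

def cpr_loss(feat, protos, tau=0.1):
    """Contrastive prototype repulsion (assumed InfoNCE form):
    minimizing this pulls the nearest prototype toward the test
    feature and pushes it away from the other prototypes."""
    feat = feat / (np.linalg.norm(feat) + 1e-8)
    protos = protos / (np.linalg.norm(protos, axis=1, keepdims=True) + 1e-8)
    sims = protos @ feat / tau                   # (K,) scaled cosine similarities
    pos = int(sims.argmax())                     # nearest prototype is the positive
    logz = sims.max() + np.log(np.exp(sims - sims.max()).sum())
    return -(sims[pos] - logz)
```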
- Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models [11.545127156146368]
We introduce Dual Prototype Evolving (DPE), a novel test-time adaptation approach for pre-trained vision-language models (VLMs).
We create and evolve two sets of prototypes, textual and visual, to progressively capture more accurate multi-modal representations for target classes during test time (sketched after this entry).
Our proposed DPE consistently outperforms previous state-of-the-art methods while also exhibiting competitive computational efficiency.
arXiv Detail & Related papers (2024-10-16T17:59:49Z)
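As a rough illustration of the dual-prototype idea (not DPE's exact recipe), the sketch below fuses textual and visual prototype similarities and evolves the visual prototypes on confident samples via an EMA; the fusion weight, momentum, and confidence gate are assumed values, and the textual-prototype evolution is omitted for brevity.
```python
import numpy as np

def dpe_step(feat, text_protos, vis_protos, alpha=0.5, m=0.99, conf_th=0.6):
    """One test-time step with dual (textual + visual) prototypes."""
    feat = feat / (np.linalg.norm(feat) + 1e-8)
    logits = alpha * (text_protos @ feat) + (1 - alpha) * (vis_protos @ feat)
    p = np.exp(logits - logits.max()); p /= p.sum()
    c = int(p.argmax())
    if p.max() >= conf_th:                       # evolve only on confident samples
        vis_protos[c] = m * vis_protos[c] + (1 - m) * feat
        vis_protos[c] /= np.linalg.norm(vis_protos[c]) + 1e-8
    return c
```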
- MITA: Bridging the Gap between Model and Data for Test-time Adaptation [68.62509948690698]
Test-Time Adaptation (TTA) has emerged as a promising paradigm for enhancing the generalizability of models.
We propose MITA (Meet-In-The-Middle), which introduces energy-based optimization to encourage mutual adaptation of the model and the data from opposing directions.
arXiv Detail & Related papers (2024-10-12T07:02:33Z)
- Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models [19.683461002518147]
Test-Time Prototype Shifting (TPS) is a pioneering approach designed to adapt vision-language models to test datasets using unlabeled test inputs.
TPS not only facilitates optimization-free prototype reuse for subsequent predictions but also enables seamless integration with current advancements in prompt engineering (a rough sketch follows this entry).
A notable aspect of the framework is its significantly reduced memory and computational demands compared to conventional text-prompt tuning methods.
arXiv Detail & Related papers (2024-03-19T17:54:34Z)
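A minimal sketch of the prototype-shifting idea: a learned per-class shift is added to frozen prototypes in embedding space, after which the shifted prototypes are reused for predictions without further optimization. How the shift is learned (e.g., entropy minimization over augmented views) is left out, and the temperature and shapes are assumptions.
```python
import numpy as np

def tps_predict(feats, protos, shift, temp=0.01):
    """Classify test features against shifted prototypes."""
    p = protos + shift                                        # (K, d) shifted prototypes
    p = p / (np.linalg.norm(p, axis=1, keepdims=True) + 1e-8)
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    return f @ p.T / temp                                     # (N, K) logits, reusable as-is
```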
- CLIPood: Generalizing CLIP to Out-of-Distributions [73.86353105017076]
Contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, but further adaptation of CLIP to downstream tasks undesirably degrades OOD performance.
We propose CLIPood, a fine-tuning method that can adapt CLIP models to OOD situations where both domain shifts and open classes may occur on unseen test data.
Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
arXiv Detail & Related papers (2023-02-02T04:27:54Z)
- Coded Residual Transform for Generalizable Deep Metric Learning [34.100840501900706]
We introduce a new method called coded residual transform (CRT) for deep metric learning to significantly improve its generalization capability.
CRT represents and encodes the feature map from a set of complementary perspectives based on projections onto diversified prototypes (sketched after this entry).
Our experimental results and ablation studies demonstrate that the proposed CRT method outperforms state-of-the-art deep metric learning methods by large margins.
arXiv Detail & Related papers (2022-10-09T06:17:31Z)
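One plausible reading of "projections onto diversified prototypes", sketched below: each feature is encoded by its residuals with respect to a prototype set, and the concatenated residuals form the coded representation. The exact coding and pooling in CRT may differ; treat this as an illustration.
```python
import numpy as np

def coded_residual(feat, protos):
    """Encode a feature by its residuals against K prototypes."""
    protos = protos / (np.linalg.norm(protos, axis=1, keepdims=True) + 1e-8)
    coeffs = protos @ feat                                # (K,) projection coefficients
    residuals = feat[None, :] - coeffs[:, None] * protos  # (K, d) per-prototype residuals
    return residuals.reshape(-1)                          # concatenated coded representation
```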
- Consistency Regularization for Deep Face Anti-Spoofing [69.70647782777051]
Face anti-spoofing (FAS) plays a crucial role in securing face recognition systems.
We conjecture that encouraging feature consistency across different views is a promising way to boost FAS models.
We enhance both Embedding-level and Prediction-level Consistency Regularization (EPCR) in FAS.
arXiv Detail & Related papers (2021-11-24T08:03:48Z)
- Regularizing Variational Autoencoder with Diversity and Uncertainty Awareness [61.827054365139645]
The Variational Autoencoder (VAE) approximates the posterior over latent variables via amortized variational inference.
We propose an alternative model, DU-VAE, for learning a more Diverse and less Uncertain latent space.
arXiv Detail & Related papers (2021-10-24T07:58:13Z)
- Trusted Multi-View Classification [76.73585034192894]
We propose a novel multi-view classification method, termed trusted multi-view classification.
It provides a new paradigm for multi-view learning by dynamically integrating different views at the evidence level (a sketch of the fusion rule follows this entry).
The proposed algorithm jointly utilizes multiple views to promote both classification reliability and robustness.
arXiv Detail & Related papers (2021-02-03T13:30:26Z)
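Evidence-level integration can be sketched as combining per-view Dirichlet evidence; the code below implements a reduced Dempster-style combination of two views' Dirichlet parameters, one common form of such fusion. Treat the details as an illustration rather than the paper's exact algorithm.
```python
import numpy as np

def dirichlet_fuse(alpha1, alpha2):
    """Fuse two views' Dirichlet parameters at the evidence level."""
    K = alpha1.shape[0]
    def split(alpha):
        S = alpha.sum()
        return (alpha - 1) / S, K / S            # belief masses, uncertainty mass
    b1, u1 = split(alpha1); b2, u2 = split(alpha2)
    C = b1.sum() * b2.sum() - (b1 * b2).sum()    # conflict between the views
    b = (b1 * b2 + b1 * u2 + b2 * u1) / (1 - C)  # combined belief
    u = u1 * u2 / (1 - C)                        # combined uncertainty
    S = K / u
    return b * S + 1                             # fused Dirichlet parameters
```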