Related papers: Bidirectional Prototype-Reward co-Evolution for Test-Time Adaptation of Vision-Language Models

Bidirectional Prototype-Reward co-Evolution for Test-Time Adaptation of Vision-Language Models

URL: http://arxiv.org/abs/2503.09394v2
Date: Sat, 12 Jul 2025 06:36:58 GMT
Title: Bidirectional Prototype-Reward co-Evolution for Test-Time Adaptation of Vision-Language Models
Authors: Xiaozhen Qiao, Peng Huang, Jiakang Yuan, Xianda Guo, Bowen Ye, Chaocan Xue, Ye Zheng, Zhe Sun, Xuelong Li,
Abstract summary: Test-time adaptation (TTA) is crucial in maintaining performance of Vision Language Models (VLMs) when facing distribution shifts.<n>We propose underlineBidirectional Prototype-Reward co-Evolution (BPRE)<n>BPRE integrates feature quality assessment with prototype evolution via a synergistic feedback loop.<n>Our model consistently achieves superior performance compared to other SOTA methods, and advances VLM generalization capabilities.
Score: 38.63571023556356
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Test-time adaptation (TTA) is crucial in maintaining performance of Vision Language Models (VLMs) when facing distribution shifts, particularly when the source data or target labels are inaccessible. Existing TTA methods predominantly leverage the output probability distribution of CLIP for feature evaluation, resulting in biases under domain shifts, which cause misclassified features due to text priors or incorrect textual associations. To address these issues, we propose \underline{B}idirectional Prototype-Reward co-Evolution (BPRE), a novel VLMs framework with TTA that integrates feature quality assessment with prototype evolution via a synergistic feedback loop. First, the Multi-dimensional Quality-aware Reward Module (MQRM) is designed to evaluate feature quality and guide prototype refinement precisely. The continuous refinement of prototype quality via Prototype-Reward Interactive Evolution (PRIE) enhances the computation more robust. Through this bidirectional interaction, the precision of rewards and prototype evolution mutually reinforce each other, forming a self-evolving feedback cycle. Extensive experiments conducted on 15 diverse recognition datasets demonstrate that our model consistently achieves superior performance compared to other SOTA methods, and advances VLM generalization capabilities through emphasizing comprehensive feature evaluation.

Related papers

Unified modality separation: A vision-language framework for unsupervised domain adaptation [60.8391821117794]
Unsupervised domain adaptation (UDA) enables models trained on a labeled source domain to handle new unlabeled domains.<n>We propose a unified modality separation framework that accommodates both modality-specific and modality-invariant components.<n>Our methods achieve up to 9% performance gain with 9 times of computational efficiencies.
arXiv Detail & Related papers (2025-08-07T02:51:10Z)
Test-Time Consistency in Vision Language Models [26.475993408532304]
Vision-Language Models (VLMs) have achieved impressive performance across a wide range of multimodal tasks.<n>Recent benchmarks, such as MM-R3, highlight that even state-of-the-art VLMs can produce divergent predictions across semantically equivalent inputs.<n>We propose a simple and effective test-time consistency framework that enhances semantic consistency without supervised re-training.
arXiv Detail & Related papers (2025-06-27T17:09:44Z)
DPU: Dynamic Prototype Updating for Multimodal Out-of-Distribution Detection [10.834698906236405]
Out-of-distribution (OOD) detection is essential for ensuring the robustness of machine learning models. Recent advances in multimodal models have demonstrated the potential of leveraging multiple modalities to enhance detection performance. We propose Dynamic Prototype Updating (DPU), a novel plug-and-play framework for multimodal OOD detection.
arXiv Detail & Related papers (2024-11-12T22:43:16Z)
PCoTTA: Continual Test-Time Adaptation for Multi-Task Point Cloud Understanding [40.42904797189929]
We present PCoTTA, an innovative framework for Continual Test-Time Adaptation (CoTTA) in multi-task point cloud understanding. Our PCoTTA involves three key components: automatic prototype mixture (APM), Gaussian Splatted feature shifting (GSFS), and contrastive prototype repulsion (CPR) CPR is proposed to pull the nearest learnable prototype close to the testing feature and push it away from other prototypes, making each prototype distinguishable during the adaptation.
arXiv Detail & Related papers (2024-11-01T14:41:36Z)
Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models [11.545127156146368]
We introduce Dual Prototype Evolving (DPE), a novel test-time adaptation approach for pre-trained vision-language models (VLMs) We create and evolve two sets of prototypes--textual and visual--to progressively capture more accurate multi-modal representations for target classes during test time. Our proposed DPE consistently outperforms previous state-of-the-art methods while also exhibiting competitive computational efficiency.
arXiv Detail & Related papers (2024-10-16T17:59:49Z)
MITA: Bridging the Gap between Model and Data for Test-time Adaptation [68.62509948690698]
Test-Time Adaptation (TTA) has emerged as a promising paradigm for enhancing the generalizability of models. We propose Meet-In-The-Middle based MITA, which introduces energy-based optimization to encourage mutual adaptation of the model and data from opposing directions.
arXiv Detail & Related papers (2024-10-12T07:02:33Z)
Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models [19.683461002518147]
Test-Time Prototype Shifting (TPS) is a pioneering approach designed to adapt vision-language models to test datasets using unlabeled test inputs.<n>TPS not only facilitates optimization-free prototype reuse for subsequent predictions but also enables seamless integration with current advancements in prompt engineering.<n>A notable aspect of our framework is its significantly reduced memory and computational demands when compared to conventional text-prompt tuning methods.
arXiv Detail & Related papers (2024-03-19T17:54:34Z)
Embedded feature selection in LSTM networks with multi-objective evolutionary ensemble learning for time series forecasting [49.1574468325115]
We present a novel feature selection method embedded in Long Short-Term Memory networks. Our approach optimize the weights and biases of the LSTM in a partitioned manner. Experimental evaluations on air quality time series data from Italy and southeast Spain demonstrate that our method substantially improves the ability generalization of conventional LSTMs.
arXiv Detail & Related papers (2023-12-29T08:42:10Z)
Towards Calibrated Robust Fine-Tuning of Vision-Language Models [97.19901765814431]
This work proposes a robust fine-tuning method that improves both OOD accuracy and confidence calibration simultaneously in vision language models. We show that both OOD classification and OOD calibration errors have a shared upper bound consisting of two terms of ID data. Based on this insight, we design a novel framework that conducts fine-tuning with a constrained multimodal contrastive loss enforcing a larger smallest singular value.
arXiv Detail & Related papers (2023-11-03T05:41:25Z)
Consistency Regularization for Generalizable Source-free Domain Adaptation [62.654883736925456]
Source-free domain adaptation (SFDA) aims to adapt a well-trained source model to an unlabelled target domain without accessing the source dataset. Existing SFDA methods ONLY assess their adapted models on the target training set, neglecting the data from unseen but identically distributed testing sets. We propose a consistency regularization framework to develop a more generalizable SFDA method.
arXiv Detail & Related papers (2023-08-03T07:45:53Z)
CLIPood: Generalizing CLIP to Out-of-Distributions [73.86353105017076]
Contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, but the further adaptation of CLIP on downstream tasks undesirably degrades OOD performances. We propose CLIPood, a fine-tuning method that can adapt CLIP models to OOD situations where both domain shifts and open classes may occur on unseen test data. Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
arXiv Detail & Related papers (2023-02-02T04:27:54Z)
Coded Residual Transform for Generalizable Deep Metric Learning [34.100840501900706]
We introduce a new method called coded residual transform (CRT) for deep metric learning to significantly improve its generalization capability. CRT represents and encodes the feature map from a set of complimentary perspectives based on projections onto diversified prototypes. Our experimental results and ablation studies demonstrate that the proposed CRT method outperform the state-of-the-art deep metric learning methods by large margins.
arXiv Detail & Related papers (2022-10-09T06:17:31Z)
Consistency Regularization for Deep Face Anti-Spoofing [69.70647782777051]
Face anti-spoofing (FAS) plays a crucial role in securing face recognition systems. Motivated by this exciting observation, we conjecture that encouraging feature consistency of different views may be a promising way to boost FAS models. We enhance both Embedding-level and Prediction-level Consistency Regularization (EPCR) in FAS.
arXiv Detail & Related papers (2021-11-24T08:03:48Z)
Regularizing Variational Autoencoder with Diversity and Uncertainty Awareness [61.827054365139645]
Variational Autoencoder (VAE) approximates the posterior of latent variables based on amortized variational inference. We propose an alternative model, DU-VAE, for learning a more Diverse and less Uncertain latent space.
arXiv Detail & Related papers (2021-10-24T07:58:13Z)
Trusted Multi-View Classification [76.73585034192894]
We propose a novel multi-view classification method, termed trusted multi-view classification. It provides a new paradigm for multi-view learning by dynamically integrating different views at an evidence level. The proposed algorithm jointly utilizes multiple views to promote both classification reliability and robustness.
arXiv Detail & Related papers (2021-02-03T13:30:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.