ProtoDCS: Towards Robust and Efficient Open-Set Test-Time Adaptation for Vision-Language Models
- URL: http://arxiv.org/abs/2602.23653v1
- Date: Fri, 27 Feb 2026 03:39:02 GMT
- Title: ProtoDCS: Towards Robust and Efficient Open-Set Test-Time Adaptation for Vision-Language Models
- Authors: Wei Luo, Yangfan Ou, Jin Deng, Zeshuai Deng, Xiquan Yan, Zhiquan Wen, Mingkui Tan,
- Abstract summary: Prototype-based Double-Check Separation (ProtoDCS) is a robust framework for OSTTA. It separates csID and csOOD samples, enabling safe and efficient adaptation of Vision-Language Models to csID data. ProtoDCS significantly boosts both known-class accuracy and OOD detection metrics.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large-scale Vision-Language Models (VLMs) exhibit strong zero-shot recognition, yet their real-world deployment is challenged by distribution shifts. While Test-Time Adaptation (TTA) can mitigate this, existing VLM-based TTA methods operate under a closed-set assumption, failing in open-set scenarios where test streams contain both covariate-shifted in-distribution (csID) and out-of-distribution (csOOD) data. This leads to a critical difficulty: the model must discriminate unknown csOOD samples to avoid interference while simultaneously adapting to known csID classes for accuracy. Current open-set TTA (OSTTA) methods rely on hard thresholds for separation and entropy minimization for adaptation. These strategies are brittle, often misclassifying ambiguous csOOD samples and inducing overconfident predictions, and their parameter-update mechanism is computationally prohibitive for VLMs. To address these limitations, we propose Prototype-based Double-Check Separation (ProtoDCS), a robust framework for OSTTA that effectively separates csID and csOOD samples, enabling safe and efficient adaptation of VLMs to csID data. Our main contributions are: (1) a novel double-check separation mechanism employing probabilistic Gaussian Mixture Model (GMM) verification to replace brittle thresholding; and (2) an evidence-driven adaptation strategy utilizing uncertainty-aware loss and efficient prototype-level updates, mitigating overconfidence and reducing computational overhead. Extensive experiments on CIFAR-10/100-C and Tiny-ImageNet-C demonstrate that ProtoDCS achieves state-of-the-art performance, significantly boosting both known-class accuracy and OOD detection metrics. Code will be available at https://github.com/O-YangF/ProtoDCS.
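The abstract's core idea of replacing a hard confidence threshold with probabilistic GMM verification can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: it fits a two-component 1-D Gaussian mixture to per-sample confidence scores via EM and keeps samples by their posterior probability under the high-confidence component, rather than by a fixed cutoff.

```python
import numpy as np

def fit_gmm_1d(x, iters=50):
    """EM for a 2-component 1-D Gaussian mixture (illustrative stand-in
    for ProtoDCS's GMM verification; not the paper's actual code)."""
    mu = np.array([x.min(), x.max()], dtype=float)   # init at extremes
    var = np.full(2, x.var() + 1e-6)
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: per-sample responsibilities (posterior over components)
        log_p = (-0.5 * (x[:, None] - mu) ** 2 / var
                 - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update mixture parameters
        nk = r.sum(axis=0) + 1e-12
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        pi = nk / len(x)
    return mu, var, pi, r

# synthetic confidence scores: csID-like samples score high, csOOD-like low
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.9, 0.05, 200),   # csID-like
                         rng.normal(0.4, 0.10, 200)])  # csOOD-like
mu, var, pi, resp = fit_gmm_1d(scores)
id_comp = int(np.argmax(mu))      # higher-mean component = csID
p_id = resp[:, id_comp]
keep = p_id > 0.5                 # posterior-based, not a hard score cutoff
```

The separation boundary here adapts to the batch's own score distribution, which is the qualitative advantage the abstract claims over brittle fixed thresholds.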
Related papers
- Cross-modal Proxy Evolving for OOD Detection with Vision-Language Models [59.242742594156546]
CoEvo is a test-time framework that performs bidirectional, sample-conditioned adaptation of both textual and visual proxies. CoEvo achieves state-of-the-art performance, improving AUROC by 1.33% and reducing FPR95 by 45.98% on ImageNet-1K compared to strong negative-label baselines.
arXiv Detail & Related papers (2026-01-13T12:08:26Z) - Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach [78.4812458793128]
We propose TACO, a test-time-scaling framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and, being gradient-free, it yields significant computational savings.
arXiv Detail & Related papers (2025-12-02T14:42:54Z) - Scalable, Explainable and Provably Robust Anomaly Detection with One-Step Flow Matching [14.503330877000758]
Time-Conditioned Contraction Matching (TCCM) is a novel method for semi-supervised anomaly detection in tabular data. It is inspired by flow matching, a recent generative modeling framework that learns velocity fields between probability distributions. Extensive experiments on the ADBench benchmark show that TCCM strikes a favorable balance between detection accuracy and inference cost.
arXiv Detail & Related papers (2025-10-21T06:26:38Z) - Bayesian Test-time Adaptation for Object Recognition and Detection with Vision-language Models [86.53246292425699]
We present BCA+, a training-free framework for TTA for both object recognition and detection. We formulate adaptation as a Bayesian inference problem, where final predictions are generated by fusing the initial VLM output with a cache-based prediction. BCA+ achieves state-of-the-art performance on both recognition and detection benchmarks.
arXiv Detail & Related papers (2025-10-03T06:27:33Z) - Knowledge Regularized Negative Feature Tuning of Vision-Language Models for Out-of-Distribution Detection [54.433899174017185]
Out-of-distribution (OOD) detection is crucial for building reliable machine learning models. We propose a novel method called Knowledge Regularized Negative Feature Tuning (KR-NFT). NFT applies distribution-aware transformations to pre-trained text features, effectively separating positive and negative features into distinct spaces. When trained with few-shot samples from the ImageNet dataset, KR-NFT not only improves ID classification accuracy and OOD detection but also significantly reduces the FPR95 by 5.44%.
arXiv Detail & Related papers (2025-07-26T07:44:04Z) - Advancing Reliable Test-Time Adaptation of Vision-Language Models under Visual Variations [67.35596444651037]
Vision-language models (VLMs) exhibit remarkable zero-shot capabilities but struggle with distribution shifts in downstream tasks when labeled data is unavailable. We propose a Reliable Test-time Adaptation (ReTA) method that enhances reliability from two perspectives.
arXiv Detail & Related papers (2025-07-13T05:37:33Z) - Mitigating Cache Noise in Test-Time Adaptation for Large Vision-Language Models [13.157596316463621]
Test-time adaptation (TTA) of vision-language models has attracted significant attention as a solution to the performance degradation caused by distribution shifts in downstream tasks. We introduce a comprehensive and reliable caching mechanism and propose a novel zero-shot TTA method called "Cache, Residual, Gaussian" (CRG). Experimental results on 13 benchmarks demonstrate that CRG outperforms state-of-the-art TTA methods, showcasing exceptional robustness and adaptability.
arXiv Detail & Related papers (2025-03-24T04:32:35Z) - Benchmarking Vision Foundation Models for Input Monitoring in Autonomous Driving [7.064497253920508]
Vision Foundation Models (VFMs) are proposed as feature extractors, combined with density modeling techniques. A comparison with state-of-the-art binary OOD classification methods reveals that VFM embeddings with density estimation outperform existing approaches in identifying OOD inputs. Our method detects high-risk inputs likely to cause errors in downstream tasks, thereby improving overall performance.
arXiv Detail & Related papers (2025-01-14T12:51:34Z) - DOTA: Distributional Test-Time Adaptation of Vision-Language Models [69.41389326333771]
Vision-language foundation models can be unreliable when significant distribution gaps exist between training and test data. We propose DOTA (DistributiOnal Test-time Adaptation), a simple yet effective method addressing this limitation. This distribution-centric approach enables the model to continually learn and adapt to the deployment environment.
arXiv Detail & Related papers (2024-09-28T15:03:28Z) - Unified Entropy Optimization for Open-Set Test-Time Adaptation [40.111891407629]
Test-time adaptation (TTA) aims at adapting a model pre-trained on the labeled source domain to the unlabeled target domain.
Many state-of-the-art closed-set TTA methods perform poorly when applied to open-set scenarios.
We propose a simple but effective framework called unified entropy optimization (UniEnt).
arXiv Detail & Related papers (2024-04-09T07:08:00Z)
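Entropy minimization, the closed-set adaptation objective that both the ProtoDCS abstract and the UniEnt entry contrast against, can be sketched as follows. This is an illustrative TENT-style loss, not either paper's exact objective: minimizing the mean Shannon entropy of a batch's predictive distributions sharpens predictions on unlabeled test data, which is precisely what becomes risky when csOOD samples are mixed into the stream.

```python
import numpy as np

def softmax(z):
    # numerically stable row-wise softmax
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy_loss(logits):
    """Mean Shannon entropy of predictive distributions (TENT-style
    objective, shown for illustration only)."""
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())

logits = np.array([[2.0, 0.1, 0.1],   # confident sample: low entropy
                   [0.5, 0.4, 0.6]])  # ambiguous sample: high entropy
loss = entropy_loss(logits)
```

Driving this loss down on an ambiguous csOOD sample forces an overconfident prediction for a known class, which is the failure mode the uncertainty-aware alternatives above are designed to avoid.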
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.