Related papers: Cross-modal Proxy Evolving for OOD Detection with Vision-Language Models

URL: http://arxiv.org/abs/2601.08476v1
Date: Tue, 13 Jan 2026 12:08:26 GMT
Title: Cross-modal Proxy Evolving for OOD Detection with Vision-Language Models
Authors: Hao Tang, Yu Liu, Shuanglin Yan, Fei Shen, Shengfeng He, Jing Qin
Abstract summary: CoEvo is a test-time framework that performs bidirectional, sample-conditioned adaptation of both textual and visual proxies. CoEvo achieves state-of-the-art performance, improving AUROC by 1.33% and reducing FPR95 by 45.98% on ImageNet-1K compared to strong negative-label baselines.
Score: 59.242742594156546
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reliable zero-shot detection of out-of-distribution (OOD) inputs is critical for deploying vision-language models in open-world settings. However, the lack of labeled negatives in zero-shot OOD detection necessitates proxy signals that remain effective under distribution shift. Existing negative-label methods rely on a fixed set of textual proxies, which (i) sparsely sample the semantic space beyond in-distribution (ID) classes and (ii) remain static while only visual features drift, leading to cross-modal misalignment and unstable predictions. In this paper, we propose CoEvo, a training- and annotation-free test-time framework that performs bidirectional, sample-conditioned adaptation of both textual and visual proxies. Specifically, CoEvo introduces a proxy-aligned co-evolution mechanism to maintain two evolving proxy caches, which dynamically mines contextual textual negatives guided by test images and iteratively refines visual proxies, progressively realigning cross-modal similarities and enlarging local OOD margins. Finally, we dynamically re-weight the contributions of dual-modal proxies to obtain a calibrated OOD score that is robust to distribution shift. Extensive experiments on standard benchmarks demonstrate that CoEvo achieves state-of-the-art performance, improving AUROC by 1.33% and reducing FPR95 by 45.98% on ImageNet-1K compared to strong negative-label baselines.
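
To make the negative-label setting and the FPR95 metric in the abstract concrete, here is a minimal sketch in plain Python. It is not CoEvo itself: it shows only a generic negative-label style score (the share of softmax mass that falls on ID proxies rather than negative proxies) and one common way to compute FPR95. The function names `id_score` and `fpr_at_95_tpr`, the temperature of 0.01, and the toy similarity values are all assumptions made for illustration; the proxy caches, the mining of textual negatives, and the dynamic re-weighting described above are not modeled.

```python
import math


def id_score(sim_id, sim_neg, temperature=0.01):
    """Share of softmax mass on ID proxies; higher means more in-distribution.

    sim_id:  cosine similarities between one image and the ID class proxies.
    sim_neg: cosine similarities between the same image and negative proxies.
    The temperature value is an assumption, not taken from the paper.
    """
    logits = [s / temperature for s in list(sim_id) + list(sim_neg)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    return sum(exps[: len(sim_id)]) / sum(exps)


def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted when the threshold keeps 95% of ID samples."""
    ranked = sorted(id_scores, reverse=True)
    keep = math.ceil(0.95 * len(ranked))  # number of ID samples that must pass
    threshold = ranked[keep - 1]
    return sum(score >= threshold for score in ood_scores) / len(ood_scores)


if __name__ == "__main__":
    # Toy similarities (hypothetical): the first image matches an ID proxy,
    # the second matches a negative proxy more closely.
    looks_id = id_score([0.30, 0.10], [0.12, 0.08])
    looks_ood = id_score([0.10, 0.09], [0.30, 0.11])
    print(looks_id, looks_ood)
    print(fpr_at_95_tpr([0.9, 0.8, 0.7, 0.6], [0.65, 0.2]))
```

A lower FPR95 means fewer OOD inputs slip through at a threshold that still accepts 95% of ID inputs, which is why a reduction in FPR95 is reported as an improvement.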

Related papers

Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs [80.03370593724422]
Out-of-distribution (OOD) detection seeks to identify samples from unknown classes. Current methods often incorporate intra-modal distance during OOD detection, such as comparing negative texts with ID labels. We propose InterNeg, a framework that systematically utilizes consistent inter-modal distance enhancement from textual and visual perspectives.
arXiv Detail & Related papers (2026-03-03T05:44:47Z)
ProtoDCS: Towards Robust and Efficient Open-Set Test-Time Adaptation for Vision-Language Models [32.840734752367275]
Prototype-based Double-Check Separation (ProtoDCS) is a robust framework for open-set test-time adaptation (OSTTA). It separates csID and csOOD samples, enabling safe and efficient adaptation of Vision-Language Models to csID data. ProtoDCS significantly boosts both known-class accuracy and OOD detection metrics.
arXiv Detail & Related papers (2026-02-27T03:39:02Z)
Enhancing CLIP Robustness via Cross-Modality Alignment [54.01929554563447]
We propose Cross-modality Alignment (COLA), an optimal transport-based framework for vision-language models. COLA restores global image-text alignment and local structural consistency in the feature space. COLA is training-free and compatible with existing fine-tuned models.
arXiv Detail & Related papers (2025-10-28T03:47:44Z)
GOOD: Training-Free Guided Diffusion Sampling for Out-of-Distribution Detection [61.96025941146103]
GOOD is a novel framework that guides sampling trajectories towards OOD regions using off-the-shelf in-distribution (ID) classifiers. GOOD incorporates dual-level guidance: image-level guidance, based on the gradient of the log partition to reduce input likelihood, drives samples toward low-density regions in pixel space. We introduce a unified OOD score that adaptively combines image and feature discrepancies, enhancing detection robustness.
arXiv Detail & Related papers (2025-10-20T03:58:46Z)
Knowledge Regularized Negative Feature Tuning of Vision-Language Models for Out-of-Distribution Detection [54.433899174017185]
Out-of-distribution (OOD) detection is crucial for building reliable machine learning models. We propose a novel method called Knowledge Regularized Negative Feature Tuning (KR-NFT). NFT applies distribution-aware transformations to pre-trained text features, effectively separating positive and negative features into distinct spaces. When trained with few-shot samples from the ImageNet dataset, KR-NFT not only improves ID classification accuracy and OOD detection but also significantly reduces the FPR95 by 5.44%.
arXiv Detail & Related papers (2025-07-26T07:44:04Z)
DisCoPatch: Taming Adversarially-driven Batch Statistics for Improved Out-of-Distribution Detection [11.332987462182713]
In adversarial discriminators trained with Batch Normalization (BN), real and adversarial samples form distinct domains with unique batch statistics. We introduce DisCoPatch, an unsupervised Adversarial Variational Autoencoder framework that harnesses this mechanism. DisCoPatch achieves state-of-the-art results in public OOD detection benchmarks.
arXiv Detail & Related papers (2025-01-14T10:49:26Z)
AdaNeg: Adaptive Negative Proxy Guided OOD Detection with Vision-Language Models [15.754054667010468]
Pre-trained vision-language models are effective at identifying out-of-distribution (OOD) samples by using negative labels as guidance. We introduce adaptive negative proxies, which are dynamically generated during testing by exploring actual OOD images. Our approach significantly outperforms existing methods, with a 2.45% increase in AUROC and a 6.48% reduction in FPR95.
arXiv Detail & Related papers (2024-10-26T11:20:02Z)
Progressive Proxy Anchor Propagation for Unsupervised Semantic Segmentation [20.79048009076496]
We propose a Progressive Proxy Anchor Propagation (PPAP) strategy for image-level pretrained models. This strategy gradually identifies more trustworthy positives for each anchor by relocating its proxy to regions densely populated with semantically similar samples. Our state-of-the-art performance on various datasets validates the effectiveness of the proposed method for unsupervised semantic segmentation.
arXiv Detail & Related papers (2024-07-17T10:28:51Z)
Likelihood-Aware Semantic Alignment for Full-Spectrum Out-of-Distribution Detection [24.145060992747077]
We propose a Likelihood-Aware Semantic Alignment (LSA) framework to promote the image-text correspondence into semantically high-likelihood regions. Extensive experiments demonstrate the remarkable OOD detection performance of our proposed LSA, surpassing existing methods by a margin of 15.26% and 18.88% on two F-OOD benchmarks.
arXiv Detail & Related papers (2023-12-04T08:53:59Z)
Higher Performance Visual Tracking with Dual-Modal Localization [106.91097443275035]
Visual Object Tracking (VOT) has synchronous needs for both robustness and accuracy. We propose a dual-modal framework for target localization, consisting of robust localization that suppresses distractors via ONR and accurate localization that attends precisely to the target center via OFC.
arXiv Detail & Related papers (2021-03-18T08:47:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.