Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs
- URL: http://arxiv.org/abs/2603.02618v1
- Date: Tue, 03 Mar 2026 05:44:47 GMT
- Title: Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs
- Authors: Zhikang Xu, Qianqian Xu, Zitai Wang, Cong Hua, Sicong Li, Zhiyong Yang, Qingming Huang
- Abstract summary: Out-of-distribution (OOD) detection seeks to identify samples from unknown classes. Current methods often incorporate intra-modal distance during OOD detection, such as comparing negative texts with ID labels. We propose InterNeg, a framework that systematically utilizes consistent inter-modal distance enhancement from textual and visual perspectives.
- Score: 80.03370593724422
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Out-of-distribution (OOD) detection seeks to identify samples from unknown classes, a critical capability for deploying machine learning models in open-world scenarios. Recent research has demonstrated that Vision-Language Models (VLMs) can effectively leverage their multi-modal representations for OOD detection. However, current methods often incorporate intra-modal distance during OOD detection, such as comparing negative texts with ID labels or comparing test images with image proxies. This design paradigm creates an inherent inconsistency against the inter-modal distance that CLIP-like VLMs are optimized for, potentially leading to suboptimal performance. To address this limitation, we propose InterNeg, a simple yet effective framework that systematically utilizes consistent inter-modal distance enhancement from textual and visual perspectives. From the textual perspective, we devise an inter-modal criterion for selecting negative texts. From the visual perspective, we dynamically identify high-confidence OOD images and invert them into the textual space, generating extra negative text embeddings guided by inter-modal distance. Extensive experiments across multiple benchmarks demonstrate the superiority of our approach. Notably, our InterNeg achieves state-of-the-art performance compared to existing works, with a 3.47% reduction in FPR95 on the large-scale ImageNet benchmark and a 5.50% improvement in AUROC on the challenging Near-OOD benchmark.
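The negative-text scoring idea underlying this line of work can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `ood_score`, the temperature value, and the toy embeddings below are illustrative assumptions; only the core idea (score an image by how much inter-modal similarity mass falls on ID texts versus negative texts) comes from the abstract.

```python
import numpy as np

def ood_score(image_emb, id_text_embs, neg_text_embs, temperature=0.01):
    """Toy negative-label OOD score: softmax mass assigned to ID texts.

    All embeddings are assumed L2-normalized, so dot products are cosine
    similarities -- the inter-modal distance CLIP-like VLMs are trained on.
    Higher score -> more likely in-distribution.
    """
    # Inter-modal similarities of the image to ID texts and negative texts.
    sims = np.concatenate([id_text_embs @ image_emb, neg_text_embs @ image_emb])
    logits = sims / temperature
    # Numerically stable softmax over all texts.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Fraction of probability mass on the ID labels.
    return probs[: len(id_text_embs)].sum()
```

An image whose embedding aligns with an ID text scores near 1; one aligning with a negative text scores near 0, so a threshold on this score separates ID from OOD. The papers below differ mainly in *how* the negative texts (or proxies) are chosen, which is the question InterNeg addresses with an inter-modal selection criterion.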
Related papers
- Vision Also You Need: Navigating Out-of-Distribution Detection with Multimodal Large Language Model [42.29540047339044]
Out-of-Distribution (OOD) detection is a critical task that has garnered significant attention. We propose a novel pipeline, MM-OOD, which leverages the multimodal reasoning capabilities of MLLMs. Our method is designed to improve performance in both near and far OOD tasks.
arXiv Detail & Related papers (2026-01-20T15:06:10Z) - Cross-modal Proxy Evolving for OOD Detection with Vision-Language Models [59.242742594156546]
CoEvo is a test-time framework that performs bidirectional, sample-conditioned adaptation of both textual and visual proxies. CoEvo achieves state-of-the-art performance, improving AUROC by 1.33% and reducing FPR95 by 45.98% on ImageNet-1K compared to strong negative-label baselines.
arXiv Detail & Related papers (2026-01-13T12:08:26Z) - Out-of-Distribution Detection with Positive and Negative Prompt Supervision Using Large Language Models [33.39682202143465]
Out-of-distribution (OOD) detection is committed to delineating the classification boundaries between in-distribution (ID) and OOD images. Negative prompts are introduced to emphasize the dissimilarity between image features and prompt content. We propose Positive and Negative Prompt Supervision, which encourages negative prompts to capture inter-class features.
arXiv Detail & Related papers (2025-11-14T03:24:09Z) - When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought [118.71264263478083]
We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. We include 546 multimodal problems, annotated with intermediate visual images and final answers.
arXiv Detail & Related papers (2025-11-04T18:00:51Z) - Mitigating the Modality Gap: Few-Shot Out-of-Distribution Detection with Multi-modal Prototypes and Image Bias Estimation [6.449894994514711]
Existing vision-language model (VLM)-based methods for out-of-distribution (OOD) detection rely on similarity scores between input images and in-distribution (ID) text prototypes. We propose incorporating ID image prototypes along with ID text prototypes to mitigate the impact of this modality gap. We present theoretical analysis and empirical evidence indicating that this approach enhances VLM-based OOD detection performance without any additional training.
arXiv Detail & Related papers (2025-02-02T04:30:51Z) - AdaNeg: Adaptive Negative Proxy Guided OOD Detection with Vision-Language Models [15.754054667010468]
Pre-trained vision-language models are effective at identifying out-of-distribution (OOD) samples by using negative labels as guidance.
We introduce *adaptive negative proxies*, which are dynamically generated during testing by exploring actual OOD images.
Our approach significantly outperforms existing methods, with a 2.45% increase in AUROC and a 6.48% reduction in FPR95.
arXiv Detail & Related papers (2024-10-26T11:20:02Z) - Negative Label Guided OOD Detection with Pretrained Vision-Language Models [96.67087734472912]
Out-of-distribution (OOD) detection aims at identifying samples from unknown classes.
We propose a novel post hoc OOD detection method, called NegLabel, which takes a vast number of negative labels from extensive corpus databases.
arXiv Detail & Related papers (2024-03-29T09:19:52Z) - From Global to Local: Multi-scale Out-of-distribution Detection [129.37607313927458]
Out-of-distribution (OOD) detection aims to detect "unknown" data whose labels have not been seen during the in-distribution (ID) training process.
Recent progress in representation learning gives rise to distance-based OOD detection.
We propose Multi-scale OOD DEtection (MODE), a first framework leveraging both global visual information and local region details.
arXiv Detail & Related papers (2023-08-20T11:56:25Z) - Triggering Failures: Out-Of-Distribution detection by learning from
local adversarial attacks in Semantic Segmentation [76.2621758731288]
We tackle the detection of out-of-distribution (OOD) objects in semantic segmentation.
Our main contribution is a new OOD detection architecture called ObsNet, associated with a dedicated training scheme based on Local Adversarial Attacks (LAA).
We show it obtains top performances both in speed and accuracy when compared to ten recent methods of the literature on three different datasets.
arXiv Detail & Related papers (2021-08-03T17:09:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.