Related papers: Vision Also You Need: Navigating Out-of-Distribution Detection with Multimodal Large Language Model

Vision Also You Need: Navigating Out-of-Distribution Detection with Multimodal Large Language Model

URL: http://arxiv.org/abs/2601.14052v1
Date: Tue, 20 Jan 2026 15:06:10 GMT
Title: Vision Also You Need: Navigating Out-of-Distribution Detection with Multimodal Large Language Model
Authors: Haoran Xu, Yanlin Liu, Zizhao Tong, Jiaze Li, Kexue Fu, Yuyang Zhang, Longxiang Gao, Shuaiguang Li, Xingyu Li, Yanran Xu, Changwei Wang,
Abstract summary: Out-of-Distribution (OOD) detection is a critical task that has garnered significant attention.<n>We propose a novel pipeline, MM-OOD, which leverages the multimodal reasoning capabilities of MLLMs.<n>Our method is designed to improve performance in both near and far OOD tasks.
Score: 42.29540047339044
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Out-of-Distribution (OOD) detection is a critical task that has garnered significant attention. The emergence of CLIP has spurred extensive research into zero-shot OOD detection, often employing a training-free approach. Current methods leverage expert knowledge from large language models (LLMs) to identify potential outliers. However, these approaches tend to over-rely on knowledge in the text space, neglecting the inherent challenges involved in detecting out-of-distribution samples in the image space. In this paper, we propose a novel pipeline, MM-OOD, which leverages the multimodal reasoning capabilities of MLLMs and their ability to conduct multi-round conversations for enhanced outlier detection. Our method is designed to improve performance in both near OOD and far OOD tasks. Specifically, (1) for near OOD tasks, we directly feed ID images and corresponding text prompts into MLLMs to identify potential outliers; and (2) for far OOD tasks, we introduce the sketch-generate-elaborate framework: first, we sketch outlier exposure using text prompts, then generate corresponding visual OOD samples, and finally elaborate by using multimodal prompts. Experiments demonstrate that our method achieves significant improvements on widely used multimodal datasets such as Food-101, while also validating its scalability on ImageNet-1K.

Related papers

Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs [80.03370593724422]
Out-of-distribution (OOD) detection seeks to identify samples from unknown classes.<n>Current methods often incorporate intra-modal distance during OOD detection, such as comparing negative texts with ID labels.<n>We propose InterNeg, a framework that systematically utilizes consistent inter-modal distance enhancement from textual and visual perspectives.
arXiv Detail & Related papers (2026-03-03T05:44:47Z)
Recent Advances in Out-of-Distribution Detection with CLIP-Like Models: A Survey [27.467732819969935]
Out-of-distribution detection (OOD) is a pivotal task for real-world applications that trains models to identify samples that are distributionally different from the in-distribution (ID) data during testing.<n>Recent advances in AI, particularly Vision-Language Models (VLMs) like CLIP, have revolutionized OOD detection by shifting from traditional unimodal image detectors to multimodal image-text detectors.<n>To better align with CLIP's cross-modal nature, we propose a new categorization framework rooted in both image and text modalities.
arXiv Detail & Related papers (2025-05-05T08:22:38Z)
Envisioning Outlier Exposure by Large Language Models for Out-of-Distribution Detection [71.93411099797308]
Out-of-distribution (OOD) samples are crucial when deploying machine learning models in open-world scenarios. We propose to tackle this constraint by leveraging the expert knowledge and reasoning capability of large language models (LLM) to potential Outlier Exposure, termed EOE. EOE can be generalized to different tasks, including far, near, and fine-language OOD detection. EOE achieves state-of-the-art performance across different OOD tasks and can be effectively scaled to the ImageNet-1K dataset.
arXiv Detail & Related papers (2024-06-02T17:09:48Z)
MultiOOD: Scaling Out-of-Distribution Detection for Multiple Modalities [11.884004583641325]
We introduce the first-of-its-kind benchmark, MultiOOD, characterized by diverse dataset sizes and varying modality combinations. We first evaluate existing unimodal OOD detection algorithms on MultiOOD, observing that the mere inclusion of additional modalities yields substantial improvements. We introduce a novel outlier synthesis method, NP-Mix, which explores broader feature spaces by leveraging the information from nearest neighbor classes.
arXiv Detail & Related papers (2024-05-27T17:59:02Z)
ID-like Prompt Learning for Few-Shot Out-of-Distribution Detection [47.16254775587534]
We propose a novel OOD detection framework that discovers idlike outliers using CLIP citeDBLP:conf/icml/RadfordKHRGASAM21. Benefiting from the powerful CLIP, we only need a small number of ID samples to learn the prompts of the model. Our method achieves superior few-shot learning performance on various real-world image datasets.
arXiv Detail & Related papers (2023-11-26T09:06:40Z)
Exploring Large Language Models for Multi-Modal Out-of-Distribution Detection [67.68030805755679]
Large language models (LLMs) encode a wealth of world knowledge and can be prompted to generate descriptive features for each class. In this paper, we propose to apply world knowledge to enhance OOD detection performance through selective generation from LLMs.
arXiv Detail & Related papers (2023-10-12T04:14:28Z)
From Global to Local: Multi-scale Out-of-distribution Detection [129.37607313927458]
Out-of-distribution (OOD) detection aims to detect "unknown" data whose labels have not been seen during the in-distribution (ID) training process. Recent progress in representation learning gives rise to distance-based OOD detection. We propose Multi-scale OOD DEtection (MODE), a first framework leveraging both global visual information and local region details.
arXiv Detail & Related papers (2023-08-20T11:56:25Z)
General-Purpose Multi-Modal OOD Detection Framework [5.287829685181842]
Out-of-distribution (OOD) detection identifies test samples that differ from the training data, which is critical to ensuring the safety and reliability of machine learning (ML) systems. We propose a general-purpose weakly-supervised OOD detection framework, called WOOD, that combines a binary classifier and a contrastive learning component. We evaluate the proposed WOOD model on multiple real-world datasets, and the experimental results demonstrate that the WOOD model outperforms the state-of-the-art methods for multi-modal OOD detection.
arXiv Detail & Related papers (2023-07-24T18:50:49Z)
Out-of-Domain Intent Detection Considering Multi-Turn Dialogue Contexts [91.43701971416213]
We introduce a context-aware OOD intent detection (Caro) framework to model multi-turn contexts in OOD intent detection tasks. Caro establishes state-of-the-art performances on multi-turn OOD detection tasks by improving the F1-OOD score of over $29%$ compared to the previous best method.
arXiv Detail & Related papers (2023-05-05T01:39:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.