Open-Set Recognition in the Age of Vision-Language Models
- URL: http://arxiv.org/abs/2403.16528v2
- Date: Fri, 19 Jul 2024 14:16:31 GMT
- Title: Open-Set Recognition in the Age of Vision-Language Models
- Authors: Dimity Miller, Niko Sünderhauf, Alex Kenna, Keita Mason
- Abstract summary: We investigate whether vision-language models (VLMs) for open-vocabulary perception are inherently open-set models because they are trained on internet-scale datasets.
We find they introduce closed-set assumptions via their finite query set, making them vulnerable to open-set conditions.
We show that naively enlarging the query set to cover more and more classes does not mitigate this problem, but instead degrades both task performance and open-set performance.
- Score: 9.306738687897889
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Are vision-language models (VLMs) for open-vocabulary perception inherently open-set models because they are trained on internet-scale datasets? We answer this question with a clear no - VLMs introduce closed-set assumptions via their finite query set, making them vulnerable to open-set conditions. We systematically evaluate VLMs for open-set recognition and find they frequently misclassify objects not contained in their query set, leading to alarmingly low precision when tuned for high recall and vice versa. We show that naively increasing the size of the query set to contain more and more classes does not mitigate this problem, but instead causes diminishing task performance and open-set performance. We establish a revised definition of the open-set problem for the age of VLMs, define a new benchmark and evaluation protocol to facilitate standardised evaluation and research in this important area, and evaluate promising baseline approaches based on predictive uncertainty and dedicated negative embeddings on a range of open-vocabulary VLM classifiers and object detectors.
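A minimal sketch of the closed-set failure mode and the predictive-uncertainty baseline described above, with random vectors standing in for real VLM embeddings (the function name, threshold, and temperature are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def openset_classify(image_emb, query_embs, query_labels, tau=0.5):
    """Classify against a finite query set; abstain ('unknown') when the peak
    softmax score falls below tau -- a predictive-uncertainty baseline."""
    # Cosine similarity between the image and each text query embedding.
    sims = (query_embs @ image_emb) / (
        np.linalg.norm(query_embs, axis=1) * np.linalg.norm(image_emb))
    z = sims * 100.0                        # temperature-scaled logits
    z -= z.max()                            # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    best = int(np.argmax(probs))
    # Without the tau check, any object outside the query set is silently
    # forced onto the nearest in-set label: the hidden closed-set assumption.
    return query_labels[best] if probs[best] >= tau else "unknown"

rng = np.random.default_rng(0)
queries = rng.normal(size=(3, 512))         # stand-ins for text embeddings
image = rng.normal(size=512)                # stand-in for an image embedding
print(openset_classify(image, queries, ["cat", "dog", "car"]))
```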
Related papers
- Active Learning for Vision-Language Models [29.309503214127016]
We propose a novel active learning (AL) framework that enhances the zero-shot classification performance of vision-language models (VLMs).
Our approach first calibrates the predicted entropy of VLMs and then utilizes a combination of self-uncertainty and neighbor-aware uncertainty to calculate a reliable uncertainty measure for active sample selection.
Our experiments show that the proposed approach outperforms existing AL approaches on several image classification datasets.
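A rough sketch of this style of uncertainty-driven selection, assuming generic placeholder probabilities and features (the paper's calibration and neighbor weighting are more involved):

```python
import numpy as np

def entropy(p, eps=1e-12):
    return -(p * np.log(p + eps)).sum(axis=-1)

def select_for_labeling(probs, feats, k_neighbors=5, budget=10, alpha=0.5):
    """Score each unlabeled sample by its own predictive entropy plus the mean
    entropy of its nearest neighbors, then pick the highest-scoring samples."""
    self_u = entropy(probs)                              # self-uncertainty
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = f @ f.T                                       # pairwise cosine similarity
    np.fill_diagonal(sims, -np.inf)
    nn = np.argsort(-sims, axis=1)[:, :k_neighbors]      # k nearest neighbors
    neigh_u = self_u[nn].mean(axis=1)                    # neighbor-aware uncertainty
    score = alpha * self_u + (1 - alpha) * neigh_u
    return np.argsort(-score)[:budget]                   # indices worth annotating

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(10), size=100)    # placeholder VLM class probabilities
x = rng.normal(size=(100, 64))              # placeholder image features
print(select_for_labeling(p, x))
```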
arXiv Detail & Related papers (2024-10-29T16:25:50Z)
- Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions [75.45274978665684]
Vision-Language Understanding (VLU) benchmarks contain samples where answers rely on assumptions unsupported by the provided context.
We collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions.
We develop a general-purpose Context-AwaRe Abstention detector to identify samples lacking sufficient context and enhance model accuracy.
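One way to picture context-aware abstention, as a hedged stand-in for the trained detector described above (the thresholds and the entropy-gain heuristic are illustrative assumptions):

```python
import numpy as np

def entropy(p, eps=1e-12):
    return -(p * np.log(p + eps)).sum()

def abstain_if_baseless(p_no_ctx, p_with_ctx, tau_conf=0.6, tau_gain=0.05):
    """Abstain (return None) unless the retrieved context both lifts confidence
    above tau_conf and actually reduces predictive uncertainty."""
    gain = entropy(p_no_ctx) - entropy(p_with_ctx)  # uncertainty removed by context
    best = int(np.argmax(p_with_ctx))
    if p_with_ctx[best] < tau_conf or gain < tau_gain:
        return None                                 # insufficient context
    return best

print(abstain_if_baseless(np.array([0.4, 0.3, 0.3]), np.array([0.8, 0.1, 0.1])))    # -> 0
print(abstain_if_baseless(np.array([0.34, 0.33, 0.33]), np.array([0.4, 0.3, 0.3])))  # -> None
```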
arXiv Detail & Related papers (2024-05-18T02:21:32Z)
- RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) often struggle when classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
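A compact sketch of a retrieve-then-rank loop; here a similarity-sum aggregation stands in for the MLLM-based ranking stage, and all names below are illustrative:

```python
import numpy as np

def retrieve_and_rank(img_emb, memory_embs, memory_labels, topk=5):
    """Retrieve the top-k most similar labeled examples, then rank the
    candidate labels by their aggregated similarity to the query image."""
    m = memory_embs / np.linalg.norm(memory_embs, axis=1, keepdims=True)
    q = img_emb / np.linalg.norm(img_emb)
    sims = m @ q
    idx = np.argsort(-sims)[:topk]                     # retrieval stage
    scores = {}
    for i in idx:                                      # ranking stage: aggregate
        lbl = memory_labels[i]                         # evidence per candidate label
        scores[lbl] = scores.get(lbl, 0.0) + float(sims[i])
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(2)
mem = rng.normal(size=(50, 256))                       # placeholder retrieval memory
labels = [f"species_{i % 7}" for i in range(50)]       # fine-grained candidates
print(retrieve_and_rank(rng.normal(size=256), mem, labels))
```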
arXiv Detail & Related papers (2024-03-20T17:59:55Z)
- ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and effective at triggering hallucinations in large language models.
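A skeleton of the prompt-chaining idea; `llm` below is a stub, not a real API, and the prompts are illustrative:

```python
def llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. an API client); returns a canned
    string here so the pipeline skeleton runs end to end."""
    return f"<model output for: {prompt[:40]}...>"

def perturb_evidence(question: str, evidence: str) -> dict:
    """Chain prompts: first rewrite the evidence so it supports a different
    answer, then derive the new gold answer from the rewritten evidence."""
    new_evidence = llm(
        f"Rewrite this passage so it plausibly supports a different answer "
        f"to the question.\nQuestion: {question}\nPassage: {evidence}")
    new_answer = llm(
        f"Answer the question using only this passage.\n"
        f"Question: {question}\nPassage: {new_evidence}")
    return {"question": question, "evidence": new_evidence, "answer": new_answer}

print(perturb_evidence("Who wrote Hamlet?", "Hamlet was written by Shakespeare."))
```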
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
- How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection [25.506346503624894]
We propose a new benchmark named OVDEval, which includes 9 sub-tasks and introduces evaluations on commonsense knowledge.
The dataset is meticulously created to provide hard negatives that challenge models' true understanding of visual and linguistic input.
arXiv Detail & Related papers (2023-08-25T04:54:32Z)
- Multimodal Prompt Retrieval for Generative Visual Question Answering [9.973591610073006]
We propose a novel generative model enhanced by multimodal prompt retrieval (MPR) that integrates retrieved prompts and multimodal features to generate answers in free text.
Our experiments on medical VQA tasks show that MPR outperforms its non-retrieval counterpart by up to 30% accuracy points in a few-shot domain adaptation setting.
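A toy sketch of prompt retrieval for generative VQA, assuming placeholder embeddings and a hypothetical prompt bank (the paper integrates multimodal features more deeply):

```python
import numpy as np

def retrieve_prompts(query_emb, prompt_embs, prompts, topk=2):
    """Fetch the most similar stored prompts by cosine similarity and splice
    them ahead of the question, RAG-style, for a generative VQA model."""
    p = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    idx = np.argsort(-(p @ q))[:topk]
    return "\n".join(prompts[i] for i in idx)

rng = np.random.default_rng(5)
bank = ["Q: What organ is shown? A: lung", "Q: Is there a fracture? A: no"]
context = retrieve_prompts(rng.normal(size=64), rng.normal(size=(2, 64)), bank)
print(context + "\nQ: What abnormality is visible? A:")
```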
arXiv Detail & Related papers (2023-06-30T14:06:13Z)
- Coreset Sampling from Open-Set for Fine-Grained Self-Supervised Learning [10.57079240576682]
We introduce a novel Open-Set Self-Supervised Learning problem under the assumption that a large-scale unlabeled open-set is available.
In our problem setup, it is crucial to consider the distribution mismatch between the open-set and target dataset.
We demonstrate that SimCore, our coreset sampling algorithm, significantly improves representation learning performance across extensive experimental settings.
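A toy version of coreset selection from an open-set, using nearest-to-target similarity as a stand-in for SimCore's actual objective:

```python
import numpy as np

def coreset_from_openset(open_feats, target_feats, budget=10):
    """Pick open-set samples whose features lie closest to the target
    dataset, approximated here by similarity to the nearest target feature."""
    o = open_feats / np.linalg.norm(open_feats, axis=1, keepdims=True)
    t = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    # For each open-set sample, similarity to its nearest target sample;
    # this penalizes the distribution mismatch mentioned above.
    affinity = (o @ t.T).max(axis=1)
    return np.argsort(-affinity)[:budget]    # indices to add to pretraining data

rng = np.random.default_rng(3)
open_pool = rng.normal(size=(1000, 128))     # large unlabeled open-set
target = rng.normal(size=(50, 128))          # fine-grained target dataset
print(coreset_from_openset(open_pool, target))
```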
arXiv Detail & Related papers (2023-03-20T13:38:29Z)
- OpenAUC: Towards AUC-Oriented Open-Set Recognition [151.5072746015253]
Traditional machine learning follows a closed-set assumption: the training and test sets share the same label space.
Open-Set Recognition (OSR) aims to make correct predictions on both closed-set samples and open-set samples.
To address the limitations of existing OSR evaluation metrics, we propose a novel metric named OpenAUC.
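A worked sketch of the pairwise idea behind an AUC-style open-set metric; this is one common reading of OpenAUC, not necessarily the paper's exact definition:

```python
import numpy as np

def open_auc(known_scores, known_correct, unknown_scores):
    """Fraction of (known, unknown) pairs where the known sample is both
    correctly classified closed-set and scored more 'known' than the unknown."""
    known_scores = np.asarray(known_scores, dtype=float)
    unknown_scores = np.asarray(unknown_scores, dtype=float)
    correct = np.asarray(known_correct, dtype=bool)
    # Pairwise comparison couples ranking quality with closed-set accuracy.
    wins = (known_scores[:, None] > unknown_scores[None, :]) & correct[:, None]
    return wins.mean()

# Known samples: per-sample 'knownness' score and closed-set correctness.
print(open_auc([0.9, 0.8, 0.3], [True, False, True], [0.5, 0.2]))  # -> 0.5
```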
arXiv Detail & Related papers (2022-10-22T08:54:15Z)
- Open-Set Recognition: A Good Closed-Set Classifier is All You Need [146.6814176602689]
We show that the ability of a classifier to make the 'none-of-the-above' decision is highly correlated with its accuracy on the closed-set classes.
We use this correlation to boost the performance of the cross-entropy OSR 'baseline' by improving its closed-set accuracy.
We also construct new benchmarks which better respect the task of detecting semantic novelty.
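The cross-entropy baseline referred to here typically scores inputs by maximum softmax probability (MSP); a minimal sketch of the 'none-of-the-above' decision it supports:

```python
import numpy as np

def msp_scores(logits):
    """Maximum softmax probability per sample: high for confident (likely
    closed-set) inputs, low for inputs the classifier finds unfamiliar."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return p.max(axis=1)

def none_of_the_above(logits, tau=0.5):
    """Predict a class, or -1 ('none of the above') when MSP falls below tau."""
    preds = logits.argmax(axis=1)
    return np.where(msp_scores(logits) >= tau, preds, -1)

rng = np.random.default_rng(4)
print(none_of_the_above(rng.normal(size=(5, 10)) * 3))  # placeholder logits
```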
arXiv Detail & Related papers (2021-10-12T17:58:59Z)
- A Revised Generative Evaluation of Visual Dialogue [80.17353102854405]
We propose a revised evaluation scheme for the VisDial dataset.
We measure consensus between answers generated by the model and a set of relevant answers.
We release these sets and code for the revised evaluation scheme as DenseVisDial.
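A simple stand-in for a consensus measure over a set of relevant answers, using Jaccard token overlap rather than the paper's actual scoring:

```python
import numpy as np

def consensus_score(generated: str, relevant_answers: list[str]) -> float:
    """Token-overlap consensus between a generated answer and a set of
    relevant reference answers (a stand-in for the released metric)."""
    gen = set(generated.lower().split())
    overlaps = []
    for ref in relevant_answers:
        r = set(ref.lower().split())
        overlaps.append(len(gen & r) / max(len(gen | r), 1))  # Jaccard overlap
    return float(np.mean(overlaps))

print(consensus_score("a brown dog", ["the brown dog", "a dog", "brown puppy"]))
```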
arXiv Detail & Related papers (2020-04-20T13:26:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and accepts no responsibility for any consequences of its use.