VL-SAM-V2: Open-World Object Detection with General and Specific Query Fusion
- URL: http://arxiv.org/abs/2505.18986v1
- Date: Sun, 25 May 2025 05:44:02 GMT
- Title: VL-SAM-V2: Open-World Object Detection with General and Specific Query Fusion
- Authors: Zhiwei Lin, Yongtao Wang
- Abstract summary: We present an open-world object detection framework capable of discovering unseen objects while achieving favorable performance. By adjusting queries from open-set models, we enable VL-SAM-V2 to be evaluated in the open-set or open-ended mode. Experimental results on LVIS show that our method surpasses previous open-set and open-ended methods, especially on rare objects.
- Score: 7.719330752075467
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current perception models have achieved remarkable success by leveraging large-scale labeled datasets, but still face challenges in open-world environments with novel objects. To address this limitation, researchers introduce open-set perception models to detect or segment arbitrary test-time user-input categories. However, open-set models rely on human involvement to provide predefined object categories as input during inference. More recently, researchers have framed a more realistic and challenging task known as open-ended perception that aims to discover unseen objects without requiring any category-level input from humans at inference time. Nevertheless, open-ended models suffer from low performance compared to open-set models. In this paper, we present VL-SAM-V2, an open-world object detection framework that is capable of discovering unseen objects while achieving favorable performance. To achieve this, we combine queries from open-set and open-ended models and propose a general and specific query fusion module to allow different queries to interact. By adjusting queries from open-set models, we enable VL-SAM-V2 to be evaluated in the open-set or open-ended mode. In addition, to learn more diverse queries, we introduce ranked learnable queries to match queries with proposals from open-ended models by sorting. Moreover, we design a denoising point training strategy to facilitate the training process. Experimental results on LVIS show that our method surpasses the previous open-set and open-ended methods, especially on rare objects.
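The abstract describes fusing queries from an open-set model (category-conditioned) with queries from an open-ended model (category-free) so the two sets can interact before decoding. The paper does not include code here; the sketch below is only an illustrative guess at what such a fusion step could look like, realized as simple cross-attention in NumPy. All names, shapes, and the attention formulation are assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_queries(general_q, specific_q):
    """Hypothetical query-fusion step (not the paper's actual module):
    each general (open-ended) query attends over the specific
    (open-set) queries, and the attended context is added back,
    letting the two query sets exchange information."""
    d = general_q.shape[-1]
    # Scaled dot-product attention: (num_general, num_specific) weights.
    attn = softmax(general_q @ specific_q.T / np.sqrt(d), axis=-1)
    # Residual update keeps the original general queries intact.
    return general_q + attn @ specific_q

# Toy usage with random queries of embedding size 8.
rng = np.random.default_rng(0)
general = rng.normal(size=(5, 8))    # 5 open-ended queries
specific = rng.normal(size=(12, 8))  # 12 open-set queries
fused = fuse_queries(general, specific)
```

The residual form is a common design choice in query-based detectors: the fused output keeps the same shape as the general queries, so it can drop into a standard detection decoder unchanged.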
Related papers
- Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection [0.0]
Open World Object Detection is a challenging computer vision task. Many methods have addressed this by using pseudo-labels for unknown objects. The recently proposed Probabilistic Objectness transformer-based open-world detector (PROB) is a state-of-the-art model.
arXiv Detail & Related papers (2025-07-17T12:56:04Z)
- Solving Instance Detection from an Open-World Perspective [14.438053802336947]
Instance detection (InsDet) aims to localize specific object instances within novel scene imagery based on given visual references. Its open-world nature supports broad applications from robotics to AR/VR but also presents significant challenges.
arXiv Detail & Related papers (2025-03-01T05:56:58Z)
- From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects [0.6262268096839562]
Recent works on open vocabulary object detection (OVD) enable the detection of objects defined by an in-principle unbounded vocabulary. However, OVD relies on accurate prompts provided by an "oracle", which limits its use in critical applications such as driving scene perception. We propose a framework that enables OVD models to operate in open-world settings by identifying and incrementally learning previously unseen objects.
arXiv Detail & Related papers (2024-11-27T10:33:51Z)
- Training-Free Open-Ended Object Detection and Segmentation via Attention as Prompts [14.631774737903015]
Existing perception models achieve great success by learning from large amounts of labeled data, but they still struggle with open-world scenarios.
We present a new task, i.e., open-ended object detection, which discovers unseen objects without any object categories as inputs.
We show that our method surpasses the previous open-ended method on the object detection task and can provide additional instance segmentation masks.
arXiv Detail & Related papers (2024-10-08T12:15:08Z)
- Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning [50.26965628047682]
Adapting pre-trained models to open classes is a challenging problem in machine learning.
In this paper, we consider combining the advantages of both and come up with a test-time prompt tuning approach.
Our proposed method outperforms all comparison methods on average considering both base and new classes.
arXiv Detail & Related papers (2024-08-29T12:34:01Z)
- Open-Set Recognition in the Age of Vision-Language Models [9.306738687897889]
We investigate whether vision-language models (VLMs) for open-vocabulary perception are inherently open-set models because they are trained on internet-scale datasets.
We find they introduce closed-set assumptions via their finite query set, making them vulnerable to open-set conditions.
We show that naively increasing the size of the query set to contain more and more classes does not mitigate this problem, but instead degrades both task performance and open-set performance.
arXiv Detail & Related papers (2024-03-25T08:14:22Z)
- Open-Vocabulary Video Anomaly Detection [57.552523669351636]
Video anomaly detection (VAD) with weak supervision has achieved remarkable performance in utilizing video-level labels to discriminate whether a video frame is normal or abnormal.
Recent studies attempt to tackle a more realistic setting, open-set VAD, which aims to detect unseen anomalies given seen anomalies and normal videos.
This paper takes a step further and explores open-vocabulary video anomaly detection (OVVAD), in which we aim to leverage pre-trained large models to detect and categorize seen and unseen anomalies.
arXiv Detail & Related papers (2023-11-13T02:54:17Z)
- Multi-View Class Incremental Learning [57.14644913531313]
Multi-view learning (MVL) has gained great success in integrating information from multiple perspectives of a dataset to improve downstream task performance.
This paper investigates a novel paradigm called multi-view class incremental learning (MVCIL), where a single model incrementally classifies new classes from a continual stream of views.
arXiv Detail & Related papers (2023-06-16T08:13:41Z)
- Open World DETR: Transformer based Open World Object Detection [60.64535309016623]
We propose a two-stage training approach named Open World DETR for open world object detection based on Deformable DETR.
We fine-tune the class-specific components of the model with a multi-view self-labeling strategy and a consistency constraint.
Our proposed method outperforms other state-of-the-art open world object detection methods by a large margin.
arXiv Detail & Related papers (2022-12-06T13:39:30Z)
- OpenAUC: Towards AUC-Oriented Open-Set Recognition [151.5072746015253]
Traditional machine learning follows a close-set assumption that the training and test set share the same label space.
Open-Set Recognition (OSR) aims to make correct predictions on both close-set samples and open-set samples.
To fix these issues, we propose a novel metric named OpenAUC.
arXiv Detail & Related papers (2022-10-22T08:54:15Z)
- Opening up Open-World Tracking [62.12659607088812]
We propose and study Open-World Tracking (OWT).
Our main contribution is the formalization of the OWT task, along with an evaluation protocol and metric (OWTA).
We show that our Open-World Tracking Baseline, while performing well in the OWT setting, also achieves near state-of-the-art results on traditional closed-world benchmarks.
arXiv Detail & Related papers (2021-04-22T17:58:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.