From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects
- URL: http://arxiv.org/abs/2411.18207v3
- Date: Fri, 21 Mar 2025 03:09:27 GMT
- Title: From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects
- Authors: Zizhao Li, Zhengkang Xiang, Joseph West, Kourosh Khoshelham
- Abstract summary: Recent works on open vocabulary object detection (OVD) enable the detection of objects defined by an in-principle unbounded vocabulary. OVD relies on accurate prompts provided by an "oracle", which limits its use in critical applications such as driving scene perception. We propose a framework that enables OVD models to operate in open world settings by identifying and incrementally learning previously unseen objects.
- Score: 0.6262268096839562
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional object detection methods operate under the closed-set assumption, where models can only detect a fixed number of objects predefined in the training set. Recent works on open vocabulary object detection (OVD) enable the detection of objects defined by an in-principle unbounded vocabulary, which reduces the cost of training models for specific tasks. However, OVD heavily relies on accurate prompts provided by an "oracle", which limits its use in critical applications such as driving scene perception. OVD models tend to misclassify near-out-of-distribution (NOOD) objects that have similar features to known classes, and to ignore far-out-of-distribution (FOOD) objects. To address these limitations, we propose a framework that enables OVD models to operate in open world settings by identifying and incrementally learning previously unseen objects. To detect FOOD objects, we propose Open World Embedding Learning (OWEL) and introduce the concept of a Pseudo Unknown Embedding, which infers the location of unknown classes in a continuous semantic space from the information of known classes. We also propose Multi-Scale Contrastive Anchor Learning (MSCAL), which enables the identification of misclassified unknown objects by promoting the intra-class consistency of object embeddings at different scales. The proposed method achieves state-of-the-art performance on standard open world object detection and autonomous driving benchmarks while maintaining its open vocabulary object detection capability.
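The abstract names the two components without detail, so here is a rough, non-authoritative sketch of both ideas in PyTorch. The centroid heuristic for the pseudo unknown embedding and the InfoNCE-style anchor loss are illustrative assumptions, not the paper's exact OWEL/MSCAL formulation, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def pseudo_unknown_embedding(known: torch.Tensor) -> torch.Tensor:
    """Infer a stand-in "unknown" vector in the same semantic space as the
    known classes. known: (K, D) L2-normalized class text embeddings.
    A simple centroid is used here purely for illustration."""
    return F.normalize(known.mean(dim=0), dim=0)

def anchor_contrastive_loss(feats: torch.Tensor, labels: torch.Tensor,
                            anchors: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Pull each object embedding toward its own class anchor and away from
    the others (InfoNCE over anchors). feats: (N, D) object embeddings at
    one scale; labels: (N,) class indices; anchors: (C, D) per-class anchors."""
    logits = F.normalize(feats, dim=1) @ F.normalize(anchors, dim=1).T / tau
    return F.cross_entropy(logits, labels)
```

In an MSCAL-like setup the contrastive term would presumably be summed over feature-pyramid scales; a detection whose embedding sits far from every known anchor (or closest to the pseudo unknown vector) is then a natural candidate to flag as unknown and learn incrementally.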
Related papers
- Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
We find that present-day Vision-Language Models (VLMs) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z)
- Open-World Object Detection with Instance Representation Learning [1.8749305679160366]
We propose a method to train an object detector that can both detect novel objects and extract semantically rich features in open-world conditions.
Our method learns a robust and generalizable feature space, outperforming other OWOD-based feature extraction methods.
arXiv Detail & Related papers (2024-09-24T13:13:34Z)
- Learning Background Prompts to Discover Implicit Knowledge for Open Vocabulary Object Detection [101.15777242546649]
Open vocabulary object detection (OVD) aims at seeking an optimal object detector capable of recognizing objects from both base and novel categories.
Recent advances leverage knowledge distillation to transfer insightful knowledge from pre-trained large-scale vision-language models to the task of object detection.
We present a novel OVD framework termed LBP to propose learning background prompts to harness explored implicit background knowledge.
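The one-line summary leaves LBP's mechanism open; one plausible reading, sketched below under stated assumptions, is a set of learnable background embeddings scored alongside the class embeddings so that non-object regions have somewhere to go. The class `BackgroundPrompts` and all shapes are hypothetical, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BackgroundPrompts(nn.Module):
    """Learn n_bg extra "background" embeddings and classify region
    features against [class embeddings; background embeddings]."""
    def __init__(self, dim: int, n_bg: int = 4):
        super().__init__()
        self.bg = nn.Parameter(torch.randn(n_bg, dim) * 0.02)

    def forward(self, region_feats: torch.Tensor, class_embeds: torch.Tensor,
                tau: float = 0.07) -> torch.Tensor:
        # class_embeds: (C, dim), region_feats: (N, dim); returns (N, C + n_bg) logits
        table = torch.cat([class_embeds, F.normalize(self.bg, dim=1)], dim=0)
        return F.normalize(region_feats, dim=1) @ table.T / tau
```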
arXiv Detail & Related papers (2024-06-01T17:32:26Z)
- Generative Region-Language Pretraining for Open-Ended Object Detection [55.42484781608621]
We propose a framework named GenerateU, which can detect dense objects and generate their names in a free-form way.
Our framework achieves comparable results to the open-vocabulary object detection method GLIP.
arXiv Detail & Related papers (2024-03-15T10:52:39Z)
- SKDF: A Simple Knowledge Distillation Framework for Distilling Open-Vocabulary Knowledge to Open-world Object Detector [8.956773268679811]
We specialize the VLM model for OWOD tasks by distilling its open-world knowledge into a language-agnostic detector.
We observe that the combination of a simple knowledge distillation approach and the automatic pseudo-labeling mechanism in OWOD can achieve better performance for unknown object detection.
We propose two benchmarks for evaluating the ability of the open-world detector to detect unknown objects in the open world.
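As a rough illustration of the distillation step (not SKDF's actual loss), one common language-agnostic formulation aligns the detector's region embeddings with the VLM's embeddings of the same image crops; the function below is a minimal sketch under that assumption.

```python
import torch
import torch.nn.functional as F

def distillation_loss(region_feats: torch.Tensor, vlm_feats: torch.Tensor) -> torch.Tensor:
    """Cosine alignment between detector region embeddings and VLM (teacher)
    embeddings of the same crops; both tensors are (N, D)."""
    region_feats = F.normalize(region_feats, dim=1)
    vlm_feats = F.normalize(vlm_feats, dim=1)
    return (1 - (region_feats * vlm_feats).sum(dim=1)).mean()
```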
arXiv Detail & Related papers (2023-12-14T04:47:20Z)
- The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding [8.448399308205266]
We introduce an evaluation protocol based on dynamic vocabulary generation to test whether models detect, discern, and assign the correct fine-grained description to objects.
We further enhance our investigation by evaluating several state-of-the-art open-vocabulary object detectors using the proposed protocol.
arXiv Detail & Related papers (2023-11-29T10:40:52Z)
- OVTrack: Open-Vocabulary Multiple Object Tracking [64.73379741435255]
OVTrack is an open-vocabulary tracker capable of tracking arbitrary object classes.
It sets a new state-of-the-art on the large-scale, large-vocabulary TAO benchmark.
arXiv Detail & Related papers (2023-04-17T16:20:05Z)
- Open-World Object Detection via Discriminative Class Prototype Learning [4.055884768256164]
Open-world object detection (OWOD) is a challenging problem that combines object detection with incremental learning and open-set learning.
We propose a novel and efficient OWOD solution from a prototype perspective, which we call OCPL: Open-world object detection via discriminative Class Prototype Learning.
arXiv Detail & Related papers (2023-02-23T03:05:04Z)
- Open World DETR: Transformer based Open World Object Detection [60.64535309016623]
We propose a two-stage training approach named Open World DETR for open world object detection based on Deformable DETR.
We fine-tune the class-specific components of the model with a multi-view self-labeling strategy and a consistency constraint.
Our proposed method outperforms other state-of-the-art open world object detection methods by a large margin.
arXiv Detail & Related papers (2022-12-06T13:39:30Z)
- Towards Open-Set Object Detection and Discovery [38.81806249664884]
We present a new task, namely Open-Set Object Detection and Discovery (OSODD).
We propose a two-stage method that first uses an open-set object detector to predict both known and unknown objects.
Then, we study the representation of predicted objects in an unsupervised manner and discover new categories from the set of unknown objects.
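The discovery stage can be pictured as plain clustering over the embeddings of boxes flagged as unknown. KMeans and a fixed number of new categories below are assumptions for illustration; the paper's clustering choice may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def discover_categories(unknown_embeds: np.ndarray, n_new: int = 5):
    """Group embeddings of detections labeled "unknown" into candidate
    new categories; returns per-box cluster ids and cluster centers."""
    km = KMeans(n_clusters=n_new, n_init=10).fit(unknown_embeds)
    return km.labels_, km.cluster_centers_
```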
arXiv Detail & Related papers (2022-04-12T08:07:01Z)
- Contrastive Object Detection Using Knowledge Graph Embeddings [72.17159795485915]
We compare the error statistics of the class embeddings learned from a one-hot approach with semantically structured embeddings from natural language processing or knowledge graphs.
We propose a knowledge-embedded design for keypoint-based and transformer-based object detection architectures.
arXiv Detail & Related papers (2021-12-21T17:10:21Z)
- Learning Open-World Object Proposals without Learning to Classify [110.30191531975804]
We propose a classification-free Object Localization Network (OLN) which estimates the objectness of each region purely by how well the location and shape of a region overlaps with any ground-truth object.
This simple strategy learns generalizable objectness and outperforms existing proposals on cross-category generalization.
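A minimal sketch of the idea, assuming axis-aligned (x1, y1, x2, y2) boxes: the objectness head regresses a localization-quality target, the best IoU of each proposal with any ground-truth box, instead of a class probability. OLN itself combines centerness and IoU cues; this sketch keeps only the IoU target.

```python
import torch

def objectness_targets(proposals: torch.Tensor, gt_boxes: torch.Tensor) -> torch.Tensor:
    """proposals: (N, 4), gt_boxes: (M, 4); returns (N,) IoU-with-nearest-GT
    targets for a classification-free objectness head."""
    lt = torch.max(proposals[:, None, :2], gt_boxes[None, :, :2])  # pairwise top-left
    rb = torch.min(proposals[:, None, 2:], gt_boxes[None, :, 2:])  # pairwise bottom-right
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_p = (proposals[:, 2] - proposals[:, 0]) * (proposals[:, 3] - proposals[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    iou = inter / (area_p[:, None] + area_g[None, :] - inter)
    return iou.max(dim=1).values
```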
arXiv Detail & Related papers (2021-08-15T14:36:02Z)
- Synthesizing the Unseen for Zero-shot Object Detection [72.38031440014463]
We propose to synthesize visual features for unseen classes, so that the model learns both seen and unseen objects in the visual domain.
We use a novel generative model that uses class-semantics to not only generate the features but also to discriminatively separate them.
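A minimal sketch of such a generator, assuming class semantics are fixed vectors (e.g., word embeddings): features are synthesized from semantics plus noise so a classifier can be trained for unseen classes without images. The architecture and dimensions are placeholders, and the discriminative separation term mentioned above is omitted for brevity.

```python
import torch
import torch.nn as nn

class FeatureGenerator(nn.Module):
    """Map (class semantics, noise) -> synthetic visual features."""
    def __init__(self, sem_dim: int, noise_dim: int, feat_dim: int):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(nn.Linear(sem_dim + noise_dim, 512),
                                 nn.ReLU(),
                                 nn.Linear(512, feat_dim))

    def forward(self, semantics: torch.Tensor) -> torch.Tensor:
        z = torch.randn(semantics.size(0), self.noise_dim, device=semantics.device)
        return self.net(torch.cat([semantics, z], dim=1))
```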
arXiv Detail & Related papers (2020-10-19T12:36:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.