Open-Vocabulary DETR with Conditional Matching
- URL: http://arxiv.org/abs/2203.11876v1
- Date: Tue, 22 Mar 2022 16:54:52 GMT
- Title: Open-Vocabulary DETR with Conditional Matching
- Authors: Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, Chen Change Loy
- Abstract summary: OV-DETR is an open-vocabulary detector based on DETR.
It can detect any object given its class name or an exemplar image.
It achieves non-trivial improvements over the current state of the art.
- Score: 86.1530128487077
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Open-vocabulary object detection, which is concerned with the problem of
detecting novel objects guided by natural language, has gained increasing
attention from the community. Ideally, we would like to extend an
open-vocabulary detector such that it can produce bounding box predictions
based on user inputs in the form of either natural language or an exemplar image. This
offers great flexibility and a better user experience for human-computer interaction. To
this end, we propose a novel open-vocabulary detector based on DETR -- hence
the name OV-DETR -- which, once trained, can detect any object given its class
name or an exemplar image. The biggest challenge of turning DETR into an
open-vocabulary detector is that it is impossible to calculate the
classification cost matrix of novel classes without access to their labeled
images. To overcome this challenge, we formulate the learning objective as a
binary matching one between input queries (class name or exemplar image) and
the corresponding objects, which learns useful correspondences that generalize to
unseen queries during testing. For training, we choose to condition the
Transformer decoder on the input embeddings obtained from a pre-trained
vision-language model like CLIP, in order to enable matching for both text and
image queries. With extensive experiments on LVIS and COCO datasets, we
demonstrate that our OV-DETR -- the first end-to-end Transformer-based
open-vocabulary detector -- achieves non-trivial improvements over the current
state of the art.
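To make the mechanism in the abstract concrete, here is a minimal PyTorch sketch of conditioning DETR-style object queries on a CLIP embedding and scoring each decoded object with a binary matching head. Module names, dimensions, and the fusion-by-addition step are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConditionalMatchingHead(nn.Module):
    """Sketch: condition object queries on a CLIP embedding of the input
    query (class name or exemplar image), then predict a binary
    'matches the query' score and a box for every decoded object."""

    def __init__(self, d_model: int = 256, clip_dim: int = 512, num_queries: int = 100):
        super().__init__()
        self.object_queries = nn.Embedding(num_queries, d_model)
        self.project_clip = nn.Linear(clip_dim, d_model)  # map CLIP space -> decoder space
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.match_head = nn.Linear(d_model, 1)  # binary matching logit per object query
        self.box_head = nn.Linear(d_model, 4)    # normalized (cx, cy, w, h)

    def forward(self, image_memory: torch.Tensor, clip_embedding: torch.Tensor):
        # image_memory: (B, HW, d_model) encoder features
        # clip_embedding: (B, clip_dim) from CLIP's text or image encoder
        b = image_memory.size(0)
        cond = self.project_clip(clip_embedding).unsqueeze(1)        # (B, 1, d_model)
        queries = self.object_queries.weight.unsqueeze(0).expand(b, -1, -1)
        decoded = self.decoder(queries + cond, image_memory)         # conditioned decoding
        return self.match_head(decoded).squeeze(-1), self.box_head(decoded).sigmoid()

# Dummy usage: the same head accepts a text or an image-exemplar embedding.
head = ConditionalMatchingHead()
memory = torch.randn(2, 196, 256)       # e.g. a flattened 14x14 feature map
query_emb = torch.randn(2, 512)         # CLIP embedding of a class name or exemplar crop
match_logits, boxes = head(memory, query_emb)
print(match_logits.shape, boxes.shape)  # torch.Size([2, 100]) torch.Size([2, 100, 4])
```

Because the matching target is binary (does this box correspond to the conditioned query?), no per-class cost matrix is needed, which is what lets the formulation extend to novel classes.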
Related papers
- End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting [68.37943632270505]
Open-vocabulary video visual relationship detection aims to expand video visual relationship detection beyond annotated categories.
Existing methods usually use trajectory detectors trained on closed datasets to detect object trajectories.
We propose an open-vocabulary relationship detection method that leverages the rich semantic knowledge of CLIP to discover novel relationships.
arXiv Detail & Related papers (2024-09-19T06:25:01Z)
- OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer [63.141027246418]
We propose Open-Vocabulary Light-Weighted Detection Transformer (OVLW-DETR), a deployment friendly open-vocabulary detector with strong performance and low latency.
We provide an end-to-end training recipe that transfers knowledge from a vision-language model (VLM) to the object detector via simple alignment.
Experimental results demonstrate that the proposed approach is superior to existing real-time open-vocabulary detectors on the standard zero-shot LVIS benchmark.
arXiv Detail & Related papers (2024-07-15T12:15:27Z)
- OV-DQUO: Open-Vocabulary DETR with Denoising Text Query Training and Open-World Unknown Objects Supervision [22.493305132568477]
Open-vocabulary detection aims to detect objects from novel categories beyond the base categories on which the detector is trained.
We propose OV-DQUO, an Open-Vocabulary DETR with Denoising text Query training and open-world Unknown Objects supervision.
arXiv Detail & Related papers (2024-05-28T07:33:27Z)
- Hyperbolic Learning with Synthetic Captions for Open-World Detection [26.77840603264043]
We propose to transfer knowledge from vision-language models (VLMs) to enrich the open-vocabulary descriptions automatically.
Specifically, we bootstrap dense synthetic captions using pre-trained VLMs to provide rich descriptions of different regions in images.
We also propose a novel hyperbolic vision-language learning approach to impose a hierarchy between visual and caption embeddings (a generic sketch of the hyperbolic distance involved appears after this list).
arXiv Detail & Related papers (2024-04-07T17:06:22Z)
- LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors [58.75140338866403]
DVDet is a Descriptor-Enhanced Open Vocabulary Detector.
It transforms regional embeddings into image-like representations that can be directly integrated into general open vocabulary detection training.
Extensive experiments over multiple large-scale benchmarks show that DVDet consistently outperforms the state of the art by large margins.
arXiv Detail & Related papers (2024-02-07T07:26:49Z)
- Multi-Modal Classifiers for Open-Vocabulary Object Detection [104.77331131447541]
The goal of this paper is open-vocabulary object detection (OVOD).
We adopt a standard two-stage object detector architecture.
We explore three ways to build the classifiers: via language descriptions, image exemplars, or a combination of the two (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2023-06-08T18:31:56Z)
- PromptDet: Expand Your Detector Vocabulary with Uncurated Images [47.600059694034]
The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations.
We propose a two-stage open-vocabulary object detector that categorises each box proposal by a classifier generated from the text encoder of a pre-trained visual-language model.
To scale up the learning procedure towards detecting a wider spectrum of objects, we exploit available online resources, iteratively updating the prompts and later self-training the proposed detector with pseudo labels generated on a large corpus of noisy, uncurated web images.
arXiv Detail & Related papers (2022-03-30T17:50:21Z)
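For the "Hyperbolic Learning with Synthetic Captions" entry above, the hierarchy between visual and caption embeddings is the kind of structure hyperbolic spaces encode well. Below is a generic Poincaré-ball distance in PyTorch, a standard building block for such objectives; the projection into the ball by rescaling is a simplification, and none of this is the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def poincare_distance(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Geodesic distance between points inside the unit Poincare ball:
    d(x, y) = arcosh(1 + 2||x - y||^2 / ((1 - ||x||^2)(1 - ||y||^2))).
    Points near the boundary act like leaves, points near the origin like
    roots, which is what allows a hierarchy to be embedded."""
    sq_diff = (x - y).pow(2).sum(dim=-1)
    denom = (1 - x.pow(2).sum(dim=-1)).clamp_min(eps) * (1 - y.pow(2).sum(dim=-1)).clamp_min(eps)
    return torch.acosh(1 + 2 * sq_diff / denom + eps)

# Example: rescale unit-norm CLIP-style embeddings into the ball (real methods
# use an exponential map) and compare visual and caption points.
visual = F.normalize(torch.randn(4, 512), dim=-1) * 0.7   # nearer the boundary
caption = F.normalize(torch.randn(4, 512), dim=-1) * 0.3  # nearer the origin
print(poincare_distance(visual, caption))                 # shape (4,)
```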
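For the "Multi-Modal Classifiers for Open-Vocabulary Object Detection" entry, here is a minimal sketch of the three ways to build a category classifier: from language descriptions, from image exemplars, or from both. The CLIP-style embeddings are assumed precomputed, and fusion by simple averaging is an illustrative choice, not the paper's recipe.

```python
from typing import Optional

import torch
import torch.nn.functional as F

def make_classifier(text_embs: Optional[torch.Tensor],
                    exemplar_embs: Optional[torch.Tensor]) -> torch.Tensor:
    """Return one unit-norm classifier vector for a single category.

    text_embs:     (T, D) embeddings of T language descriptions, or None
    exemplar_embs: (K, D) embeddings of K image exemplars, or None
    """
    parts = []
    if text_embs is not None:
        parts.append(F.normalize(text_embs.mean(dim=0), dim=-1))
    if exemplar_embs is not None:
        parts.append(F.normalize(exemplar_embs.mean(dim=0), dim=-1))
    if not parts:
        raise ValueError("need text descriptions, image exemplars, or both")
    return F.normalize(torch.stack(parts).mean(dim=0), dim=-1)

# Score detector region features against per-category classifiers by cosine similarity.
D = 512
classifiers = torch.stack([
    make_classifier(torch.randn(3, D), None),               # text-only category
    make_classifier(None, torch.randn(5, D)),               # exemplar-only category
    make_classifier(torch.randn(3, D), torch.randn(5, D)),  # combined category
])                                                          # (3, D)
regions = F.normalize(torch.randn(10, D), dim=-1)           # (10, D) region embeddings
scores = regions @ classifiers.T                            # (10, 3) cosine logits
print(scores.argmax(dim=-1))                                # predicted category per region
```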
This list is automatically generated from the titles and abstracts of the papers on this site.