OV-DQUO: Open-Vocabulary DETR with Denoising Text Query Training and Open-World Unknown Objects Supervision
- URL: http://arxiv.org/abs/2405.17913v2
- Date: Wed, 21 Aug 2024 02:40:34 GMT
- Title: OV-DQUO: Open-Vocabulary DETR with Denoising Text Query Training and Open-World Unknown Objects Supervision
- Authors: Junjie Wang, Bin Chen, Bin Kang, Yulin Li, YiChi Chen, Weizhi Xian, Huifeng Chang, Yong Xu
- Abstract summary: Open-vocabulary detection aims to detect objects from novel categories beyond the base categories on which the detector is trained.
We propose OV-DQUO, an Open-Vocabulary DETR with Denoising text Query training and open-world Unknown Objects supervision.
- Score: 22.493305132568477
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-vocabulary detection aims to detect objects from novel categories beyond the base categories on which the detector is trained. However, existing open-vocabulary detectors trained on base category data tend to assign higher confidence to trained categories and confuse novel categories with the background. To resolve this, we propose OV-DQUO, an \textbf{O}pen-\textbf{V}ocabulary DETR with \textbf{D}enoising text \textbf{Q}uery training and open-world \textbf{U}nknown \textbf{O}bjects supervision. Specifically, we introduce a wildcard matching method. This method enables the detector to learn from pairs of unknown objects recognized by the open-world detector and text embeddings with general semantics, mitigating the confidence bias between base and novel categories. Additionally, we propose a denoising text query training strategy. It synthesizes foreground and background query-box pairs from open-world unknown objects to train the detector through contrastive learning, enhancing its ability to distinguish novel objects from the background. We conducted extensive experiments on the challenging OV-COCO and OV-LVIS benchmarks, achieving new state-of-the-art results of 45.6 AP50 and 39.3 mAP on novel categories respectively, without the need for additional training data. Models and code are released at \url{https://github.com/xiaomoguhz/OV-DQUO}
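The denoising text query training described above can be illustrated with a minimal, hypothetical sketch (NumPy only; all function and variable names are illustrative, not from the released code): foreground queries synthesized near a general-semantics text embedding should score high against it, background queries should score low, and a contrastive (here, sigmoid cross-entropy) objective separates the two.

```python
import numpy as np

def cosine(a, b):
    """Row-wise cosine similarity between two sets of vectors."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def denoising_contrastive_loss(fg_queries, bg_queries, text_emb, tau=0.07):
    """Toy contrastive objective: foreground query-box pairs should match
    the general-semantics text embedding; background pairs should not."""
    queries = np.vstack([fg_queries, bg_queries])
    logits = cosine(queries, text_emb[None, :]).squeeze(-1) / tau
    labels = np.concatenate([np.ones(len(fg_queries)), np.zeros(len(bg_queries))])
    probs = 1.0 / (1.0 + np.exp(-logits))       # sigmoid over similarity logits
    eps = 1e-9
    return float(-np.mean(labels * np.log(probs + eps)
                          + (1 - labels) * np.log(1 - probs + eps)))

rng = np.random.default_rng(0)
text = rng.normal(size=256)                      # "object" wildcard embedding
fg = text + 0.1 * rng.normal(size=(4, 256))      # noisy queries near the embedding
bg = rng.normal(size=(4, 256))                   # unrelated background queries
loss = denoising_contrastive_loss(fg, bg, text)
```

Minimizing such a loss pushes the detector to separate novel foreground objects from background, which is the stated goal of the training strategy; the real method operates on DETR decoder queries and open-world proposals rather than random vectors.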
Related papers
- OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer [63.141027246418]
We propose Open-Vocabulary Light-Weighted Detection Transformer (OVLW-DETR), a deployment friendly open-vocabulary detector with strong performance and low latency.
We provide an end-to-end training recipe that transfers knowledge from a vision-language model (VLM) to the object detector with simple alignment.
Experimental results demonstrate that the proposed approach is superior to existing real-time open-vocabulary detectors on the standard zero-shot LVIS benchmark.
arXiv Detail & Related papers (2024-07-15T12:15:27Z) - Learning Background Prompts to Discover Implicit Knowledge for Open Vocabulary Object Detection [101.15777242546649]
Open vocabulary object detection (OVD) aims at seeking an optimal object detector capable of recognizing objects from both base and novel categories.
Recent advances leverage knowledge distillation to transfer insightful knowledge from pre-trained large-scale vision-language models to the task of object detection.
We present a novel OVD framework termed LBP to propose learning background prompts to harness explored implicit background knowledge.
arXiv Detail & Related papers (2024-06-01T17:32:26Z) - Hyperbolic Learning with Synthetic Captions for Open-World Detection [26.77840603264043]
We propose to transfer knowledge from vision-language models (VLMs) to enrich the open-vocabulary descriptions automatically.
Specifically, we bootstrap dense synthetic captions using pre-trained VLMs to provide rich descriptions on different regions in images.
We also propose a novel hyperbolic vision-language learning approach to impose a hierarchy between visual and caption embeddings.
arXiv Detail & Related papers (2024-04-07T17:06:22Z) - Open-Vocabulary Object Detection with Meta Prompt Representation and Instance Contrastive Optimization [63.66349334291372]
We propose a framework with Meta prompt and Instance Contrastive learning (MIC) schemes.
Firstly, we simulate a novel-class-emerging scenario to help the learned class and background prompts generalize to novel classes.
Secondly, we design an instance-level contrastive strategy to promote intra-class compactness and inter-class separation, which benefits the detector's generalization to novel-class objects.
arXiv Detail & Related papers (2024-03-14T14:25:10Z) - LP-OVOD: Open-Vocabulary Object Detection by Linear Probing [8.202076059391315]
An object detector must identify both seen and unseen classes in test images without labeled examples of the unseen classes in training.
A typical approach for OVOD is to use joint text-image embeddings of CLIP to assign box proposals to their closest text label.
This method has a critical issue: many low-quality boxes, such as over- and under-covered object boxes, have the same similarity score as high-quality boxes, since CLIP is not trained on exact object location information.
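The typical CLIP-based assignment described above can be sketched as follows (a hypothetical illustration, not LP-OVOD's code): embed each box proposal and each class name in the joint space, then assign every proposal to its nearest text label by cosine similarity.

```python
import numpy as np

def assign_proposals(box_feats, text_feats):
    """Assign each box proposal to its closest text label by cosine similarity."""
    b = box_feats / np.linalg.norm(box_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sims = b @ t.T                        # (num_boxes, num_classes)
    return sims.argmax(axis=1), sims.max(axis=1)

rng = np.random.default_rng(1)
text_feats = rng.normal(size=(3, 128))    # stand-ins for e.g. "cat", "dog", "car"
# proposals built near class embeddings 2 and 0, plus small noise
box_feats = text_feats[[2, 0]] + 0.05 * rng.normal(size=(2, 128))
labels, scores = assign_proposals(box_feats, text_feats)
# labels -> array([2, 0]): each proposal matches the class it was built from
```

Note that this assignment uses appearance similarity only, which is exactly why poorly localized boxes can receive the same score as well-localized ones, the failure mode the paper targets.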
We propose a novel method, LP-OVOD, that discards low-quality boxes by training a
arXiv Detail & Related papers (2023-10-26T02:37:08Z) - Learning Object-Language Alignments for Open-Vocabulary Object Detection [83.09560814244524]
We propose a novel open-vocabulary object detection framework directly learning from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way.
arXiv Detail & Related papers (2022-11-27T14:47:31Z) - Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization [73.14053674836838]
Open-vocabulary object detection (OVD) aims to scale up vocabulary size to detect objects of novel categories beyond the training vocabulary.
Recent work resorts to the rich knowledge in pre-trained vision-language models.
We present MEDet, a novel OVD framework with proposal mining and prediction equalization.
arXiv Detail & Related papers (2022-06-22T14:30:41Z) - PromptDet: Expand Your Detector Vocabulary with Uncurated Images [47.600059694034]
The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations.
We propose a two-stage open-vocabulary object detector that categorises each box proposal by a classifier generated from the text encoder of a pre-trained visual-language model.
To scale up the learning procedure towards detecting a wider spectrum of objects, we exploit available online resources, iteratively updating the prompts and later self-training the proposed detector with pseudo labels generated on a large corpus of noisy, uncurated web images.
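The self-training step described above can be sketched as a simple confidence-threshold pseudo-labeling pass (a hypothetical illustration; PromptDet's actual pipeline also involves prompt updates and web-image sourcing):

```python
import numpy as np

def pseudo_label(scores, boxes, thresh=0.8):
    """Keep only confident detections on uncurated images as pseudo labels
    for the next round of detector training."""
    keep = scores >= thresh
    return boxes[keep], scores[keep]

# toy detections on one uncurated web image: (score, [x1, y1, x2, y2])
scores = np.array([0.95, 0.40, 0.85, 0.10])
boxes = np.array([[0, 0, 10, 10],
                  [5, 5, 20, 20],
                  [2, 2, 8, 8],
                  [1, 1, 3, 3]])
pl_boxes, pl_scores = pseudo_label(scores, boxes)
# two boxes survive the 0.8 threshold and become training targets
```

The threshold trades pseudo-label precision against recall; iterating detection, filtering, and retraining is what lets the detector expand its vocabulary without manual annotation.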
arXiv Detail & Related papers (2022-03-30T17:50:21Z) - Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model [34.85604521903056]
We introduce a novel method, detection prompt (DetPro), to learn continuous prompt representations for open-vocabulary object detection.
We assemble DetPro with ViLD, a recent state-of-the-art open-world object detector.
Experimental results show that our DetPro outperforms the baseline ViLD in all settings.
arXiv Detail & Related papers (2022-03-28T17:50:26Z) - Open-Vocabulary DETR with Conditional Matching [86.1530128487077]
OV-DETR is an open-vocabulary detector based on DETR.
It can detect any object given its class name or an exemplar image.
It achieves non-trivial improvements over the current state of the art.
arXiv Detail & Related papers (2022-03-22T16:54:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.