Learning to Prompt for Open-Vocabulary Object Detection with
Vision-Language Model
- URL: http://arxiv.org/abs/2203.14940v1
- Date: Mon, 28 Mar 2022 17:50:26 GMT
- Title: Learning to Prompt for Open-Vocabulary Object Detection with
Vision-Language Model
- Authors: Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, Guoqi Li
- Abstract summary: We introduce a novel method, detection prompt (DetPro), to learn continuous prompt representations for open-vocabulary object detection.
We assemble DetPro with ViLD, a recent state-of-the-art open-world object detector.
Experimental results show that our DetPro outperforms the baseline ViLD in all settings.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, vision-language pre-training has shown great potential in
open-vocabulary object detection, where detectors trained on base classes are
devised to detect novel classes. The class text embeddings are first generated
by feeding prompts to the text encoder of a pre-trained vision-language model.
They are then used as the region classifier to supervise the training of a
detector. The key element behind the success of this paradigm is the proper
prompt, which requires careful word tuning and ingenious design. To avoid
laborious prompt engineering, several prompt representation learning methods
have been proposed for the image classification task; however, they can only
be sub-optimal solutions when applied to the detection task. In this paper, we
introduce a novel method, detection prompt (DetPro), to learn continuous
prompt representations for open-vocabulary object detection based on a
pre-trained vision-language model. Unlike previous classification-oriented
methods, DetPro has two highlights: 1) a background interpretation scheme that
includes proposals from the image background in prompt training; 2) a context
grading scheme that separates proposals in the image foreground for tailored
prompt training. We assemble DetPro with ViLD, a recent state-of-the-art
open-world object detector, and conduct experiments on LVIS as well as
transfer-learning experiments on the Pascal VOC, COCO, and Objects365
datasets. Experimental results show that DetPro outperforms the baseline ViLD
in all settings, e.g., +3.4 box AP and +3.0 mask AP on the novel classes of
LVIS. Code and models are available at https://github.com/dyabel/detpro.
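To make the mechanism concrete, below is a minimal sketch of learning continuous prompt representations and using the resulting class text embeddings as an open-vocabulary region classifier. It assumes a frozen CLIP-style text encoder; the module names, context length, temperature, and the background loss are illustrative placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetectionPrompt(nn.Module):
    """Continuous prompt: learnable context vectors shared across classes,
    prepended to each class-name token embedding (CoOp/DetPro style)."""

    def __init__(self, text_encoder, class_token_embeds, ctx_len=8, dim=512):
        super().__init__()
        self.text_encoder = text_encoder        # frozen VLM text encoder
        self.class_tokens = class_token_embeds  # (num_classes, n_tokens, dim)
        self.ctx = nn.Parameter(0.02 * torch.randn(ctx_len, dim))

    def class_embeddings(self):
        n_cls = self.class_tokens.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        seq = torch.cat([ctx, self.class_tokens], dim=1)  # [ctx | CLASS] tokens
        # The encoder is assumed to map a token sequence to one embedding.
        return F.normalize(self.text_encoder(seq), dim=-1)

def region_logits(region_feats, class_embeds, tau=0.01):
    """Open-vocabulary region classifier: cosine similarity between RoI
    features and the class text embeddings, scaled by a temperature."""
    return F.normalize(region_feats, dim=-1) @ class_embeds.t() / tau

def background_loss(bg_feats, class_embeds, tau=0.01):
    """Background interpretation (illustrative stand-in, not the exact DetPro
    objective): discourage background proposals from matching any class."""
    probs = region_logits(bg_feats, class_embeds, tau).softmax(dim=-1)
    return probs.max(dim=-1).values.mean()
```

In this setup only the context vectors are trained, with detection-specific objectives such as the background interpretation and context grading schemes, while the text encoder stays frozen.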
Related papers
- Multi-Modal Classifiers for Open-Vocabulary Object Detection [104.77331131447541]
The goal of this paper is open-vocabulary object detection (OVOD).
We adopt a standard two-stage object detector architecture.
We explore three ways of specifying novel classes: language descriptions, image
exemplars, or a combination of the two.
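As a rough illustration of how a classifier can be built from either modality, the sketch below normalizes a text embedding, averages image-exemplar embeddings, or mixes the two; the mixing weight and the pre-computed embeddings are assumptions, not the paper's method.

```python
import torch.nn.functional as F

def multimodal_classifier(text_embed=None, exemplar_embeds=None, alpha=0.5):
    """Build one class embedding from a text description, a set of image
    exemplar embeddings, or a weighted combination of the two. All vectors
    are L2-normalized; `alpha` weights the text modality."""
    assert text_embed is not None or exemplar_embeds is not None
    parts = []
    if text_embed is not None:
        parts.append(alpha * F.normalize(text_embed, dim=-1))
    if exemplar_embeds is not None:
        vis = F.normalize(exemplar_embeds, dim=-1).mean(dim=0)  # average exemplars
        w = (1.0 - alpha) if text_embed is not None else 1.0
        parts.append(w * F.normalize(vis, dim=-1))
    return F.normalize(sum(parts), dim=-1)  # final class embedding
```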
arXiv Detail & Related papers (2023-06-08T18:31:56Z)
- Three ways to improve feature alignment for open vocabulary detection [88.65076922242184]
A key problem in zero-shot open-vocabulary detection is how to align visual and
text features so that the detector performs well on unseen classes.
Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining.
We propose three methods to alleviate these issues. First, a simple scheme is
used to augment the text embeddings, which prevents overfitting to a small
number of classes seen during training.
Second, the feature pyramid network and the detection head are modified to
include trainable shortcuts.
Finally, a self-training approach is used to leverage a larger corpus of
image-text pairs.
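One common way to realize such trainable shortcuts (an assumption about the general technique, not necessarily this paper's exact design) is a zero-initialized residual adapter, so training starts exactly at the pretrained vision-text alignment and only gradually deviates from it.

```python
import torch.nn as nn

class TrainableShortcut(nn.Module):
    """Residual adapter around a frozen, pretrained block. The 1x1 conv is
    zero-initialized, so the module initially reproduces the pretrained
    output and the vision-text alignment is preserved at the start."""

    def __init__(self, pretrained_block, channels):
        super().__init__()
        self.block = pretrained_block
        self.adapter = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.adapter.weight)
        nn.init.zeros_(self.adapter.bias)

    def forward(self, x):
        y = self.block(x)
        return y + self.adapter(y)  # trainable deviation from pretrained output
```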
arXiv Detail & Related papers (2023-03-23T17:59:53Z)
- Turning a CLIP Model into a Scene Text Detector [56.86413150091367]
Recently, pretraining approaches based on vision-language models have made
effective progress in the field of text detection.
This paper proposes a new method, termed TCM, which Turns the CLIP Model
directly into a scene text detector without a dedicated pretraining process.
arXiv Detail & Related papers (2023-02-28T06:06:12Z)
- Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for
Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable the VLM to obtain fine-grained alignment
by using learnable text prompts to solve an auxiliary dense pixel-wise
prediction task.
Experiments show that our method achieves state-of-the-art performance for
open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
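A plausible form of such an auxiliary dense prediction task is sketched below: every pixel embedding is scored against prompt-conditioned class embeddings and supervised with per-pixel labels. The shapes, temperature, and cross-entropy choice are assumptions, not VTP-OVD's actual implementation.

```python
import torch.nn.functional as F

def dense_alignment_loss(pixel_feats, class_embeds, pixel_labels, tau=0.07):
    """pixel_feats: (B, D, H, W) image features; class_embeds: (C, D) text
    embeddings produced from learnable prompts; pixel_labels: (B, H, W)
    integer class ids. Per-pixel classification against the text embeddings
    ties the VLM's text space to dense visual features."""
    B, D, H, W = pixel_feats.shape
    feats = F.normalize(pixel_feats, dim=1).permute(0, 2, 3, 1).reshape(-1, D)
    logits = feats @ F.normalize(class_embeds, dim=-1).t() / tau  # (B*H*W, C)
    return F.cross_entropy(logits, pixel_labels.reshape(-1))
```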
arXiv Detail & Related papers (2022-11-02T03:38:02Z)
- Localized Vision-Language Matching for Open-vocabulary Object Detection [41.98293277826196]
We propose an open-world object detection method that learns to detect novel object classes along with a given set of known classes.
It is a two-stage training approach that first uses a location-guided image-caption matching technique to learn class labels.
We show that a simple language model fits better than a large contextualized language model for detecting novel objects.
arXiv Detail & Related papers (2022-05-12T15:34:37Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- PromptDet: Expand Your Detector Vocabulary with Uncurated Images [47.600059694034]
The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations.
We propose a two-stage open-vocabulary object detector that categorises each box proposal by a classifier generated from the text encoder of a pre-trained visual-language model.
To scale up the learning procedure towards detecting a wider spectrum of
objects, we exploit available online resources, iteratively updating the
prompts and later self-training the proposed detector with pseudo labels
generated on a large corpus of noisy, uncurated web images.
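That self-training loop might look like the following sketch; `predict`, `train_on`, the score threshold, and the number of rounds are hypothetical stand-ins, not PromptDet's actual API.

```python
def self_train(detector, web_images, score_thresh=0.8, rounds=2):
    """Pseudo-label uncurated web images with the current detector, keep only
    confident boxes, retrain on them, and repeat for a few rounds."""
    for _ in range(rounds):
        pseudo_set = []
        for img in web_images:
            boxes = detector.predict(img)  # hypothetical inference API
            keep = [b for b in boxes if b.score >= score_thresh]
            if keep:
                pseudo_set.append((img, keep))
        detector.train_on(pseudo_set)      # hypothetical training API
    return detector
```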
arXiv Detail & Related papers (2022-03-30T17:50:21Z)