Prompt-Guided Transformers for End-to-End Open-Vocabulary Object
Detection
- URL: http://arxiv.org/abs/2303.14386v1
- Date: Sat, 25 Mar 2023 07:31:08 GMT
- Title: Prompt-Guided Transformers for End-to-End Open-Vocabulary Object
Detection
- Authors: Hwanjun Song, Jihwan Bang
- Abstract summary: Prompt-OVD is an efficient and effective framework for open-vocabulary object detection.
It uses class embeddings from CLIP as prompts, guiding the Transformer decoder to detect objects in both base and novel classes.
Experiments on the OV-COCO and OV-LVIS datasets demonstrate that Prompt-OVD achieves 21.2 times faster inference than OV-DETR, the first end-to-end open-vocabulary detection method.
- Score: 10.482805367361818
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prompt-OVD is an efficient and effective framework for open-vocabulary object
detection that utilizes class embeddings from CLIP as prompts, guiding the
Transformer decoder to detect objects in both base and novel classes.
Additionally, our novel RoI-based masked attention and RoI pruning techniques
help leverage the zero-shot classification ability of the Vision
Transformer-based CLIP, resulting in improved detection performance at minimal
computational cost. Our experiments on the OV-COCO and OV-LVIS datasets
demonstrate that Prompt-OVD achieves an impressive 21.2 times faster inference
speed than the first end-to-end open-vocabulary detection method (OV-DETR),
while also achieving higher APs than four two-stage-based methods operating
within similar inference time ranges. Code will be made available soon.
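To make the decoder prompting concrete, below is a minimal PyTorch sketch of the idea as the abstract describes it: frozen CLIP text embeddings for all class names act both as prompts fed to the Transformer decoder and as the classifier for the decoded boxes, followed by a simple top-k RoI pruning step. Every module name, dimension, and the exact pruning heuristic here are illustrative assumptions, not the authors' released implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptGuidedDecoder(nn.Module):
    def __init__(self, d_model=256, clip_dim=512, num_queries=100, num_layers=6):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)
        self.prompt_proj = nn.Linear(clip_dim, d_model)   # CLIP space -> decoder space
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.box_head = nn.Linear(d_model, 4)             # (cx, cy, w, h), sigmoid-normalized
        self.out_proj = nn.Linear(d_model, clip_dim)      # decoder space -> CLIP space

    def forward(self, image_feats, class_embeds):
        # image_feats:  (B, HW, d_model) flattened encoder tokens
        # class_embeds: (C, clip_dim) frozen CLIP text embeddings (base + novel classes)
        B = image_feats.size(0)
        prompts = self.prompt_proj(class_embeds).unsqueeze(0).expand(B, -1, -1)
        queries = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        tgt = torch.cat([prompts, queries], dim=1)        # prompts visible to every layer
        hs = self.decoder(tgt, image_feats)
        hs = hs[:, prompts.size(1):]                      # drop prompt slots, keep object queries
        boxes = self.box_head(hs).sigmoid()
        # Open-vocabulary classification: cosine similarity to the same CLIP embeddings.
        obj = F.normalize(self.out_proj(hs), dim=-1)
        cls = F.normalize(class_embeds, dim=-1)
        logits = obj @ cls.t()                            # (B, num_queries, C)
        return boxes, logits

def prune_rois(boxes, logits, keep=30):
    # RoI pruning (one plausible reading of the abstract): retain only the
    # top-scoring boxes so that any subsequent per-RoI CLIP computation
    # touches few regions, keeping the extra cost minimal.
    scores = logits.max(dim=-1).values                    # (B, Q)
    idx = scores.topk(keep, dim=-1).indices               # (B, keep)
    b = torch.arange(boxes.size(0)).unsqueeze(-1)         # broadcast batch index
    return boxes[b, idx], logits[b, idx]
```
The property the sketch preserves is what makes the method open-vocabulary: adding novel classes requires no retraining, since swapping in new CLIP text embeddings changes both the decoder prompts and the classifier at once.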
Related papers
- C2P-CLIP: Injecting Category Common Prompt in CLIP to Enhance Generalization in Deepfake Detection [98.34703790782254]
We introduce Category Common Prompt CLIP, which integrates the category common prompt into the text encoder to inject category-related concepts into the image encoder.
Our method achieves a 12.41% improvement in detection accuracy compared to the original CLIP, without introducing additional parameters during testing.
arXiv Detail & Related papers (2024-08-19T02:14:25Z)
- OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer [63.141027246418]
We propose Open-Vocabulary Light-Weighted Detection Transformer (OVLW-DETR), a deployment friendly open-vocabulary detector with strong performance and low latency.
We provide an end-to-end training recipe that transfers knowledge from a vision-language model (VLM) to the object detector through simple alignment.
Experimental results demonstrate that the proposed approach outperforms existing real-time open-vocabulary detectors on the standard zero-shot LVIS benchmark.
arXiv Detail & Related papers (2024-07-15T12:15:27Z)
- DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection [111.68263493302499]
We introduce DetCLIPv3, a high-performing detector that excels at both open-vocabulary object detection and generating hierarchical labels for detected objects.
DetCLIPv3 is characterized by three core designs: 1) Versatile model architecture; 2) High information density data; and 3) Efficient training strategy.
DetCLIPv3 demonstrates superior open-vocabulary detection performance, outperforming GLIPv2, GroundingDINO, and DetCLIPv2 by 18.0/19.6/6.6 AP, respectively.
arXiv Detail & Related papers (2024-04-14T11:01:44Z)
- OpenSD: Unified Open-Vocabulary Segmentation and Detection [24.08879095731279]
We present a universal transformer-based framework, abbreviated as OpenSD, to handle open-vocabulary segmentation and detection tasks.
To better leverage CLIP for end-to-end segmentation and detection, we propose dual classifiers to handle the in-vocabulary and out-of-vocabulary domains (sketched below).
The results demonstrate that OpenSD outperforms state-of-the-art open-vocabulary segmentation and detection methods in both closed- and open-vocabulary settings.
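A hedged sketch of the dual-classifier idea described above: a learned linear head scores in-vocabulary (base) classes, while frozen CLIP text embeddings score out-of-vocabulary (novel) classes. The fusion rule, temperature, and all names are illustrative assumptions, not OpenSD's actual code.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualClassifier(nn.Module):
    def __init__(self, d_model, num_base, clip_dim, novel_text_embeds):
        super().__init__()
        self.in_vocab = nn.Linear(d_model, num_base)   # trained on base-class annotations
        self.to_clip = nn.Linear(d_model, clip_dim)    # projects query features to CLIP space
        # Frozen CLIP text embeddings for novel class names: (num_novel, clip_dim).
        self.register_buffer("novel_text", F.normalize(novel_text_embeds, dim=-1))

    def forward(self, query_feats):
        # query_feats: (B, Q, d_model) decoded object-query features
        base_logits = self.in_vocab(query_feats)
        q = F.normalize(self.to_clip(query_feats), dim=-1)
        novel_logits = q @ self.novel_text.t() / 0.07  # temperature-scaled cosine similarity
        return torch.cat([base_logits, novel_logits], dim=-1)
```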
arXiv Detail & Related papers (2023-12-10T08:51:34Z)
- Toward Open Vocabulary Aerial Object Detection with CLIP-Activated Student-Teacher Learning [13.667326007851674]
We propose CastDet, a CLIP-activated student-teacher open-vocabulary object detection framework.
Our approach improves both the quality of novel object proposals and their classification.
Experimental results demonstrate that CastDet achieves superior open-vocabulary detection performance.
arXiv Detail & Related papers (2023-11-20T10:26:04Z)
- Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable the VLM to obtain fine-grained alignment by using learnable text prompts (sketched below) to solve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves the state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
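A hedged sketch of what "learnable text prompts" usually means in this setting: instead of a fixed template like "a photo of a {class}", a few context vectors are learned directly in the text encoder's embedding space and prepended to each class-name embedding (CoOp-style). How VTP-OVD wires these into its dense auxiliary task is not shown here; all names and shapes are illustrative assumptions.
```python
import torch
import torch.nn as nn

class LearnableTextPrompt(nn.Module):
    def __init__(self, text_encoder, class_token_embeds, n_ctx=8, embed_dim=512):
        super().__init__()
        # n_ctx trainable context vectors, shared across all classes.
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)
        self.class_embeds = class_token_embeds   # (C, L, embed_dim) frozen token embeddings
        self.text_encoder = text_encoder         # frozen CLIP text transformer (assumed
                                                 # to map token embeddings to a (C, d) output)

    def forward(self):
        C = self.class_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(C, -1, -1)        # (C, n_ctx, d)
        tokens = torch.cat([ctx, self.class_embeds], dim=1)  # prepend learned context
        return self.text_encoder(tokens)                     # per-class text embeddings
```
Only the context vectors receive gradients, so the prompt adapts the frozen VLM to the detection domain at negligible parameter cost.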
arXiv Detail & Related papers (2022-11-02T03:38:02Z)
- Efficient Decoder-free Object Detection with Transformers [75.00499377197475]
Vision transformers (ViTs) are changing the landscape of object detection approaches.
We propose a decoder-free fully transformer-based (DFFT) object detector.
DFFT_SMALL achieves high efficiency in both training and inference stages.
arXiv Detail & Related papers (2022-06-14T13:22:19Z)
- Open-Vocabulary DETR with Conditional Matching [86.1530128487077]
OV-DETR is an open-vocabulary detector based on DETR.
It can detect any object given its class name or an exemplar image.
It achieves non-trivial improvements over the current state of the art.
arXiv Detail & Related papers (2022-03-22T16:54:52Z)
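Since OV-DETR is the baseline Prompt-OVD is compared against, here is a hedged sketch of its conditional-matching idea: a CLIP embedding of either a class name or an exemplar image is fused into every object query, and the head only predicts whether each box matches that condition. The fusion choice (addition) and all names are illustrative assumptions, not the paper's exact formulation.
```python
import torch
import torch.nn as nn

class ConditionalQueries(nn.Module):
    def __init__(self, d_model=256, clip_dim=512, num_queries=100):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)
        self.cond_proj = nn.Linear(clip_dim, d_model)
        self.match_head = nn.Linear(d_model, 1)   # binary: does this box match the condition?

    def forward(self, decode, image_feats, cond_embed):
        # decode:     any DETR-style decoder, called as decode(tgt, memory)
        # cond_embed: (B, clip_dim) CLIP text OR image embedding of the target class
        cond = self.cond_proj(cond_embed).unsqueeze(1)   # (B, 1, d_model)
        tgt = self.queries.weight.unsqueeze(0) + cond    # condition every object query
        hs = decode(tgt, image_feats)                    # (B, Q, d_model)
        return self.match_head(hs).squeeze(-1)           # (B, Q) per-box match scores
```
Because classification collapses to binary matching against the conditioning embedding, the detector must be run once per queried class, which is the inference cost Prompt-OVD's single-pass prompting is designed to avoid.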
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.