DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection
- URL: http://arxiv.org/abs/2404.09216v1
- Date: Sun, 14 Apr 2024 11:01:44 GMT
- Title: DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection
- Authors: Lewei Yao, Renjie Pi, Jianhua Han, Xiaodan Liang, Hang Xu, Wei Zhang, Zhenguo Li, Dan Xu,
- Abstract summary: We introduce DetCLIPv3, a high-performing detector that excels at both open-vocabulary object detection and hierarchical labels.
DetCLIPv3 is characterized by three core designs: 1) Versatile model architecture; 2) High information density data; and 3) Efficient training strategy.
DetCLIPv3 demonstrates superior open-vocabulary detection performance, outperforming GLIPv2, GroundingDINO, and DetCLIPv2 by 18.0/19.6/6.6 AP, respectively.
- Score: 111.68263493302499
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Existing open-vocabulary object detectors typically require a predefined set of categories from users, significantly confining their application scenarios. In this paper, we introduce DetCLIPv3, a high-performing detector that excels not only at both open-vocabulary object detection, but also generating hierarchical labels for detected objects. DetCLIPv3 is characterized by three core designs: 1. Versatile model architecture: we derive a robust open-set detection framework which is further empowered with generation ability via the integration of a caption head. 2. High information density data: we develop an auto-annotation pipeline leveraging visual large language model to refine captions for large-scale image-text pairs, providing rich, multi-granular object labels to enhance the training. 3. Efficient training strategy: we employ a pre-training stage with low-resolution inputs that enables the object captioner to efficiently learn a broad spectrum of visual concepts from extensive image-text paired data. This is followed by a fine-tuning stage that leverages a small number of high-resolution samples to further enhance detection performance. With these effective designs, DetCLIPv3 demonstrates superior open-vocabulary detection performance, \eg, our Swin-T backbone model achieves a notable 47.0 zero-shot fixed AP on the LVIS minival benchmark, outperforming GLIPv2, GroundingDINO, and DetCLIPv2 by 18.0/19.6/6.6 AP, respectively. DetCLIPv3 also achieves a state-of-the-art 19.7 AP in dense captioning task on VG dataset, showcasing its strong generative capability.
Related papers
- Comprehensive Multi-Modal Prototypes are Simple and Effective Classifiers for Vast-Vocabulary Object Detection [68.26282316080558]
Current open-world detectors can recognize a broader range of vocabularies, despite being trained on limited categories.
We introduce Prova, a prototype classifier for vast-vocabulary object detection.
arXiv Detail & Related papers (2024-12-23T18:57:43Z) - Weakly Supervised Open-Vocabulary Object Detection [31.605276665964787]
We propose a novel weakly supervised open-vocabulary object detection framework, namely WSOVOD, to extend traditional WSOD.
To achieve this, we explore three vital strategies, including dataset-level feature adaptation, image-level salient object localization, and region-level vision-language alignment.
arXiv Detail & Related papers (2023-12-19T18:59:53Z) - Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment [55.11291053011696]
This work presents a framework for dealing with 3D scene understanding when the labeled scenes are quite limited.
To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy.
In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark.
arXiv Detail & Related papers (2023-12-01T15:47:04Z) - Toward Open Vocabulary Aerial Object Detection with CLIP-Activated Student-Teacher Learning [13.667326007851674]
We propose CastDet, a CLIP-activated student-teacher open-vocabulary object detection framework.
Our approach boosts not only novel object proposals but also classification.
Experimental results demonstrate our CastDet achieving superior open-vocabulary detection performance.
arXiv Detail & Related papers (2023-11-20T10:26:04Z) - Multi-Modal Classifiers for Open-Vocabulary Object Detection [104.77331131447541]
The goal of this paper is open-vocabulary object detection (OVOD)
We adopt a standard two-stage object detector architecture.
We explore three ways via: language descriptions, image exemplars, or a combination of the two.
arXiv Detail & Related papers (2023-06-08T18:31:56Z) - Bridging the Gap between Object and Image-level Representations for
Open-Vocabulary Detection [54.96069171726668]
Two popular forms of weak-supervision used in open-vocabulary detection (OVD) include pretrained CLIP model and image-level supervision.
We propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model.
We establish a bridge between the above two object-alignment strategies via a novel weight transfer function.
arXiv Detail & Related papers (2022-07-07T17:59:56Z) - Improving Point Cloud Semantic Segmentation by Learning 3D Object
Detection [102.62963605429508]
Point cloud semantic segmentation plays an essential role in autonomous driving.
Current 3D semantic segmentation networks focus on convolutional architectures that perform great for well represented classes.
We propose a novel Aware 3D Semantic Detection (DASS) framework that explicitly leverages localization features from an auxiliary 3D object detection task.
arXiv Detail & Related papers (2020-09-22T14:17:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.