What Makes Good Open-Vocabulary Detector: A Disassembling Perspective
- URL: http://arxiv.org/abs/2309.00227v1
- Date: Fri, 1 Sep 2023 03:03:50 GMT
- Title: What Makes Good Open-Vocabulary Detector: A Disassembling Perspective
- Authors: Jincheng Li, Chunyu Xie, Xiaoyu Wu, Bin Wang, Dawei Leng
- Abstract summary: Open-vocabulary detection (OVD) is a new object detection paradigm, aiming to localize and recognize unseen objects defined by an unbounded vocabulary.
Previous works mainly focus on the open-vocabulary classification part, with less attention paid to the localization part.
We show in this work that improving localization and cross-modal classification complement each other and jointly compose a good OVD detector.
- Score: 6.623703413255309
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary detection (OVD) is a new object detection paradigm, aiming to
localize and recognize unseen objects defined by an unbounded vocabulary. This
is challenging since traditional detectors can only learn from pre-defined
categories and thus fail to detect and localize objects out of pre-defined
vocabulary. To handle this challenge, OVD leverages pre-trained cross-modal
vision-language models (VLMs) such as CLIP and ALIGN. Previous works mainly
focus on the open-vocabulary classification part, with less attention paid to
the localization part. We argue that for a good OVD detector, both
classification and localization should be studied in parallel for novel object
categories. We show in this work that improving localization and cross-modal
classification complement each other and jointly compose a good OVD detector.
We analyze three families of
OVD methods with different design emphases. We first propose a vanilla
method, i.e., cropping the region inside each bounding box produced by a
localizer, resizing it, and feeding it to CLIP. We next introduce a second
approach that combines a standard two-stage object detector with CLIP. A
two-stage object detector includes a visual backbone, a region proposal
network (RPN), and a region of interest (RoI) head. We decouple the RPN and
RoI head (DRR) and use RoIAlign to extract region features, which avoids
resizing objects. To further reduce training time and model parameters, we
couple the RPN and RoI head (CRR) as the third approach. We conduct extensive
experiments on these
three types of approaches in different settings. On the OVD-COCO benchmark, DRR
obtains the best performance, achieving 35.8 Novel AP$_{50}$, an absolute
2.8-point gain over the previous state-of-the-art (SOTA). On OVD-LVIS, DRR
surpasses the previous SOTA by 1.9 AP$_{50}$ on rare categories. We also
release an object detection dataset called PID and provide a baseline on it.
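
As a concrete illustration of the vanilla approach, the sketch below crops each box produced by a class-agnostic localizer, resizes the crop through CLIP's preprocessing, and scores it against an open vocabulary. This is a minimal sketch assuming the open_clip package; the (x1, y1, x2, y2) box format and the prompt template are illustrative choices, not the paper's exact recipe.

```python
# Vanilla OVD pipeline (sketch): crop each proposed box, resize, and
# classify the crop with CLIP against an open vocabulary. Assumes the
# open_clip package; boxes come from any class-agnostic localizer.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def classify_regions(image: Image.Image, boxes, vocabulary):
    """Score each (x1, y1, x2, y2) crop against every category name."""
    prompts = tokenizer([f"a photo of a {c}" for c in vocabulary])
    with torch.no_grad():
        text_emb = model.encode_text(prompts)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        # Crop and resize each region, then embed the batch of crops.
        crops = torch.stack([preprocess(image.crop(b)) for b in boxes])
        img_emb = model.encode_image(crops)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        # Cosine similarity -> per-region distribution over the vocabulary.
        return (100.0 * img_emb @ text_emb.T).softmax(dim=-1)
```

The DRR variant described above replaces this crop-and-resize step with feature pooling: RoIAlign extracts a fixed-size feature for each region directly from the backbone feature map, so objects are never resized in pixel space. A minimal sketch using torchvision's roi_align, where the feature shape and the stride of 16 are illustrative assumptions:

```python
# DRR-style region feature extraction (sketch). Instead of cropping and
# resizing pixels, pool each region directly from a backbone feature map.
import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 64, 64)             # backbone feature map (N, C, H, W)
rois = torch.tensor([[0., 32., 48., 400., 360.]])  # (batch_index, x1, y1, x2, y2) in pixels
region_feats = roi_align(features, rois, output_size=(7, 7),
                         spatial_scale=1.0 / 16)   # -> (num_rois, 256, 7, 7)
```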
Related papers
- Global-Local Collaborative Inference with LLM for Lidar-Based Open-Vocabulary Detection [44.92009038111696]
Open-Vocabulary Detection (OVD) is the task of detecting all objects of interest in a given scene without predefined object classes.
We propose a Global-Local Collaborative Scheme (GLIS) for the lidar-based OVD task.
With the global-local information, a Large Language Model (LLM) is applied for chain-of-thought inference.
arXiv Detail & Related papers (2024-07-12T02:34:11Z) - Find n' Propagate: Open-Vocabulary 3D Object Detection in Urban Environments [67.83787474506073]
We tackle the limitations of current LiDAR-based 3D object detection systems.
We introduce a universal Find n' Propagate approach for 3D OV tasks.
We achieve up to a 3.97-fold increase in Average Precision (AP) for novel object classes.
arXiv Detail & Related papers (2023-07-24T14:06:54Z) - Described Object Detection: Liberating Object Detection with Flexible Expressions [19.392927971139652]
In this paper, we advance Open-Vocabulary object Detection (OVD) and Referring Expression Comprehension (REC) to a more practical setting called Described Object Detection (DOD) by expanding category names to flexible language expressions.
The accompanying dataset features flexible language expressions, from short category names to long descriptions, and annotates all described objects in all images without omission.
arXiv Detail & Related papers (2023-07-24T14:06:54Z) - Open-Vocabulary Point-Cloud Object Detection without 3D Annotation [62.18197846270103]
The goal of open-vocabulary 3D point-cloud detection is to identify novel objects based on arbitrary textual descriptions.
We develop a point-cloud detector that can learn a general representation for localizing various objects.
We also propose a novel de-biased triplet cross-modal contrastive learning to connect the modalities of image, point-cloud and text.
arXiv Detail & Related papers (2023-04-03T08:22:02Z) - CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching [36.31910430275781]
We propose a framework that adapts CLIP for Open-vocabulary detection by Region prompting and Anchor pre-matching.
CORA achieves 41.7 AP50 on the COCO OVD benchmark, and 28.1 box APr on the LVIS OVD benchmark.
arXiv Detail & Related papers (2023-03-23T07:13:57Z) - Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection [76.5120397167247]
We present an open-set object detector, called Grounding DINO, by marrying the Transformer-based detector DINO with grounded pre-training.
The key to open-set object detection is introducing language into a closed-set detector for open-set concept generalization.
Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g.
arXiv Detail & Related papers (2023-03-09T18:52:16Z) - F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models [54.21757555804668]
We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models.
F-VLM simplifies the current multi-stage training pipeline by eliminating the need for knowledge distillation or detection-tailored pretraining.
arXiv Detail & Related papers (2022-09-30T17:59:52Z) - ProposalContrast: Unsupervised Pre-training for LiDAR-based 3D Object Detection [114.54835359657707]
ProposalContrast is an unsupervised point cloud pre-training framework.
It learns robust 3D representations by contrasting region proposals.
ProposalContrast is verified on various 3D detectors.
arXiv Detail & Related papers (2022-07-26T04:45:49Z) - Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection [54.96069171726668]
Two popular forms of weak supervision used in open-vocabulary detection (OVD) are the pretrained CLIP model and image-level supervision.
We propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model.
We establish a bridge between the above two object-alignment strategies via a novel weight transfer function.
arXiv Detail & Related papers (2022-07-07T17:59:56Z)