CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting
and Anchor Pre-Matching
- URL: http://arxiv.org/abs/2303.13076v1
- Date: Thu, 23 Mar 2023 07:13:57 GMT
- Title: CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting
and Anchor Pre-Matching
- Authors: Xiaoshi Wu, Feng Zhu, Rui Zhao, Hongsheng Li
- Abstract summary: We propose a framework that adapts CLIP for Open-vocabulary detection by Region prompting and Anchor pre-matching.
CORA achieves 41.7 AP50 on the COCO OVD benchmark; with additional pseudo-labels, CORA$^+$ achieves 43.1 AP50 on COCO and 28.1 box APr on the LVIS OVD benchmark.
- Score: 36.31910430275781
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-vocabulary detection (OVD) is an object detection task aiming at
detecting objects from novel categories beyond the base categories on which the
detector is trained. Recent OVD methods rely on large-scale visual-language
pre-trained models, such as CLIP, for recognizing novel objects. We identify
the two core obstacles that need to be tackled when incorporating these models
into detector training: (1) the distribution mismatch that happens when
applying a VL-model trained on whole images to region recognition tasks; (2)
the difficulty of localizing objects of unseen classes. To overcome these
obstacles, we propose CORA, a DETR-style framework that adapts CLIP for
Open-vocabulary detection by Region prompting and Anchor pre-matching. Region
prompting mitigates the whole-to-region distribution gap by prompting the
region features of the CLIP-based region classifier. Anchor pre-matching helps
learn generalizable object localization via a class-aware matching mechanism.
We evaluate CORA on the COCO OVD benchmark, where we achieve 41.7 AP50 on novel
classes, which outperforms the previous SOTA by 2.4 AP50 even without resorting
to extra training data. When extra training data is available, we train
CORA$^+$ on both ground-truth base-category annotations and additional pseudo
bounding box labels computed by CORA. CORA$^+$ achieves 43.1 AP50 on the COCO
OVD benchmark and 28.1 box APr on the LVIS OVD benchmark.
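The abstract names two mechanisms, region prompting and anchor pre-matching. The Python sketch below illustrates one plausible way they could fit together; the module names, tensor shapes, the element-wise prompt, and the argmax pre-matching rule are illustrative assumptions, not the exact CORA implementation.

```python
# Hypothetical sketch of the two ideas described in the abstract:
# (1) region prompting: add a learnable prompt to pooled region features before
#     CLIP-style text matching, to narrow the whole-image-to-region gap;
# (2) anchor pre-matching: assign each anchor box to a candidate category up
#     front, so localization can be conditioned on that class.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionPromptedClassifier(nn.Module):
    def __init__(self, feat_dim: int, roi_size: int = 7):
        super().__init__()
        # Learnable prompt added element-wise to pooled region features (assumption).
        self.region_prompt = nn.Parameter(torch.zeros(feat_dim, roi_size, roi_size))

    def forward(self, roi_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # roi_feats: (num_regions, feat_dim, roi_size, roi_size) from a frozen CLIP image encoder
        # text_embeds: (num_classes, feat_dim) CLIP text embeddings of category names
        prompted = roi_feats + self.region_prompt            # region prompting
        region_embeds = prompted.mean(dim=(2, 3))            # pool to (num_regions, feat_dim)
        region_embeds = F.normalize(region_embeds, dim=-1)
        text_embeds = F.normalize(text_embeds, dim=-1)
        return region_embeds @ text_embeds.t()               # class logits per region


def anchor_pre_match(anchor_logits: torch.Tensor) -> torch.Tensor:
    # anchor_logits: (num_anchors, num_classes) CLIP scores of anchor boxes.
    # Each anchor is pre-matched to its highest-scoring category (simplified
    # stand-in for the paper's class-aware matching mechanism).
    return anchor_logits.argmax(dim=-1)                      # (num_anchors,) class index per anchor


if __name__ == "__main__":
    torch.manual_seed(0)
    clf = RegionPromptedClassifier(feat_dim=512)
    roi_feats = torch.randn(100, 512, 7, 7)                  # e.g. 100 anchor/RoI features
    text_embeds = torch.randn(65, 512)                       # e.g. 65 category name embeddings
    logits = clf(roi_feats, text_embeds)
    matched = anchor_pre_match(logits)
    print(logits.shape, matched.shape)                       # torch.Size([100, 65]) torch.Size([100])
```

In the paper, pre-matching pairs anchors with candidate classes before box decoding; the sketch compresses that step to a single argmax over CLIP scores for brevity.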
Related papers
- Region-centric Image-Language Pretraining for Open-Vocabulary Detection [39.17829005627821]
We present a new open-vocabulary detection approach based on region-centric image-language pretraining.
At the pretraining phase, we incorporate the detector architecture on top of the classification backbone.
Our approach is a simple yet effective extension of the contrastive learning method to learn emergent object-semantic cues.
arXiv Detail & Related papers (2023-09-29T21:56:37Z)
- ECEA: Extensible Co-Existing Attention for Few-Shot Object Detection [52.16237548064387]
Few-shot object detection (FSOD) identifies objects from extremely few annotated samples.
Most recent FSOD methods apply the two-stage learning paradigm, which transfers knowledge learned from abundant base classes to assist the few-shot detector by learning global features.
We propose an Extensible Co-Existing Attention (ECEA) module to enable the model to infer the global object according to the local parts.
arXiv Detail & Related papers (2023-09-15T06:55:43Z)
- EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment [28.983503845298824]
We propose Early Dense Alignment (EDA) to bridge the gap between generalizable local semantics and object-level prediction.
In EDA, we use object-level supervision to learn dense-level rather than object-level alignment, so as to maintain local fine-grained semantics.
arXiv Detail & Related papers (2023-09-03T12:04:14Z)
- What Makes Good Open-Vocabulary Detector: A Disassembling Perspective [6.623703413255309]
Open-vocabulary detection (OVD) is a new object detection paradigm, aiming to localize and recognize unseen objects defined by an unbounded vocabulary.
Previous works mainly focus on the open vocabulary classification part, with less attention on the localization part.
We show in this work that improved localization and cross-modal classification complement each other and jointly compose a good OVD detector.
arXiv Detail & Related papers (2023-09-01T03:03:50Z)
- F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models [54.21757555804668]
We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models.
F-VLM simplifies the current multi-stage training pipeline by eliminating the need for knowledge distillation or detection-tailored pretraining.
arXiv Detail & Related papers (2022-09-30T17:59:52Z)
- Refine and Represent: Region-to-Object Representation Learning [55.70715883351945]
We present Region-to-Object Representation Learning (R2O) which unifies region-based and object-centric pretraining.
R2O operates by training an encoder to dynamically refine region-based segments into object-centric masks.
After pretraining on ImageNet, R2O models are able to surpass existing state-of-the-art in unsupervised object segmentation.
arXiv Detail & Related papers (2022-08-25T01:44:28Z)
- Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection [54.96069171726668]
Two popular forms of weak supervision used in open-vocabulary detection (OVD) are a pretrained CLIP model and image-level supervision.
We propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model.
We establish a bridge between the above two object-alignment strategies via a novel weight transfer function.
arXiv Detail & Related papers (2022-07-07T17:59:56Z)
- Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection [85.53263670166304]
One-stage detectors formulate object detection as dense classification and localization.
A recent trend for one-stage detectors is to introduce an individual prediction branch to estimate localization quality.
This paper delves into the representations of the above three fundamental elements: quality estimation, classification and localization.
arXiv Detail & Related papers (2020-06-08T07:24:33Z)