Region-Aware Pretraining for Open-Vocabulary Object Detection with
Vision Transformers
- URL: http://arxiv.org/abs/2305.07011v4
- Date: Mon, 28 Aug 2023 07:29:03 GMT
- Title: Region-Aware Pretraining for Open-Vocabulary Object Detection with
Vision Transformers
- Authors: Dahun Kim, Anelia Angelova, Weicheng Kuo
- Abstract summary: Region-aware Open-vocabulary Vision Transformers (RO-ViT)
We present a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection.
- Score: 44.03247177599605
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) - a
contrastive image-text pretraining recipe to bridge the gap between image-level
pretraining and open-vocabulary object detection. At the pretraining phase, we
propose to randomly crop and resize regions of positional embeddings instead of
using the whole image positional embeddings. This better matches the use of
positional embeddings at region-level in the detection finetuning phase. In
addition, we replace the common softmax cross entropy loss in contrastive
learning with focal loss to better learn the informative yet difficult
examples. Finally, we leverage recent advances in novel object proposals to
improve open-vocabulary detection finetuning. We evaluate our full model on the
LVIS and COCO open-vocabulary detection benchmarks and zero-shot transfer.
RO-ViT achieves a state-of-the-art 34.1 $AP_r$ on LVIS, surpassing the best
existing approach by +7.8 points in addition to competitive zero-shot transfer
detection. Surprisingly, RO-ViT improves the image-level representation as well
and achieves the state of the art on 9 out of 12 metrics on COCO and Flickr
image-text retrieval benchmarks, outperforming competitive approaches with
larger models.
Related papers
- Region-centric Image-Language Pretraining for Open-Vocabulary Detection [39.17829005627821]
We present a new open-vocabulary detection approach based on region-centric image-language pretraining.
At the pretraining phase, we incorporate the detector architecture on top of the classification backbone.
Our approach is a simple yet effective extension of the contrastive learning method to learn emergent object-semantic cues.
arXiv Detail & Related papers (2023-09-29T21:56:37Z) - Contrastive Feature Masking Open-Vocabulary Vision Transformer [44.03247177599605]
Contrastive Feature Masking Vision Transformer (CFM-ViT)
An image-text pretraining methodology that achieves simultaneous learning of image- and region-level representation for open-vocabulary object detection (OVD)
arXiv Detail & Related papers (2023-09-02T01:12:48Z) - DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via
Word-Region Alignment [104.54362490182335]
DetCLIPv2 is an efficient training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection.
DetCLIPv2 directly learns the fine-grained word-region alignment from massive image-text pairs in an end-to-end manner.
With 13M image-text pairs for pre-training, DetCLIPv2 demonstrates superior open-vocabulary detection performance.
arXiv Detail & Related papers (2023-04-10T11:08:15Z) - Bridging the Gap between Object and Image-level Representations for
Open-Vocabulary Detection [54.96069171726668]
Two popular forms of weak-supervision used in open-vocabulary detection (OVD) include pretrained CLIP model and image-level supervision.
We propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model.
We establish a bridge between the above two object-alignment strategies via a novel weight transfer function.
arXiv Detail & Related papers (2022-07-07T17:59:56Z) - Simple Open-Vocabulary Object Detection with Vision Transformers [51.57562920090721]
We propose a strong recipe for transferring image-text models to open-vocabulary object detection.
We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning.
We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection.
arXiv Detail & Related papers (2022-05-12T17:20:36Z) - Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with
Transformers [115.90778814368703]
Our objective is language-based search of large-scale image and video datasets.
For this task, the approach that consists of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive as retrieval scales.
An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings.
arXiv Detail & Related papers (2021-03-30T17:57:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.