DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via
Word-Region Alignment
- URL: http://arxiv.org/abs/2304.04514v1
- Date: Mon, 10 Apr 2023 11:08:15 GMT
- Title: DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via
Word-Region Alignment
- Authors: Lewei Yao, Jianhua Han, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li,
Hang Xu
- Abstract summary: DetCLIPv2 is an efficient training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection.
DetCLIPv2 directly learns the fine-grained word-region alignment from massive image-text pairs in an end-to-end manner.
With 13M image-text pairs for pre-training, DetCLIPv2 demonstrates superior open-vocabulary detection performance.
- Score: 104.54362490182335
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper presents DetCLIPv2, an efficient and scalable training framework
that incorporates large-scale image-text pairs to achieve open-vocabulary
object detection (OVD). Unlike previous OVD frameworks that typically rely on a
pre-trained vision-language model (e.g., CLIP) or exploit image-text pairs via
a pseudo labeling process, DetCLIPv2 directly learns the fine-grained
word-region alignment from massive image-text pairs in an end-to-end manner. To
accomplish this, we employ a maximum word-region similarity between region
proposals and textual words to guide the contrastive objective. To enable the
model to gain localization capability while learning broad concepts, DetCLIPv2
is trained with a hybrid supervision from detection, grounding and image-text
pair data under a unified data formulation. By jointly training with an
alternating scheme and adopting low-resolution input for image-text pairs,
DetCLIPv2 exploits image-text pair data efficiently and effectively: DetCLIPv2
utilizes 13X more image-text pairs than DetCLIP with a similar training time
and improves performance. With 13M image-text pairs for pre-training, DetCLIPv2
demonstrates superior open-vocabulary detection performance, e.g., DetCLIPv2
with Swin-T backbone achieves 40.4% zero-shot AP on the LVIS benchmark, which
outperforms previous works GLIP/GLIPv2/DetCLIP by 14.4/11.4/4.5% AP,
respectively, and even beats its fully-supervised counterpart by a large
margin.
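As a rough illustration of the maximum word-region similarity that guides the contrastive objective, the sketch below matches each word of a caption to its best-scoring region proposal, averages these maxima into a single image-text score, and applies a symmetric InfoNCE loss over a batch. This is a minimal PyTorch sketch; the tensor shapes, helper names, and temperature value are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a max word-region similarity contrastive objective,
# in the spirit of DetCLIPv2's word-region alignment. Shapes, names and
# the temperature are assumptions for illustration only.
import torch
import torch.nn.functional as F

def word_region_similarity(word_emb, region_emb):
    """Aggregate fine-grained similarities into one image-text score.

    word_emb:   (W, D) word embeddings for one caption
    region_emb: (R, D) region-proposal embeddings for one image
    """
    word_emb = F.normalize(word_emb, dim=-1)
    region_emb = F.normalize(region_emb, dim=-1)
    sim = word_emb @ region_emb.t()          # (W, R) cosine similarities
    max_per_word, _ = sim.max(dim=1)         # best-matching region per word
    return max_per_word.mean()               # average over words

def contrastive_loss(word_embs, region_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of (caption, image) pairs.

    word_embs:   list of (W_i, D) tensors, one per caption
    region_embs: list of (R_j, D) tensors, one per image
    """
    B = len(word_embs)
    logits = torch.stack([
        torch.stack([word_region_similarity(w, r) for r in region_embs])
        for w in word_embs
    ]) / temperature                          # (B, B) caption-to-image scores
    targets = torch.arange(B)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In this sketch, diagonal entries of the score matrix correspond to matched caption-image pairs, so the symmetric cross-entropy pulls matched word-region pairs together while pushing apart unmatched pairs in the batch.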
Related papers
- CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data [40.88256210436378]
We present a novel weakly supervised pre-training of vision models on web-scale image-text data.
The proposed method reframes pre-training on image-text data as a classification task.
It achieves a remarkable $2.7\times$ acceleration in training speed compared to contrastive learning on web-scale data.
arXiv Detail & Related papers (2024-04-24T05:13:28Z)
- ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [78.93535202851278]
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks.
The presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning.
We propose an Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic caption.
arXiv Detail & Related papers (2023-08-16T15:19:52Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- Variational Distribution Learning for Unsupervised Text-to-Image Generation [42.3246826401366]
We propose a text-to-image generation algorithm based on deep neural networks when text captions for images are unavailable during training.
We employ a pretrained CLIP model, which is capable of properly aligning embeddings of images and corresponding texts in a joint space.
We optimize a text-to-image generation model by maximizing the data log-likelihood conditioned on pairs of image-text CLIP embeddings.
arXiv Detail & Related papers (2023-03-28T16:18:56Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment [66.77841319057299]
We propose a novel unsupervised Vision-and-Language pre-training curriculum for non-parallel texts and images.
We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks.
A comprehensive ablation study shows each granularity is helpful to learn a stronger pre-trained model.
arXiv Detail & Related papers (2022-03-01T05:34:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.