Region-centric Image-Language Pretraining for Open-Vocabulary Detection
- URL: http://arxiv.org/abs/2310.00161v2
- Date: Fri, 19 Jul 2024 02:11:04 GMT
- Title: Region-centric Image-Language Pretraining for Open-Vocabulary Detection
- Authors: Dahun Kim, Anelia Angelova, Weicheng Kuo
- Abstract summary: We present a new open-vocabulary detection approach based on region-centric image-language pretraining.
At the pretraining phase, we incorporate the detector architecture on top of the classification backbone.
Our approach is a simple yet effective extension of the contrastive learning method to learn emergent object-semantic cues.
- Score: 39.17829005627821
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a new open-vocabulary detection approach based on region-centric image-language pretraining to bridge the gap between image-level pretraining and open-vocabulary object detection. At the pretraining phase, we incorporate the detector architecture on top of the classification backbone, which better serves the region-level recognition needs of detection by enabling the detector heads to learn from large-scale image-text pairs. Using only standard contrastive loss and no pseudo-labeling, our approach is a simple yet effective extension of the contrastive learning method to learn emergent object-semantic cues. In addition, we propose a shifted-window learning approach upon window attention to make the backbone representation more robust, translation-invariant, and less biased by the window pattern. On the popular LVIS open-vocabulary detection benchmark, our approach sets a new state of the art of 37.6 mask APr using the common ViT-L backbone and public LAION dataset, and 40.5 mask APr using the DataComp-1B dataset, significantly outperforming the best existing approach by +3.7 mask APr at system level. On the COCO benchmark, we achieve very competitive 39.6 novel AP without pseudo labeling or weak supervision. In addition, we evaluate our approach on the transfer detection setup, where it demonstrates notable improvement over the baseline. Visualization reveals emerging object locality from the pretraining recipes compared to the baseline.
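The "standard contrastive loss" mentioned in the abstract is usually a symmetric image-text InfoNCE objective. The following is a minimal, dependency-free sketch of that generic loss, not the authors' implementation: the function names, the temperature value of 0.07, and the toy embeddings are assumptions made for illustration, and the abstract does not specify how the detector heads produce the image-side embedding.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over a list of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; pair i of image_embs/text_embs is the positive.

    `temperature` is an assumed hyperparameter, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: temperature-scaled cosine similarity of image i and text j.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: each image should pick its own caption.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: each caption should pick its own image.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy batch: two orthogonal embeddings, first correctly paired, then swapped.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the correctly paired embeddings give a loss near zero, while swapping the captions gives a large loss, which is the behavior the objective relies on when learning from image-text pairs.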
Related papers
- ACTRESS: Active Retraining for Semi-supervised Visual Grounding [52.08834188447851]
A previous study, RefTeacher, makes the first attempt to tackle this task by adopting the teacher-student framework to provide pseudo confidence supervision and attention-based supervision.
This approach is incompatible with current state-of-the-art visual grounding models, which follow the Transformer-based pipeline.
Our paper proposes the ACTive REtraining approach for Semi-Supervised Visual Grounding, abbreviated as ACTRESS.
arXiv Detail & Related papers (2024-07-03T16:33:31Z)
- Simple Image-level Classification Improves Open-vocabulary Object Detection [27.131298903486474]
Open-Vocabulary Object Detection (OVOD) aims to detect novel objects beyond a given set of base categories on which the detection model is trained.
Recent OVOD methods focus on adapting the image-level pre-trained vision-language models (VLMs), such as CLIP, to a region-level object detection task via, e.g., region-level knowledge distillation, regional prompt learning, or region-text pre-training.
These methods have demonstrated remarkable performance in recognizing regional visual concepts, but they are weak in exploiting the VLMs' powerful global scene understanding ability learned from the billion-scale
arXiv Detail & Related papers (2023-12-16T13:06:15Z)
- Contrastive Feature Masking Open-Vocabulary Vision Transformer [44.03247177599605]
Contrastive Feature Masking Vision Transformer (CFM-ViT) is an image-text pretraining methodology that achieves simultaneous learning of image- and region-level representation for open-vocabulary object detection (OVD).
arXiv Detail & Related papers (2023-09-02T01:12:48Z)
- Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers [44.03247177599605]
We present Region-aware Open-vocabulary Vision Transformers (RO-ViT), a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection.
arXiv Detail & Related papers (2023-05-11T17:53:29Z)
- Three ways to improve feature alignment for open vocabulary detection [88.65076922242184]
A key problem in zero-shot open vocabulary detection is how to align visual and text features so that the detector performs well on unseen classes.
Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining.
We propose three methods to alleviate these issues. Firstly, a simple scheme is used to augment the text embeddings, which prevents overfitting to a small number of classes seen during training.
Secondly, the feature pyramid network and the detection head are modified to include trainable shortcuts.
Finally, a self-training approach is used to leverage a larger corpus of
arXiv Detail & Related papers (2023-03-23T17:59:53Z)
- Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection [54.96069171726668]
Two popular forms of weak supervision used in open-vocabulary detection (OVD) are the pretrained CLIP model and image-level supervision.
We propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model.
We establish a bridge between the above two object-alignment strategies via a novel weight transfer function.
arXiv Detail & Related papers (2022-07-07T17:59:56Z)
- A Simple Long-Tailed Recognition Baseline via Vision-Language Model [92.2866546058082]
The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems.
Recent advances in contrastive visual-language pretraining shed light on a new pathway for visual recognition.
We propose BALLAD to leverage contrastive vision-language models for long-tailed recognition.
arXiv Detail & Related papers (2021-11-29T17:49:24Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)