Contrastive Feature Masking Open-Vocabulary Vision Transformer
- URL: http://arxiv.org/abs/2309.00775v1
- Date: Sat, 2 Sep 2023 01:12:48 GMT
- Title: Contrastive Feature Masking Open-Vocabulary Vision Transformer
- Authors: Dahun Kim, Anelia Angelova, Weicheng Kuo
- Abstract summary: Contrastive Feature Masking Vision Transformer (CFM-ViT) is an image-text pretraining methodology that achieves simultaneous learning of image- and region-level representation for open-vocabulary object detection (OVD).
- Score: 44.03247177599605
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an
image-text pretraining methodology that achieves simultaneous learning of
image- and region-level representation for open-vocabulary object detection
(OVD). Our approach incorporates the masked autoencoder (MAE) objective into the
contrastive learning objective to improve the representation for localization
tasks. Unlike standard MAE, we perform reconstruction in the joint image-text
embedding space, rather than the pixel space as is customary with the classical
MAE method, which causes the model to better learn region-level semantics.
Moreover, we introduce Positional Embedding Dropout (PED) to address scale
variation between image-text pretraining and detection finetuning by randomly
dropping out the positional embeddings during pretraining. PED improves
detection performance and enables the use of a frozen ViT backbone as a region
classifier, preventing the forgetting of open-vocabulary knowledge during
detection finetuning. On the LVIS open-vocabulary detection benchmark, CFM-ViT
achieves a state-of-the-art 33.9 AP$_r$, surpassing the best approach by 7.6
points, and achieves better zero-shot detection transfer. Finally, CFM-ViT
acquires strong image-level representation, outperforming the state of the art
on 8 out of 12 metrics on zero-shot image-text retrieval benchmarks.
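Below is a minimal, hypothetical sketch of the two pretraining ideas described in the abstract: Positional Embedding Dropout (PED), which randomly drops the positional embeddings during pretraining, and MAE-style reconstruction supervised in the joint image-text embedding space rather than pixel space. This is not the authors' implementation; the module names, the dropout probability, and the cosine-distance form of the reconstruction loss are assumptions.
```python
# Hypothetical sketch of PED and embedding-space reconstruction (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class PositionalEmbeddingDropout(nn.Module):
    """Randomly drop the whole positional embedding during pretraining (PED)."""

    def __init__(self, p: float = 0.5):
        super().__init__()
        self.p = p  # probability of dropping the positional embeddings (assumed value)

    def forward(self, patch_tokens: torch.Tensor, pos_embed: torch.Tensor) -> torch.Tensor:
        if self.training and torch.rand(()).item() < self.p:
            return patch_tokens              # skip positional information for this batch
        return patch_tokens + pos_embed      # standard ViT behaviour


def embedding_space_reconstruction_loss(decoder_pred: torch.Tensor,
                                        target_features: torch.Tensor) -> torch.Tensor:
    """MAE-style reconstruction where the targets are features in the joint
    image-text embedding space rather than raw pixels.

    decoder_pred:    (B, N_masked, D) decoder outputs for the masked patches
    target_features: (B, N_masked, D) reference features in the image-text space
    """
    pred = F.normalize(decoder_pred, dim=-1)
    target = F.normalize(target_features, dim=-1).detach()
    # Cosine-distance reconstruction term (an assumed form); in the paper this is
    # combined with the usual contrastive image-text loss, which is omitted here.
    return (1.0 - (pred * target).sum(dim=-1)).mean()
```
On this reading, dropping the positional embeddings at pretraining time is what mitigates the scale gap between pretraining and detection finetuning and allows the frozen ViT backbone to be reused as a region classifier, as the abstract states.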
Related papers
- Boosting Open-Vocabulary Object Detection by Handling Background Samples [9.07525578809556]
We propose a novel approach to address the limitations of CLIP in handling background samples.
We introduce Partial Object Suppression (POS) to address the issue of misclassifying partial regions as foreground.
Our proposed model is capable of achieving performance enhancements across various open-vocabulary detectors.
arXiv Detail & Related papers (2024-10-11T09:15:50Z)
- Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models [16.4010094165575]
We propose a robust phrase grounding model trained with text-conditioned and text-unconditioned data augmentations.
Inspired by recent masked signal reconstruction, we propose to use pixel-level masking as a novel form of data augmentation.
Our method demonstrates advanced performance over the state-of-the-arts with various metrics.
arXiv Detail & Related papers (2023-11-05T01:14:02Z)
- CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction [67.43527289422978]
We propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs.
We achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks.
arXiv Detail & Related papers (2023-10-02T17:58:52Z)
- Region-centric Image-Language Pretraining for Open-Vocabulary Detection [39.17829005627821]
We present a new open-vocabulary detection approach based on region-centric image-language pretraining.
At the pretraining phase, we incorporate the detector architecture on top of the classification backbone.
Our approach is a simple yet effective extension of the contrastive learning method to learn emergent object-semantic cues.
arXiv Detail & Related papers (2023-09-29T21:56:37Z)
- Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting [133.55037976429088]
We investigate the adversarial robustness of vision transformers equipped with BERT pretraining (e.g., BEiT, MAE).
A surprising observation is that MAE has significantly worse adversarial robustness than other BERT pretraining methods.
We propose a simple yet effective way to boost the adversarial robustness of MAE.
arXiv Detail & Related papers (2023-08-20T16:27:17Z)
- Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers [44.03247177599605]
Region-aware Open-vocabulary Vision Transformers (RO-ViT) is a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection.
arXiv Detail & Related papers (2023-05-11T17:53:29Z)
- DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment [104.54362490182335]
DetCLIPv2 is an efficient training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection.
DetCLIPv2 directly learns the fine-grained word-region alignment from massive image-text pairs in an end-to-end manner.
With 13M image-text pairs for pre-training, DetCLIPv2 demonstrates superior open-vocabulary detection performance.
arXiv Detail & Related papers (2023-04-10T11:08:15Z)
- Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable VLM to obtain fine-grained alignment by using learnable text prompts to resolve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves the state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z)