MVP: Multimodality-guided Visual Pre-training
- URL: http://arxiv.org/abs/2203.05175v1
- Date: Thu, 10 Mar 2022 06:11:20 GMT
- Title: MVP: Multimodality-guided Visual Pre-training
- Authors: Longhui Wei, Lingxi Xie, Wengang Zhou, Houqiang Li, Qi Tian
- Abstract summary: Masked image modeling (MIM) has become a promising direction for visual pre-training.
In this paper, we introduce guidance from other modalities and validate that such additional knowledge leads to impressive gains for visual pre-training.
The proposed approach is named Multimodality-guided Visual Pre-training (MVP), in which we replace the tokenizer with the vision branch of CLIP, a vision-language model pre-trained on 400 million image-text pairs.
- Score: 215.11351064601303
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, masked image modeling (MIM) has become a promising direction for
visual pre-training. In the context of vision transformers, MIM learns
effective visual representation by aligning the token-level features with a
pre-defined space (e.g., BEIT used a d-VAE trained on a large image corpus as
the tokenizer). In this paper, we go one step further by introducing guidance
from other modalities and validating that such additional knowledge leads to
impressive gains for visual pre-training. The proposed approach is named
Multimodality-guided Visual Pre-training (MVP), in which we replace the
tokenizer with the vision branch of CLIP, a vision-language model pre-trained
on 400 million image-text pairs. We demonstrate the effectiveness of MVP by
performing standard experiments, i.e., pre-training the ViT models on ImageNet
and fine-tuning them on a series of downstream visual recognition tasks. In
particular, pre-training ViT-Base/16 for 300 epochs, MVP reports a 52.4% mIoU
on ADE20K, surpassing BEIT (the baseline and previous state-of-the-art) with an
impressive margin of 6.8%.
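To make the objective above concrete, below is a minimal sketch of a BEiT/MVP-style masked image modeling step in which a masked student ViT regresses its token-level features onto those of a frozen teacher network standing in for the CLIP vision branch. The module names, toy dimensions, masking ratio, and cosine-alignment loss are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of an MVP/BEiT-style masked image modeling step, assuming a
# generic frozen "teacher" stands in for the CLIP vision branch; names,
# dimensions, and the loss choice are illustrative, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyViT(nn.Module):
    """A toy patch-token encoder standing in for ViT-Base/16."""
    def __init__(self, num_patches=196, dim=768, depth=2, heads=8):
        super().__init__()
        self.patch_embed = nn.Linear(16 * 16 * 3, dim)           # flattened 16x16 RGB patches
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))   # learned [MASK] embedding
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, patches, mask):
        # patches: (B, N, 768 patch pixels); mask: (B, N) bool, True = masked
        x = self.patch_embed(patches)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        return self.blocks(x + self.pos_embed)                   # (B, N, dim)

def mvp_style_loss(student, teacher, patches, mask):
    """Align student features at masked positions with frozen teacher features."""
    with torch.no_grad():
        target = teacher(patches, torch.zeros_like(mask))        # teacher sees the full image
    pred = student(patches, mask)
    cos = F.cosine_similarity(pred, target, dim=-1)              # one plausible alignment objective
    return (1 - cos)[mask].mean()

# Usage with random data in place of ImageNet images.
student, teacher = TinyViT(), TinyViT()
teacher.requires_grad_(False)
patches = torch.randn(4, 196, 16 * 16 * 3)
mask = torch.rand(4, 196) < 0.4                                  # roughly 40% of patches masked
loss = mvp_style_loss(student, teacher, patches, mask)
loss.backward()
```

Viewed this way, the only change relative to a BEiT-style setup is the target space: continuous features from a frozen, multimodally pre-trained teacher rather than discrete d-VAE token ids.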
Related papers
- Rethinking Visual Prompt Learning as Masked Visual Token Modeling [106.71983630652323]
We propose Visual Prompt learning as masked visual Token Modeling (VPTM), which transforms downstream visual classification into the pre-trained masked visual token prediction task.
VPTM is the first visual prompt method built on a generative pre-trained visual model; it achieves consistency between pre-training and downstream visual classification through task reformulation.
arXiv Detail & Related papers (2023-03-09T02:43:10Z)
- EVA: Exploring the Limits of Masked Visual Representation Learning at Scale [46.952339726872374]
We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale.
EVA is a vanilla ViT pre-trained to reconstruct the masked-out, image-text-aligned vision features conditioned on the visible image patches.
We find that initializing the vision tower of a giant CLIP from EVA greatly stabilizes training and outperforms the trained-from-scratch counterpart with far fewer samples and less compute.
arXiv Detail & Related papers (2022-11-14T18:59:52Z)
- DeiT III: Revenge of the ViT [56.46810490275699]
A Vision Transformer (ViT) is a simple neural architecture that can serve several computer vision tasks.
Recent works show that ViTs benefit from self-supervised pre-training, in particular BERT-like pre-training such as BEiT.
arXiv Detail & Related papers (2022-04-14T17:13:44Z)
- mc-BEiT: Multi-choice Discretization for Image BERT Pre-training [52.04866462439979]
Image BERT pre-training with masked image modeling (MIM) is a popular approach to self-supervised representation learning.
We introduce an improved BERT-style image pre-training method, namely mc-BEiT, which performs MIM proxy tasks towards eased and refined multi-choice training objectives.
arXiv Detail & Related papers (2022-03-29T09:08:18Z)
- VC-GPT: Visual Conditioned GPT for End-to-End Generative Vision-and-Language Pre-training [9.511101155155957]
Vision-and-language pre-training models (VLMs) have achieved tremendous success in the cross-modal area, but most of them require millions of parallel image-caption pairs for pre-training.
In this work, we focus on reducing this need for generative vision-and-language pre-training by taking advantage of a visual pre-trained model (CLIP-ViT) as the encoder and a language pre-trained model (GPT2) as the decoder.
arXiv Detail & Related papers (2022-01-30T04:44:54Z)
- PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers [102.7922200135147]
This paper explores a better codebook for BERT pre-training of vision transformers.
Unlike the discrete tokens in the NLP field, which are naturally highly semantic, the visual tokens produced by a standard tokenizer carry little semantic meaning.
We demonstrate that the visual tokens generated by the proposed perceptual codebook do exhibit better semantic meanings.
arXiv Detail & Related papers (2021-11-24T18:59:58Z)
- Vector-quantized Image Modeling with Improved VQGAN [93.8443646643864]
We propose a Vector-quantized Image Modeling approach that involves pretraining a Transformer to predict image tokens autoregressively.
We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity.
When trained on ImageNet at 256x256 resolution, we achieve Inception Score (IS) of 175.1 and Frechet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN.
arXiv Detail & Related papers (2021-10-09T18:36:00Z)
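As a rough illustration of the autoregressive second stage described in the entry above, the sketch below trains a small causal Transformer to predict quantized image tokens left to right with teacher forcing. The vocabulary size, sequence length, and model sizes are placeholder assumptions rather than the paper's configuration, and random integers stand in for codes produced by a VQGAN-style encoder.

```python
# Toy decoder-only Transformer over discrete image codes; sizes are placeholders,
# and random integers stand in for codes from a VQGAN-style encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageTokenGPT(nn.Module):
    def __init__(self, vocab=1024, seq_len=256, dim=256, depth=2, heads=8):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Parameter(torch.zeros(1, seq_len, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        # tokens: (B, T) integer codes; a causal mask blocks attention to future positions.
        T = tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        x = self.tok(tokens) + self.pos[:, :T]
        x = self.blocks(x, mask=causal)
        return self.head(x)                       # (B, T, vocab): next-token logits per position

# Teacher forcing: predict code t+1 from codes up to t.
model = ImageTokenGPT()
codes = torch.randint(0, 1024, (4, 256))          # stand-in for quantized image codes
logits = model(codes[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, 1024), codes[:, 1:].reshape(-1))
loss.backward()
```

At generation time, tokens would be sampled from these logits one position at a time and then decoded back to pixels by the VQGAN decoder.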
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.