BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers
- URL: http://arxiv.org/abs/2208.06366v1
- Date: Fri, 12 Aug 2022 16:48:10 GMT
- Title: BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers
- Authors: Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, Furu Wei
- Abstract summary: We propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction.
We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches.
Experiments on image classification and semantic segmentation show that our approach outperforms all compared MIM methods.
- Score: 117.79456335844439
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked image modeling (MIM) has demonstrated impressive results in
self-supervised representation learning by recovering corrupted image patches.
However, most methods still operate on low-level image pixels, which hinders
the exploitation of high-level semantics for representation models. In this
study, we propose to use a semantic-rich visual tokenizer as the reconstruction
target for masked prediction, providing a systematic way to promote MIM from
pixel-level to semantic-level. Specifically, we introduce vector-quantized
knowledge distillation to train the tokenizer, which discretizes a continuous
semantic space to compact codes. We then pretrain vision Transformers by
predicting the original visual tokens for the masked image patches. Moreover,
we encourage the model to explicitly aggregate patch information into a global
image representation, which facilitates linear probing. Experiments on image
classification and semantic segmentation show that our approach outperforms all
compared MIM methods. On ImageNet-1K (224 size), the base-size BEiT v2 achieves
85.5% top-1 accuracy for fine-tuning and 80.1% top-1 accuracy for linear
probing. The large-size BEiT v2 obtains 87.3% top-1 accuracy for ImageNet-1K
(224 size) fine-tuning, and 56.7% mIoU on ADE20K for semantic segmentation. The
code and pretrained models are available at https://aka.ms/beit.
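The abstract describes the pretraining objective at a high level: a frozen vector-quantized tokenizer assigns a discrete code to every image patch, and the vision Transformer is trained to predict the codes of the masked patches. The following is a minimal sketch of that masked-token-prediction loss; it is not the released implementation (see https://aka.ms/beit), and the toy tokenizer, module names, depths, and dimensions are illustrative assumptions.

```python
# Hypothetical sketch of masked visual-token prediction; not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 8192              # codebook size of the visual tokenizer (assumed)
DIM, PATCHES = 768, 196   # ViT-Base width, 14x14 patches for a 224x224 image
PATCH_PIX = 16 * 16 * 3   # flattened pixels per 16x16 RGB patch


class ToyTokenizer(nn.Module):
    """Stand-in for the frozen vector-quantized tokenizer: patch -> code id."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(PATCH_PIX, DIM)
        self.codebook = nn.Parameter(torch.randn(VOCAB, DIM), requires_grad=False)

    @torch.no_grad()
    def forward(self, patches):                    # (B, N, PATCH_PIX)
        z = F.normalize(self.proj(patches), dim=-1)
        codes = F.normalize(self.codebook, dim=-1)
        return (z @ codes.t()).argmax(dim=-1)      # nearest code id per patch


class MaskedTokenPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.patch_embed = nn.Linear(PATCH_PIX, DIM)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, DIM))
        layer = nn.TransformerEncoderLayer(DIM, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)   # toy depth
        self.head = nn.Linear(DIM, VOCAB)          # classifies the visual token id

    def forward(self, patches, mask, target_ids):
        x = self.patch_embed(patches)
        # Replace masked patch embeddings with a learnable [MASK] token.
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        logits = self.head(self.encoder(x))
        # Loss only on masked positions: recover the original visual tokens.
        return F.cross_entropy(logits[mask], target_ids[mask])


patches = torch.randn(2, PATCHES, PATCH_PIX)       # dummy patchified batch
mask = torch.rand(2, PATCHES) < 0.4                # mask roughly 40% of patches
loss = MaskedTokenPredictor()(patches, mask, ToyTokenizer()(patches))
loss.backward()
```

Only the masked positions contribute to the loss, so the supervision comes from the tokenizer's discrete codes rather than raw pixels, which is what lifts the target from pixel level to semantic level.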
Related papers
- CAE v2: Context Autoencoder with CLIP Target [63.61868058214267]
Masked image modeling (MIM) learns visual representation by masking and reconstructing image patches.
Applying the reconstruction supervision on the CLIP representation has been proven effective for MIM.
To investigate strategies for refining the CLIP-targeted MIM, we study two critical elements in MIM, i.e., the supervision position and the mask ratio.
arXiv Detail & Related papers (2022-11-17T18:58:33Z)
- A Unified View of Masked Image Modeling [117.79456335844439]
Masked image modeling has demonstrated great potential to eliminate the label-hungry problem of training large-scale vision Transformers.
We introduce a simple yet effective method, termed as MaskDistill, which reconstructs normalized semantic features from teacher models at the masked positions.
Experimental results on image classification and semantic segmentation show that MaskDistill achieves comparable or superior performance than state-of-the-art methods.
arXiv Detail & Related papers (2022-10-19T14:59:18Z) - MILAN: Masked Image Pretraining on Language Assisted Representation [30.24762638226569]
In this work, we propose masked image pretraining on language assisted representation, dubbed as MILAN.
Instead of predicting raw pixels or low level features, our pretraining objective is to reconstruct the image features with substantial semantic signals.
Experimental results demonstrate that MILAN delivers higher accuracy than the previous works.
arXiv Detail & Related papers (2022-08-11T21:58:36Z)
- mc-BEiT: Multi-choice Discretization for Image BERT Pre-training [52.04866462439979]
Image BERT pre-training with masked image modeling (MIM) is a popular approach to self-supervised representation learning.
We introduce an improved BERT-style image pre-training method, namely mc-BEiT, which performs MIM proxy tasks towards eased and refined multi-choice training objectives.
arXiv Detail & Related papers (2022-03-29T09:08:18Z)
- SimMIM: A Simple Framework for Masked Image Modeling [29.015777125540613]
This paper presents SimMIM, a simple framework for masked image modeling.
We study the major components in our framework and find that simple designs for each component yield very strong representation learning performance.
We also leverage this approach to facilitate the training of a 3B model, which, using $40\times$ less data than previous practice, achieves the state of the art on four representative vision benchmarks.
arXiv Detail & Related papers (2021-11-18T18:59:45Z)
- iBOT: Image BERT Pre-Training with Online Tokenizer [23.997853010642046]
We study masked image modeling (MIM) and indicate the advantages and challenges of using a semantically meaningful visual tokenizer.
We present a self-supervised framework iBOT that can perform masked prediction with an online tokenizer.
We show the prominence of iBOT by achieving an 81.6% linear probing accuracy and an 86.3% fine-tuning accuracy evaluated on ImageNet-1K.
arXiv Detail & Related papers (2021-11-15T15:18:05Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels (a minimal sketch of this idea follows after the list of related papers below).
Coupling these two designs enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
- BEiT: BERT Pre-Training of Image Transformers [43.704968112586876]
We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers.
Specifically, each image has two views in our pre-training, i.e., image patches, and visual tokens.
We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and fed them into the backbone Transformer.
The pre-training objective is to recover the original visual tokens based on the corrupted image patches.
arXiv Detail & Related papers (2021-06-15T16:02:37Z)
- Seed the Views: Hierarchical Semantic Alignment for Contrastive Representation Learning [116.91819311885166]
We propose a hierarchical semantic alignment strategy via expanding the views generated by a single image to cross-samples and multi-level representation.
Our method, termed CsMl, integrates multi-level visual representations across samples in a robust way.
arXiv Detail & Related papers (2020-12-04T17:26:24Z)
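The MAE entry above summarizes its recipe as masking random patches and reconstructing the missing pixels, with the encoder operating only on the visible patches. Below is a minimal, hypothetical sketch of that idea, not the official MAE release; the tiny encoder/decoder, the dimensions, and the omission of positional embeddings are simplifying assumptions.

```python
# Hypothetical sketch of MAE-style pixel reconstruction; not the official code.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, PATCHES, PATCH_PIX, RATIO = 256, 196, 16 * 16 * 3, 0.75  # 75% mask ratio


class ToyMAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(PATCH_PIX, DIM)
        enc = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, DIM))
        dec = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec, num_layers=1)
        self.to_pixels = nn.Linear(DIM, PATCH_PIX)

    def forward(self, patches):                    # (B, N, PATCH_PIX)
        B, N, _ = patches.shape
        n_keep = int(N * (1 - RATIO))
        # Per-sample random shuffle; the first n_keep indices stay visible.
        shuffle = torch.rand(B, N).argsort(dim=1)
        visible, masked = shuffle[:, :n_keep], shuffle[:, n_keep:]

        # Encoder sees only visible patches (positional embeddings omitted here).
        vis_idx = visible.unsqueeze(-1).expand(-1, -1, PATCH_PIX)
        enc_out = self.encoder(self.embed(torch.gather(patches, 1, vis_idx)))

        # Decoder gets encoded visible tokens plus mask tokens for the rest.
        mask_tok = self.mask_token.expand(B, N - n_keep, DIM)
        pred = self.to_pixels(self.decoder(torch.cat([enc_out, mask_tok], dim=1)))

        # MSE between predicted and true pixels of the masked patches only.
        tgt_idx = masked.unsqueeze(-1).expand(-1, -1, PATCH_PIX)
        return F.mse_loss(pred[:, n_keep:], torch.gather(patches, 1, tgt_idx))


loss = ToyMAE()(torch.randn(2, PATCHES, PATCH_PIX))
loss.backward()
```

The contrast with the BEiT-style sketch earlier is the target: MAE regresses raw pixels of the masked patches, whereas BEiT and BEiT v2 classify discrete visual tokens produced by a tokenizer.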