BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers
- URL: http://arxiv.org/abs/2208.06366v1
- Date: Fri, 12 Aug 2022 16:48:10 GMT
- Title: BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers
- Authors: Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, Furu Wei
- Abstract summary: We propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction.
We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches.
Experiments on image classification and semantic segmentation show that our approach outperforms all compared MIM methods.
- Score: 117.79456335844439
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked image modeling (MIM) has demonstrated impressive results in
self-supervised representation learning by recovering corrupted image patches.
However, most methods still operate on low-level image pixels, which hinders
the exploitation of high-level semantics for representation models. In this
study, we propose to use a semantic-rich visual tokenizer as the reconstruction
target for masked prediction, providing a systematic way to promote MIM from
pixel-level to semantic-level. Specifically, we introduce vector-quantized
knowledge distillation to train the tokenizer, which discretizes a continuous
semantic space to compact codes. We then pretrain vision Transformers by
predicting the original visual tokens for the masked image patches. Moreover,
we encourage the model to explicitly aggregate patch information into a global
image representation, which facilitates linear probing. Experiments on image
classification and semantic segmentation show that our approach outperforms all
compared MIM methods. On ImageNet-1K (224 size), the base-size BEiT v2 achieves
85.5% top-1 accuracy for fine-tuning and 80.1% top-1 accuracy for linear
probing. The large-size BEiT v2 obtains 87.3% top-1 accuracy for ImageNet-1K
(224 size) fine-tuning, and 56.7% mIoU on ADE20K for semantic segmentation. The
code and pretrained models are available at https://aka.ms/beit.
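The core mechanics described above, discretizing continuous patch features into codebook indices and then predicting the indices at masked positions, can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: the nearest-neighbor codebook lookup stands in for the vector-quantization step, the codebook size (8192) and code dimension (32) follow the paper's reported configuration, and the uniform-random 40% mask is a simplification of the blockwise masking used in practice.

```python
import numpy as np

def quantize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each patch feature to the index of its nearest codebook entry.

    features: (num_patches, dim) continuous semantic features from the teacher
    codebook: (codebook_size, dim) discrete codes
    returns:  (num_patches,) integer visual-token ids
    """
    # Pairwise squared L2 distances: ||f||^2 - 2 f.c + ||c||^2
    d = (
        (features ** 2).sum(axis=1, keepdims=True)
        - 2.0 * features @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    return d.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.standard_normal((8192, 32))  # 8192 codes of dim 32, per the paper
patches = rng.standard_normal((196, 32))    # 14x14 patches of a 224x224 image

tokens = quantize(patches, codebook)        # one discrete token id per patch

# During pretraining, a subset of patches is masked and the vision Transformer
# is trained to predict the visual tokens at those positions.
mask = rng.permutation(196)[: int(0.4 * 196)]  # uniform-random here; blockwise in the paper
targets = tokens[mask]                          # prediction targets for masked patches
print(tokens.shape, targets.shape)
```

In the actual method the codebook is not random but is learned jointly with the tokenizer via vector-quantized knowledge distillation, so that each code captures high-level semantics rather than pixel statistics.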
Related papers
- Centroid-centered Modeling for Efficient Vision Transformer Pre-training [109.18486172045701]
Masked Image Modeling (MIM) is a new self-supervised vision pre-training paradigm using Vision Transformer (ViT)
Our proposed approach, CCViT, leverages k-means clustering to obtain centroids for image modeling without supervised training of a tokenizer model.
Experiments show that the ViT-B model with only 300 epochs achieves 84.3% top-1 accuracy on ImageNet-1K classification and 51.6% on ADE20K semantic segmentation.
arXiv Detail & Related papers (2023-03-08T15:34:57Z)
- Good helper is around you: Attention-driven Masked Image Modeling [12.961634455083775]
Masked image modeling (MIM) has shown a huge potential in self-supervised learning.
We propose an Attention-driven Masking and Throwing strategy (AMT).
AMT improves the linear probing accuracy of MAE by 2.9%-5.9% on CIFAR-10/100, STL-10, Tiny ImageNet, and ImageNet-1K.
arXiv Detail & Related papers (2022-11-28T14:38:19Z)
- A Unified View of Masked Image Modeling [117.79456335844439]
Masked image modeling has demonstrated great potential to eliminate the label-hungry problem of training large-scale vision Transformers.
We introduce a simple yet effective method, termed as MaskDistill, which reconstructs normalized semantic features from teacher models at the masked positions.
Experimental results on image classification and semantic segmentation show that MaskDistill achieves comparable or superior performance than state-of-the-art methods.
arXiv Detail & Related papers (2022-10-19T14:59:18Z)
- mc-BEiT: Multi-choice Discretization for Image BERT Pre-training [52.04866462439979]
Image BERT pre-training with masked image modeling (MIM) is a popular practice to cope with self-supervised representation learning.
We introduce an improved BERT-style image pre-training method, namely mc-BEiT, which performs MIM proxy tasks towards eased and refined multi-choice training objectives.
arXiv Detail & Related papers (2022-03-29T09:08:18Z)
- SimMIM: A Simple Framework for Masked Image Modeling [29.015777125540613]
This paper presents SimMIM, a simple framework for masked image modeling.
We study the major components of the framework and find that simple designs for each component yield very strong representation learning performance.
We also leverage this approach to train a 3B-parameter model; using 40x less data than previous practice, it achieves state-of-the-art results on four representative vision benchmarks.
arXiv Detail & Related papers (2021-11-18T18:59:45Z)
- iBOT: Image BERT Pre-Training with Online Tokenizer [23.997853010642046]
We study masked image modeling (MIM) and indicate the advantages and challenges of using a semantically meaningful visual tokenizer.
We present a self-supervised framework iBOT that can perform masked prediction with an online tokenizer.
We show the prominence of iBOT by achieving an 81.6% linear probing accuracy and an 86.3% fine-tuning accuracy evaluated on ImageNet-1K.
arXiv Detail & Related papers (2021-11-15T15:18:05Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling these two designs enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
- Seed the Views: Hierarchical Semantic Alignment for Contrastive Representation Learning [116.91819311885166]
We propose a hierarchical semantic alignment strategy that expands the views generated by a single image to cross-samples and multi-level representations.
Our method, termed CsMl, integrates multi-level visual representations across samples in a robust way.
arXiv Detail & Related papers (2020-12-04T17:26:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.