BEiT: BERT Pre-Training of Image Transformers
- URL: http://arxiv.org/abs/2106.08254v1
- Date: Tue, 15 Jun 2021 16:02:37 GMT
- Title: BEiT: BERT Pre-Training of Image Transformers
- Authors: Hangbo Bao, Li Dong, Furu Wei
- Abstract summary: We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers.
Specifically, each image has two views in our pre-training, i.e., image patches and visual tokens.
We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and feed them into the backbone Transformer.
The pre-training objective is to recover the original visual tokens based on the corrupted image patches.
- Score: 43.704968112586876
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a self-supervised vision representation model BEiT, which stands
for Bidirectional Encoder representation from Image Transformers. Following
BERT developed in the natural language processing area, we propose a masked
image modeling task to pretrain vision Transformers. Specifically, each image
has two views in our pre-training, i.e., image patches (such as 16x16 pixels),
and visual tokens (i.e., discrete tokens). We first "tokenize" the original
image into visual tokens. Then we randomly mask some image patches and feed them
into the backbone Transformer. The pre-training objective is to recover the
original visual tokens based on the corrupted image patches. After pre-training
BEiT, we directly fine-tune the model parameters on downstream tasks by
appending task layers upon the pretrained encoder. Experimental results on
image classification and semantic segmentation show that our model achieves
competitive results with previous pre-training methods. For example, base-size
BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming
from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size
BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with
supervised pre-training on ImageNet-22K (85.2%). The code and pretrained models
are available at https://aka.ms/beit.
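As a rough illustration of the pre-training recipe in the abstract, the sketch below shows one masked image modeling step in PyTorch: patch embeddings are partially replaced by a learnable [MASK] embedding, and the model is trained to predict the discrete visual tokens at the masked positions. The tokenizer and backbone here are stand-in modules, and the uniform 40% mask ratio, sizes, and module names are illustrative assumptions rather than the released BEiT code (which, among other differences, uses blockwise masking).

```python
# Minimal sketch of a BEiT-style masked image modeling (MIM) step, assuming:
#   - a frozen "tokenizer" that maps an image to a grid of discrete visual tokens
#     (BEiT uses the DALL-E dVAE; here it is a stand-in callable), and
#   - a ViT-style "backbone" that returns one feature vector per patch.
# Sizes, names, and the 40% mask ratio are illustrative, not the released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 8192    # size of the visual-token vocabulary (dVAE codebook)
NUM_PATCHES = 196    # 14 x 14 patches for a 224x224 image with 16x16 patches
DIM = 768            # backbone hidden size (ViT-Base)
MASK_RATIO = 0.4     # fraction of patches replaced by the learnable [MASK] embedding

class MIMHead(nn.Module):
    """Predicts a visual-token id for every masked patch position."""
    def __init__(self, dim=DIM, vocab_size=VOCAB_SIZE):
        super().__init__()
        self.mask_embed = nn.Parameter(torch.zeros(1, 1, dim))  # learnable [MASK]
        self.classifier = nn.Linear(dim, vocab_size)

    def corrupt(self, patch_embeds):
        """Randomly replace a subset of patch embeddings with the [MASK] embedding."""
        b, n, d = patch_embeds.shape
        mask = torch.rand(b, n, device=patch_embeds.device) < MASK_RATIO  # (B, N) bool
        corrupted = torch.where(
            mask.unsqueeze(-1), self.mask_embed.expand(b, n, d), patch_embeds
        )
        return corrupted, mask

def mim_loss(backbone, head, tokenizer, patch_embeds, images):
    """Cross-entropy between predicted and ground-truth visual tokens at masked positions."""
    with torch.no_grad():
        target_tokens = tokenizer(images)          # (B, N) long ids in [0, VOCAB_SIZE)
    corrupted, mask = head.corrupt(patch_embeds)   # (B, N, D), (B, N)
    features = backbone(corrupted)                 # (B, N, D) contextualized patch features
    logits = head.classifier(features)             # (B, N, VOCAB_SIZE)
    return F.cross_entropy(logits[mask], target_tokens[mask])
```

After this pre-training stage, the MIM head would be discarded and a task-specific layer appended to the pretrained encoder for fine-tuning, as described in the abstract.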
Related papers
- Bridging The Gaps Between Token Pruning and Full Pre-training via Masked
Fine-tuning [19.391064062033436]
Dynamic vision transformers accelerate inference by pruning redundant tokens.
Current base models usually adopt full-image training, using full images as inputs and keeping whole feature maps throughout the forward pass.
Inspired by MAE, which performs a masking-and-reconstruction self-supervised task, we devise masked fine-tuning to bridge the gaps between pre-trained base models and token-pruning-based dynamic vision transformers.
arXiv Detail & Related papers (2023-10-26T06:03:18Z)
- BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers [117.79456335844439]
We propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction.
We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches.
Experiments on image classification and semantic segmentation show that our approach outperforms all compared MIM methods.
arXiv Detail & Related papers (2022-08-12T16:48:10Z)
- MILAN: Masked Image Pretraining on Language Assisted Representation [30.24762638226569]
In this work, we propose masked image pretraining on language assisted representation, dubbed MILAN.
Instead of predicting raw pixels or low level features, our pretraining objective is to reconstruct the image features with substantial semantic signals.
Experimental results demonstrate that MILAN delivers higher accuracy than previous works.
arXiv Detail & Related papers (2022-08-11T21:58:36Z)
- Corrupted Image Modeling for Self-Supervised Visual Pre-Training [103.99311611776697]
We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training.
CIM uses a small trainable BEiT as an auxiliary generator to corrupt the input image instead of using artificial mask tokens.
After pre-training, the enhancer, i.e., the main network trained to undo these corruptions, can be used as a high-capacity visual encoder for downstream tasks.
arXiv Detail & Related papers (2022-02-07T17:59:04Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels (a minimal sketch of this objective follows after this list).
Coupling an asymmetric encoder-decoder design with a high masking ratio enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
- Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet [86.95679590801494]
We explore the potential of vision transformers in ImageNet classification by developing a bag of training techniques.
We show that by slightly tuning the structure of vision transformers and introducing token labeling, our models are able to achieve better results than the CNN counterparts.
arXiv Detail & Related papers (2021-04-22T04:43:06Z)
- Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet [128.96032932640364]
We propose a new Tokens-To-Token Vision Transformer (T2T-ViT) to solve vision tasks.
T2T-ViT roughly halves the parameter count and MACs of vanilla ViT, while achieving more than 2.5% improvement when trained from scratch on ImageNet.
For example, a T2T-ViT of comparable size to ResNet50 can achieve 80.7% top-1 accuracy on ImageNet.
arXiv Detail & Related papers (2021-01-28T13:25:28Z)
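For contrast with BEiT's discrete-token target, below is a minimal sketch of the MAE-style objective summarized in the Masked Autoencoders entry above: random patches are dropped, and mean-squared error is computed between reconstructed and original pixels on the masked positions only. The encoder/decoder stand-ins and helper names are assumptions; the real MAE additionally inserts mask tokens with positional embeddings before its lightweight decoder.

```python
# Minimal sketch of an MAE-style objective: mask random patches, regress raw pixels.
# The 75% mask ratio follows the paper's reported setting; encoder/decoder are stand-ins.
import torch
import torch.nn.functional as F

def random_masking(patches, mask_ratio=0.75):
    """Keep a random subset of patches; return kept patches and indices of masked ones."""
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n, device=patches.device)
    ids_shuffle = noise.argsort(dim=1)               # random permutation per image
    ids_keep, ids_mask = ids_shuffle[:, :n_keep], ids_shuffle[:, n_keep:]
    kept = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    return kept, ids_mask

def mae_loss(encoder, decoder, patches):
    """MSE between reconstructed and original pixels, computed on masked patches only."""
    kept, ids_mask = random_masking(patches)
    latent = encoder(kept)       # encode only the visible patches
    pred = decoder(latent)       # (B, N, D) pixel predictions for all positions
                                 # (the real MAE inserts mask tokens before decoding; omitted here)
    d = patches.shape[-1]
    target = torch.gather(patches, 1, ids_mask.unsqueeze(-1).expand(-1, -1, d))
    pred_masked = torch.gather(pred, 1, ids_mask.unsqueeze(-1).expand(-1, -1, d))
    return F.mse_loss(pred_masked, target)
```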