mc-BEiT: Multi-choice Discretization for Image BERT Pre-training
- URL: http://arxiv.org/abs/2203.15371v1
- Date: Tue, 29 Mar 2022 09:08:18 GMT
- Title: mc-BEiT: Multi-choice Discretization for Image BERT Pre-training
- Authors: Xiaotong Li, Yixiao Ge, Kun Yi, Zixuan Hu, Ying Shan, Ling-Yu Duan
- Abstract summary: Image BERT pre-training with masked image modeling (MIM) is a popular practice for self-supervised representation learning.
We introduce an improved BERT-style image pre-training method, namely mc-BEiT, which performs MIM proxy tasks towards eased and refined multi-choice training objectives.
- Score: 52.04866462439979
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image BERT pre-training with masked image modeling (MIM) has become
a popular practice for self-supervised representation learning. A seminal work,
BEiT, casts MIM as a classification task with a visual vocabulary, tokenizing
the continuous visual signals into discrete vision tokens using a pre-learned
dVAE. Although a feasible solution, this improper discretization hinders
further improvements of image pre-training. Since image discretization has no
ground-truth answers, we believe that a masked patch should not be assigned a
unique token id even if a better tokenizer can be obtained. In this work, we
introduce an improved BERT-style image pre-training method, namely mc-BEiT,
which performs MIM proxy tasks towards eased and refined multi-choice training
objectives. Specifically, the multi-choice supervision for the masked image
patches is formed from the soft probability vectors of the discrete token ids,
which are predicted by the off-the-shelf image tokenizer and further refined by
high-level inter-patch perceptions, based on the observation that similar
patches should share their choices. Extensive experiments on classification,
segmentation, and detection tasks demonstrate the superiority of our method:
e.g., the pre-trained ViT-B achieves 84.1% top-1 fine-tuning accuracy on
ImageNet-1K classification, 51.2% mIoU on ADE20K semantic segmentation, and
51.2% AP^b and 44.3% AP^m for object detection and instance segmentation on
COCO, outperforming the competitive counterparts.
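
To make the training objective concrete, below is a minimal sketch of how the soft multi-choice targets and the masked-prediction loss described above could be computed. This is an illustration under assumptions, not the paper's released code: the off-the-shelf tokenizer is assumed to expose per-patch logits over its visual vocabulary, and the function names, the temperature tau, the blending weight omega, and the exact affinity scheme are hypothetical stand-ins for the paper's formulation.

```python
import torch.nn.functional as F

def multi_choice_targets(tokenizer_logits, patch_features, tau=1.0, omega=0.8):
    """Build soft multi-choice targets for masked image patches.

    tokenizer_logits: (B, N, V) per-patch logits over the visual vocabulary,
                      from an off-the-shelf image tokenizer (assumption).
    patch_features:   (B, N, D) high-level patch embeddings used to measure
                      inter-patch similarity ("similar patches share choices").
    """
    # Soften the tokenizer's hard token assignment into a probability vector.
    p = F.softmax(tokenizer_logits / tau, dim=-1)              # (B, N, V)

    # Inter-patch affinity from normalized feature similarity.
    f = F.normalize(patch_features, dim=-1)
    affinity = F.softmax(f @ f.transpose(1, 2) / tau, dim=-1)  # (B, N, N)

    # Propagate distributions between similar patches, then blend.
    refined = affinity @ p                                     # (B, N, V)
    return omega * p + (1.0 - omega) * refined

def mim_soft_ce(pred_logits, targets, mask):
    """Soft cross-entropy over masked positions only.

    pred_logits: (B, N, V) predictions of the ViT being pre-trained.
    targets:     (B, N, V) soft multi-choice targets (treated as constants).
    mask:        (B, N) boolean, True where a patch was masked.
    """
    log_q = F.log_softmax(pred_logits, dim=-1)
    loss = -(targets.detach() * log_q).sum(dim=-1)             # (B, N)
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```

The point the sketch captures is that each masked patch is supervised by a distribution over token ids (its "multiple choices"), smoothed by the distributions of similar patches, rather than by a single hard id.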
Related papers
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images from limited samples is a challenge.
We propose an intra-task mutual attention method for few-shot learning that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z)
- Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting [133.55037976429088]
We investigate the adversarial robustness of vision transformers equipped with BERT pretraining (e.g., BEiT, MAE).
A surprising observation is that MAE has significantly worse adversarial robustness than other BERT pretraining methods.
We propose a simple yet effective way to boost the adversarial robustness of MAE.
arXiv Detail & Related papers (2023-08-20T16:27:17Z)
- Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT).
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
arXiv Detail & Related papers (2023-06-12T18:12:19Z)
- BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers [117.79456335844439]
We propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction.
We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches (see the hard-target sketch after this list).
Experiments on image classification and semantic segmentation show that our approach outperforms all compared MIM methods.
arXiv Detail & Related papers (2022-08-12T16:48:10Z)
- SdAE: Self-distillated Masked Autoencoder [95.3684955370897]
This paper proposes SdAE, a self-distillated masked autoencoder network.
With only 300 epochs pre-training, a vanilla ViT-Base model achieves an 84.1% fine-tuning accuracy on ImageNet-1k classification.
arXiv Detail & Related papers (2022-07-31T15:07:25Z)
- Corrupted Image Modeling for Self-Supervised Visual Pre-Training [103.99311611776697]
We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training.
CIM uses an auxiliary generator (a small trainable BEiT) to corrupt the input image instead of using artificial mask tokens.
After pre-training, the enhancer can be used as a high-capacity visual encoder for downstream tasks.
arXiv Detail & Related papers (2022-02-07T17:59:04Z)
- iBOT: Image BERT Pre-Training with Online Tokenizer [23.997853010642046]
We study masked image modeling (MIM) and indicate the advantages and challenges of using a semantically meaningful visual tokenizer.
We present a self-supervised framework iBOT that can perform masked prediction with an online tokenizer.
We show the prominence of iBOT by achieving an 81.6% linear probing accuracy and an 86.3% fine-tuning accuracy evaluated on ImageNet-1K.
arXiv Detail & Related papers (2021-11-15T15:18:05Z)
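
For contrast with the multi-choice objective sketched above, here is the hard-target formulation that BEiT and BEiT v2 cast MIM as (see the BEiT v2 entry above): every masked patch is supervised by the single token id the tokenizer assigns it. Again a hedged sketch under the same assumptions; names and signatures are illustrative, not taken from any released codebase.

```python
import torch.nn.functional as F

def mim_hard_ce(pred_logits, tokenizer_logits, mask):
    """BEiT-style MIM loss: each masked patch is assigned the single most
    likely token id from the (frozen) tokenizer and trained with ordinary
    cross-entropy, i.e., a one-hot rather than multi-choice target.

    pred_logits:      (B, N, V) predictions of the ViT being pre-trained.
    tokenizer_logits: (B, N, V) logits of the visual tokenizer (assumption).
    mask:             (B, N) boolean, True where a patch was masked.
    """
    token_ids = tokenizer_logits.argmax(dim=-1)  # (B, N) hard token ids
    return F.cross_entropy(pred_logits[mask], token_ids[mask])
```

This one-hot supervision is exactly the "improper discretization" the mc-BEiT abstract argues against: patches with ambiguous content are forced onto a unique token id even when several token choices would be equally plausible.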