Corrupted Image Modeling for Self-Supervised Visual Pre-Training
- URL: http://arxiv.org/abs/2202.03382v1
- Date: Mon, 7 Feb 2022 17:59:04 GMT
- Title: Corrupted Image Modeling for Self-Supervised Visual Pre-Training
- Authors: Yuxin Fang, Li Dong, Hangbo Bao, Xinggang Wang, Furu Wei
- Abstract summary: We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training.
CIM uses an auxiliary generator with a small trainable BEiT to corrupt the input image instead of using artificial mask tokens.
After pre-training, the enhancer can be used as a high-capacity visual encoder for downstream tasks.
- Score: 103.99311611776697
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Corrupted Image Modeling (CIM) for self-supervised visual
pre-training. CIM uses an auxiliary generator with a small trainable BEiT to
corrupt the input image instead of using artificial mask tokens, where some
patches are randomly selected and replaced with plausible alternatives sampled
from the BEiT output distribution. Given this corrupted image, an enhancer
network learns to either recover all the original image pixels, or predict
whether each visual token is replaced by a generator sample or not. The
generator and the enhancer are simultaneously trained and synergistically
updated. After pre-training, the enhancer can be used as a high-capacity visual
encoder for downstream tasks. CIM is a general and flexible visual pre-training
framework that is suitable for various network architectures. For the first
time, CIM demonstrates that both ViT and CNN can learn rich visual
representations using a unified, non-Siamese framework. Experimental results
show that our approach achieves compelling results in vision benchmarks, such
as ImageNet classification and ADE20K semantic segmentation. For example,
300-epoch CIM pre-trained vanilla ViT-Base/16 and ResNet-50 obtain 83.3 and
80.6 Top-1 fine-tuning accuracy on ImageNet-1K image classification
respectively.
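
To make the two enhancer objectives concrete, below is a minimal PyTorch-style sketch of one CIM pre-training step. It is an illustrative reconstruction from the abstract, not the authors' code: the `tokenizer`, `generator`, and `enhancer` modules and their interfaces (`encode`, `decode`, `recover`, `detect`) are hypothetical placeholders, and details such as the frozen visual tokenizer, patch-normalized pixel targets, and loss weighting are simplified or omitted.

```python
import torch
import torch.nn.functional as F

def cim_step(images, tokenizer, generator, enhancer, mask_ratio=0.4):
    """Illustrative CIM step: a small BEiT-style generator corrupts the image by
    resampling masked visual tokens; the enhancer learns either to recover the
    original pixels or to detect which tokens were replaced."""
    # Discrete visual tokens from a frozen image tokenizer (assumed interface).
    with torch.no_grad():
        tokens = tokenizer.encode(images)                        # (B, N) token ids

    B, N = tokens.shape
    mask = torch.rand(B, N, device=images.device) < mask_ratio   # positions to corrupt

    # The small trainable BEiT generator predicts a token distribution at masked
    # positions; plausible replacements are sampled from it (sampling is
    # non-differentiable, so no gradient flows from the enhancer to the generator).
    gen_logits = generator(images, mask)                         # (B, N, vocab_size)
    sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted_tokens = torch.where(mask, sampled, tokens)

    # Decode the partially resampled tokens back into a corrupted image.
    with torch.no_grad():
        corrupted_images = tokenizer.decode(corrupted_tokens)

    # Enhancer objective A: recover all original pixels from the corrupted image.
    loss_respix = F.l1_loss(enhancer.recover(corrupted_images), images)

    # Enhancer objective B: per-position replaced/original detection.
    # A sampled token can coincide with the original, so compare token ids.
    replaced = (corrupted_tokens != tokens).float()
    loss_revdet = F.binary_cross_entropy_with_logits(
        enhancer.detect(corrupted_images), replaced)

    # The generator keeps its own masked-token prediction loss and is updated
    # simultaneously with the enhancer.
    loss_gen = F.cross_entropy(gen_logits[mask], tokens[mask])

    return loss_gen, loss_respix, loss_revdet
```

In the paper, pixel recovery and replaced-token detection are described as alternative enhancer objectives rather than a summed loss, and after pre-training only the enhancer is kept as the downstream visual encoder.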
Related papers
- Improve Supervised Representation Learning with Masked Image Modeling [30.30649867772395]
We propose a simple yet effective setup that can easily integrate masked image modeling into existing supervised training paradigms.
We show that, with minimal change in architecture and no inference overhead, this setup improves the quality of the learned representations.
arXiv Detail & Related papers (2023-12-01T22:03:25Z)
- Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach named Masked and Permuted Vision Transformer (MaPeT).
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
arXiv Detail & Related papers (2023-06-12T18:12:19Z)
- Centroid-centered Modeling for Efficient Vision Transformer Pre-training [44.24223088955106]
Masked Image Modeling (MIM) is a new self-supervised vision pre-training paradigm using a Vision Transformer (ViT).
Our proposed centroid-based approach, CCViT, leverages k-means clustering to obtain centroids for image modeling without supervised training of the tokenizer model.
Our approach achieves competitive results with recent baselines without external supervision and distillation training from other models.
arXiv Detail & Related papers (2023-03-08T15:34:57Z)
- FastMIM: Expediting Masked Image Modeling Pre-training for Vision [65.47756720190155]
FastMIM is a framework for pre-training vision backbones with low-resolution input images.
It reconstructs Histograms of Oriented Gradients (HOG) features instead of the original RGB values of the input images.
It can achieve 83.8%/84.1% top-1 accuracy on ImageNet-1K with ViT-B/Swin-B as backbones.
arXiv Detail & Related papers (2022-12-13T14:09:32Z)
- Co-training $2^L$ Submodels for Visual Recognition [67.02999567435626]
Submodel co-training is a regularization method related to co-training, self-distillation and stochastic depth.
We show that submodel co-training is effective to train backbones for recognition tasks such as image classification and semantic segmentation.
arXiv Detail & Related papers (2022-12-09T14:38:09Z)
- MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis [33.46831766206675]
MAsked Generative Encoder (MAGE) is the first framework to unify SOTA image generation and self-supervised representation learning.
Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN at inputs and outputs.
On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID in the task of class-unconditional image generation.
arXiv Detail & Related papers (2022-11-16T18:59:02Z)
- mc-BEiT: Multi-choice Discretization for Image BERT Pre-training [52.04866462439979]
Image BERT pre-training with masked image modeling (MIM) is a popular practice for self-supervised representation learning.
We introduce an improved BERT-style image pre-training method, namely mc-BEiT, which performs MIM proxy tasks towards eased and refined multi-choice training objectives.
arXiv Detail & Related papers (2022-03-29T09:08:18Z)
- Vector-quantized Image Modeling with Improved VQGAN [93.8443646643864]
We propose a Vector-quantized Image Modeling approach that involves pretraining a Transformer to predict image tokens autoregressively.
We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity.
When trained on ImageNet at 256x256 resolution, we achieve Inception Score (IS) of 175.1 and Frechet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN.
arXiv Detail & Related papers (2021-10-09T18:36:00Z)
- BEiT: BERT Pre-Training of Image Transformers [43.704968112586876]
We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers.
Specifically, each image has two views in our pre-training, i.e., image patches and visual tokens.
We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and feed them into the backbone Transformer.
The pre-training objective is to recover the original visual tokens based on the corrupted image patches (see the sketch after this list).
arXiv Detail & Related papers (2021-06-15T16:02:37Z)
- Rethinking CNN Models for Audio Classification [20.182928938110923]
ImageNet-pretrained standard deep CNN models can be used as strong baseline networks for audio classification.
We systematically study how much of the pretrained weights is useful for learning spectrograms.
We show that, for a given standard model, using pretrained weights is better than using randomly initialized weights.
arXiv Detail & Related papers (2020-07-22T01:31:44Z)
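
As referenced in the BEiT entry above, the following is a minimal sketch of a BEiT-style masked image modeling objective. It is illustrative only: `tokenizer`, `patch_embed`, `backbone`, and `head` are hypothetical placeholders with assumed interfaces, not the released BEiT code.

```python
import torch
import torch.nn.functional as F

def beit_mim_loss(images, tokenizer, patch_embed, mask_token, backbone, head,
                  mask_ratio=0.4):
    """Illustrative BEiT-style objective: mask some patch embeddings and predict
    the discrete visual tokens of the masked patches."""
    # Discrete visual tokens from a frozen image tokenizer (assumed interface).
    with torch.no_grad():
        visual_tokens = tokenizer.encode(images)        # (B, N) target token ids

    patches = patch_embed(images)                       # (B, N, D) patch embeddings
    B, N, D = patches.shape
    mask = torch.rand(B, N, device=images.device) < mask_ratio

    # Replace masked patch embeddings with a learnable [MASK] embedding of shape (D,).
    corrupted = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, D), patches)

    hidden = backbone(corrupted)                        # (B, N, D) Transformer features
    logits = head(hidden)                               # (B, N, vocab_size)

    # Cross-entropy only at masked positions: recover the original visual tokens.
    return F.cross_entropy(logits[mask], visual_tokens[mask])
```

In CIM above, a small BEiT of this kind serves as the auxiliary generator, but the enhancer never sees artificial [MASK] embeddings; it only receives the decoded, plausibly corrupted image.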
This list is automatically generated from the titles and abstracts of the papers on this site.