Corrupted Image Modeling for Self-Supervised Visual Pre-Training
- URL: http://arxiv.org/abs/2202.03382v1
- Date: Mon, 7 Feb 2022 17:59:04 GMT
- Title: Corrupted Image Modeling for Self-Supervised Visual Pre-Training
- Authors: Yuxin Fang, Li Dong, Hangbo Bao, Xinggang Wang, Furu Wei
- Abstract summary: We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training.
CIM uses an auxiliary generator with a small trainable BEiT to corrupt the input image instead of using artificial mask tokens.
After pre-training, the enhancer can be used as a high-capacity visual encoder for downstream tasks.
- Score: 103.99311611776697
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Corrupted Image Modeling (CIM) for self-supervised visual
pre-training. CIM uses an auxiliary generator with a small trainable BEiT to
corrupt the input image instead of using artificial mask tokens, where some
patches are randomly selected and replaced with plausible alternatives sampled
from the BEiT output distribution. Given this corrupted image, an enhancer
network learns to either recover all the original image pixels, or predict
whether each visual token is replaced by a generator sample or not. The
generator and the enhancer are simultaneously trained and synergistically
updated. After pre-training, the enhancer can be used as a high-capacity visual
encoder for downstream tasks. CIM is a general and flexible visual pre-training
framework that is suitable for various network architectures. For the first
time, CIM demonstrates that both ViT and CNN can learn rich visual
representations using a unified, non-Siamese framework. Experimental results
show that our approach achieves compelling results in vision benchmarks, such
as ImageNet classification and ADE20K semantic segmentation. For example,
300-epoch CIM pre-trained vanilla ViT-Base/16 and ResNet-50 obtain 83.3 and
80.6 Top-1 fine-tuning accuracy on ImageNet-1K image classification
respectively.
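
To make the two enhancer objectives concrete, below is a minimal PyTorch-style sketch of one CIM pre-training step. It is an illustrative reconstruction from the abstract, not the authors' code: the `tokenizer`, `generator`, and `enhancer` modules and their interfaces (`encode`, `decode`, `recover`, `detect`) are hypothetical placeholders, and details such as the frozen visual tokenizer, patch-normalized pixel targets, and loss weighting are simplified or omitted.

```python
import torch
import torch.nn.functional as F

def cim_step(images, tokenizer, generator, enhancer, mask_ratio=0.4):
    """Illustrative CIM step: a small BEiT-style generator corrupts the image by
    resampling masked visual tokens; the enhancer learns either to recover the
    original pixels or to detect which tokens were replaced."""
    # Discrete visual tokens from a frozen image tokenizer (assumed interface).
    with torch.no_grad():
        tokens = tokenizer.encode(images)                        # (B, N) token ids

    B, N = tokens.shape
    mask = torch.rand(B, N, device=images.device) < mask_ratio   # positions to corrupt

    # The small trainable BEiT generator predicts a token distribution at masked
    # positions; plausible replacements are sampled from it (sampling is
    # non-differentiable, so no gradient flows from the enhancer to the generator).
    gen_logits = generator(images, mask)                         # (B, N, vocab_size)
    sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted_tokens = torch.where(mask, sampled, tokens)

    # Decode the partially resampled tokens back into a corrupted image.
    with torch.no_grad():
        corrupted_images = tokenizer.decode(corrupted_tokens)

    # Enhancer objective A: recover all original pixels from the corrupted image.
    loss_respix = F.l1_loss(enhancer.recover(corrupted_images), images)

    # Enhancer objective B: per-position replaced/original detection.
    # A sampled token can coincide with the original, so compare token ids.
    replaced = (corrupted_tokens != tokens).float()
    loss_revdet = F.binary_cross_entropy_with_logits(
        enhancer.detect(corrupted_images), replaced)

    # The generator keeps its own masked-token prediction loss and is updated
    # simultaneously with the enhancer.
    loss_gen = F.cross_entropy(gen_logits[mask], tokens[mask])

    return loss_gen, loss_respix, loss_revdet
```

In the paper, pixel recovery and replaced-token detection are described as alternative enhancer objectives rather than a summed loss, and after pre-training only the enhancer is kept as the downstream visual encoder.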
Related papers
- Improve Supervised Representation Learning with Masked Image Modeling [30.30649867772395]
We propose a simple yet effective setup that can easily integrate masked image modeling into existing supervised training paradigms.
We show that, with minimal change in architecture and no inference overhead, this setup improves the quality of the learned representations.
arXiv Detail & Related papers (2023-12-01T22:03:25Z)
- Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach named Masked and Permuted Vision Transformer (MaPeT).
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
arXiv Detail & Related papers (2023-06-12T18:12:19Z)
- Centroid-centered Modeling for Efficient Vision Transformer Pre-training [44.24223088955106]
Masked Image Modeling (MIM) is a new self-supervised vision pre-training paradigm using a Vision Transformer (ViT).
Our proposed centroid-based approach, CCViT, leverages k-means clustering to obtain centroids for image modeling without supervised training of the tokenizer model.
Our approach achieves competitive results with recent baselines without external supervision and distillation training from other models.
arXiv Detail & Related papers (2023-03-08T15:34:57Z)
- FastMIM: Expediting Masked Image Modeling Pre-training for Vision [65.47756720190155]
FastMIM is a framework for pre-training vision backbones with low-resolution input images.
It reconstructs Histograms of Oriented Gradients (HOG) features instead of the original RGB values of the input images.
It can achieve 83.8%/84.1% top-1 accuracy on ImageNet-1K with ViT-B/Swin-B as backbones.
arXiv Detail & Related papers (2022-12-13T14:09:32Z)
- Co-training $2^L$ Submodels for Visual Recognition [67.02999567435626]
Submodel co-training is a regularization method related to co-training, self-distillation and stochastic depth.
We show that submodel co-training is effective to train backbones for recognition tasks such as image classification and semantic segmentation.
arXiv Detail & Related papers (2022-12-09T14:38:09Z)
- MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis [33.46831766206675]
MAsked Generative Encoder (MAGE) is the first framework to unify SOTA image generation and self-supervised representation learning.
Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN at inputs and outputs.
On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID in the task of class-unconditional image generation.
arXiv Detail & Related papers (2022-11-16T18:59:02Z)
- mc-BEiT: Multi-choice Discretization for Image BERT Pre-training [52.04866462439979]
Image BERT pre-training with masked image modeling (MIM) is a popular practice for self-supervised representation learning.
We introduce an improved BERT-style image pre-training method, namely mc-BEiT, which performs MIM proxy tasks towards eased and refined multi-choice training objectives.
arXiv Detail & Related papers (2022-03-29T09:08:18Z)
- Vector-quantized Image Modeling with Improved VQGAN [93.8443646643864]
We propose a Vector-quantized Image Modeling approach that involves pretraining a Transformer to predict image tokens autoregressively.
We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity.
When trained on ImageNet at 256x256 resolution, we achieve Inception Score (IS) of 175.1 and Frechet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN.
arXiv Detail & Related papers (2021-10-09T18:36:00Z)
- BEiT: BERT Pre-Training of Image Transformers [43.704968112586876]
We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers.
Specifically, each image has two views in our pre-training, i.e., image patches and visual tokens.
We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and feed them into the backbone Transformer.
The pre-training objective is to recover the original visual tokens based on the corrupted image patches (see the sketch after this list).
arXiv Detail & Related papers (2021-06-15T16:02:37Z)
- Rethinking CNN Models for Audio Classification [20.182928938110923]
ImageNet-pretrained standard deep CNN models can be used as strong baseline networks for audio classification.
We systematically study how much of the pretrained weights is useful for learning spectrograms.
We show that, for a given standard model, using pretrained weights is better than using randomly initialized weights.
arXiv Detail & Related papers (2020-07-22T01:31:44Z)
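
As referenced in the BEiT entry above, the following is a minimal sketch of a BEiT-style masked image modeling objective. It is illustrative only: `tokenizer`, `patch_embed`, `backbone`, and `head` are hypothetical placeholders with assumed interfaces, not the released BEiT code.

```python
import torch
import torch.nn.functional as F

def beit_mim_loss(images, tokenizer, patch_embed, mask_token, backbone, head,
                  mask_ratio=0.4):
    """Illustrative BEiT-style objective: mask some patch embeddings and predict
    the discrete visual tokens of the masked patches."""
    # Discrete visual tokens from a frozen image tokenizer (assumed interface).
    with torch.no_grad():
        visual_tokens = tokenizer.encode(images)        # (B, N) target token ids

    patches = patch_embed(images)                       # (B, N, D) patch embeddings
    B, N, D = patches.shape
    mask = torch.rand(B, N, device=images.device) < mask_ratio

    # Replace masked patch embeddings with a learnable [MASK] embedding of shape (D,).
    corrupted = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, D), patches)

    hidden = backbone(corrupted)                        # (B, N, D) Transformer features
    logits = head(hidden)                               # (B, N, vocab_size)

    # Cross-entropy only at masked positions: recover the original visual tokens.
    return F.cross_entropy(logits[mask], visual_tokens[mask])
```

In CIM above, a small BEiT of this kind serves as the auxiliary generator, but the enhancer never sees artificial [MASK] embeddings; it only receives the decoded, plausibly corrupted image.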
This list is automatically generated from the titles and abstracts of the papers on this site.