MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of
Hierarchical Vision Transformers
- URL: http://arxiv.org/abs/2205.13137v4
- Date: Fri, 31 Mar 2023 09:26:28 GMT
- Title: MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of
Hierarchical Vision Transformers
- Authors: Jihao Liu, Xin Huang, Jinliang Zheng, Yu Liu, Hongsheng Li
- Abstract summary: Mixed and Masked AutoEncoder (MixMAE) is a simple but efficient pretraining method that is applicable to various hierarchical Vision Transformers.
This paper explores using Swin Transformer with a large window size and scales it up to a huge model size (600M parameters). Notably, MixMAE with Swin-B/W14 achieves 85.1% top-1 accuracy on ImageNet-1K after pretraining for 600 epochs.
- Score: 35.26148770111607
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose Mixed and Masked AutoEncoder (MixMAE), a simple but
efficient pretraining method that is applicable to various hierarchical Vision
Transformers. Existing masked image modeling (MIM) methods for hierarchical
Vision Transformers replace a random subset of input tokens with a special
[MASK] symbol and aim at reconstructing original image tokens from the
corrupted image. However, we find that using the [MASK] symbol greatly slows
down the training and causes pretraining-finetuning inconsistency, due to the
large masking ratio (e.g., 60% in SimMIM). On the other hand, MAE does not
introduce [MASK] tokens at its encoder at all, but it is not applicable to
hierarchical Vision Transformers. To solve the issue and accelerate the
pretraining of hierarchical models, we replace the masked tokens of one image
with visible tokens of another image, i.e., creating a mixed image. We then
conduct dual reconstruction to reconstruct the two original images from the
mixed input, which significantly improves efficiency. While MixMAE can be
applied to various hierarchical Transformers, this paper explores using Swin
Transformer with a large window size and scales it up to a huge model size (reaching
600M parameters). Empirical results demonstrate that MixMAE can learn
high-quality visual representations efficiently. Notably, MixMAE with
Swin-B/W14 achieves 85.1% top-1 accuracy on ImageNet-1K by pretraining for 600
epochs. Moreover, its transfer performance on 6 other datasets shows that
MixMAE has a better FLOPs / performance tradeoff than previous popular MIM
methods. Code is available at https://github.com/Sense-X/MixMIM.
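To make the mixing and dual-reconstruction idea concrete, here is a minimal PyTorch-style sketch of the core operation described in the abstract. It is illustrative only and not the official Sense-X/MixMIM code; the helper names (mix_tokens, dual_reconstruction_loss), the 0.5 mixing ratio, and the per-patch MSE target are assumptions made for the sketch.

```python
# Illustrative sketch (not the official Sense-X/MixMIM implementation):
# patch tokens of two images are combined with a random binary mask, and both
# originals are reconstructed from the single mixed input.
import torch

def mix_tokens(tokens_a, tokens_b, mix_ratio=0.5):
    """tokens_a, tokens_b: (B, N, D) patch tokens of two different images.
    Returns the mixed sequence and a boolean mask (True = token taken from A)."""
    B, N, _ = tokens_a.shape
    num_from_a = int(N * mix_ratio)
    # Random per-sample permutation of patch positions; the first `num_from_a`
    # ranks keep image A's tokens, the remaining positions come from image B.
    ranks = torch.rand(B, N, device=tokens_a.device).argsort(dim=1).argsort(dim=1)
    from_a = ranks < num_from_a
    mixed = torch.where(from_a.unsqueeze(-1), tokens_a, tokens_b)
    return mixed, from_a

def dual_reconstruction_loss(pred_a, pred_b, target_a, target_b, from_a):
    """Each image is scored only at the positions where its own tokens were
    replaced by the other image, as in standard masked image modeling."""
    per_patch_a = ((pred_a - target_a) ** 2).mean(dim=-1)  # (B, N)
    per_patch_b = ((pred_b - target_b) ** 2).mean(dim=-1)
    mask_a = (~from_a).float()  # positions of A overwritten by B
    mask_b = from_a.float()     # positions of B overwritten by A
    loss_a = (per_patch_a * mask_a).sum() / mask_a.sum().clamp(min=1)
    loss_b = (per_patch_b * mask_b).sum() / mask_b.sum().clamp(min=1)
    return loss_a + loss_b
```

Because every position in the mixed sequence carries a real token from one of the two images, no [MASK] placeholders enter the encoder, which is the source of the efficiency and pretraining-finetuning consistency gains described above.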
Related papers
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
- Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders [89.12558126877532]
We propose CropMAE, an alternative approach to the Siamese pre-training introduced by SiamMAE.
Our method exclusively considers pairs of cropped images sourced from the same image but cropped differently, deviating from the conventional pairs of frames extracted from a video.
CropMAE achieves the highest masking ratio to date (98.5%), enabling the reconstruction of images using only two visible patches.
arXiv Detail & Related papers (2024-03-26T16:04:19Z)
- Fast Training of Diffusion Models with Masked Transformers [107.77340216247516]
We propose an efficient approach to train large diffusion models with masked transformers.
Specifically, we randomly mask out a high proportion of patches in diffused input images during training.
Experiments on ImageNet-256x256 and ImageNet-512x512 show that our approach achieves generative performance competitive with, and even better than, the state-of-the-art Diffusion Transformer (DiT) model.
arXiv Detail & Related papers (2023-06-15T17:38:48Z)
- Efficient Masked Autoencoders with Self-Consistency [34.7076436760695]
Masked image modeling (MIM) has been recognized as a strong self-supervised pre-training method in computer vision.
We propose efficient masked autoencoders with self-consistency (EMAE) to improve the pre-training efficiency.
EMAE consistently obtains state-of-the-art transfer ability on a variety of downstream tasks, such as image classification, object detection, and semantic segmentation.
arXiv Detail & Related papers (2023-02-28T09:21:12Z)
- A Unified View of Masked Image Modeling [117.79456335844439]
Masked image modeling has demonstrated great potential to eliminate the label-hungry problem of training large-scale vision Transformers.
We introduce a simple yet effective method, termed as MaskDistill, which reconstructs normalized semantic features from teacher models at the masked positions.
Experimental results on image classification and semantic segmentation show that MaskDistill achieves performance comparable or superior to state-of-the-art methods.
arXiv Detail & Related papers (2022-10-19T14:59:18Z)
- TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers [8.099977107670917]
TokenMixup is an efficient attention-guided token-level data augmentation method.
A variant of TokenMixup mixes tokens within a single instance, thereby enabling multi-scale feature augmentation.
Experiments show that our methods significantly improve the baseline models' performance on CIFAR and ImageNet-1K.
arXiv Detail & Related papers (2022-10-14T06:36:31Z)
- MaskGIT: Masked Generative Image Transformer [49.074967597485475]
MaskGIT learns to predict randomly masked tokens by attending to tokens in all directions.
Experiments demonstrate that MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset.
arXiv Detail & Related papers (2022-02-08T23:54:06Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling these two designs enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
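The last entry above summarizes the MAE recipe that MixMAE builds on: mask a large random fraction of patches, feed only the visible ones to the encoder, and reconstruct the missing pixels. Below is a minimal sketch of that random-masking step, written to match the description in the summary rather than the official facebookresearch/mae code; the function name and the commonly cited 75% mask ratio are assumptions here.

```python
# Minimal sketch of MAE-style random patch masking (illustrative only):
# keep a small random subset of patches, encode only those, and score the
# reconstruction on the patches that were dropped.
import torch

def random_masking(patches, mask_ratio=0.75):
    """patches: (B, N, D) patchified image.
    Returns the visible patches, ids to restore the original order,
    and a 0/1 mask in original order (1 = masked)."""
    B, N, D = patches.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patches.device)
    ids_shuffle = noise.argsort(dim=1)        # random permutation per sample
    ids_restore = ids_shuffle.argsort(dim=1)  # inverse permutation
    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=patches.device)
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)  # back to original patch order
    return visible, ids_restore, mask
```

MixMAE keeps this loss-on-masked-positions formulation but fills the dropped positions with real tokens from a second image instead of discarding them, which is what the dual reconstruction sketched earlier exploits.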
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.