Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision
Transformers with Locality
- URL: http://arxiv.org/abs/2205.10063v1
- Date: Fri, 20 May 2022 10:16:30 GMT
- Title: Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision
Transformers with Locality
- Authors: Xiang Li, Wenhai Wang, Lingfeng Yang, Jian Yang
- Abstract summary: Masked AutoEncoder (MAE) has led recent trends in visual self-supervised learning with an elegant asymmetric encoder-decoder design.
We propose Uniform Masking (UM) to enable MAE pre-training for Pyramid-based ViTs with locality.
- Score: 28.245387355693545
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Masked AutoEncoder (MAE) has recently led the trend in visual
self-supervised learning with an elegant asymmetric encoder-decoder design,
which significantly improves both pre-training efficiency and fine-tuning
accuracy. Notably, the success of the asymmetric structure relies on the
"global" property of the Vanilla Vision Transformer (ViT), whose
self-attention mechanism reasons over an arbitrary subset of discrete image
patches. However, it is still unclear how the advanced Pyramid-based ViTs
(e.g., PVT, Swin) can be adopted in MAE pre-training, as they commonly
introduce operators within "local" windows, making it difficult to handle the
random sequence of partial vision tokens. In this paper, we propose Uniform
Masking (UM), which successfully enables MAE pre-training for Pyramid-based
ViTs with locality (termed "UM-MAE" for short). Specifically, UM includes a
Uniform Sampling (US) step that strictly samples $1$ random patch from each
$2 \times 2$ grid, and a Secondary Masking (SM) step that randomly masks a
portion (usually $25\%$) of the already sampled regions as learnable tokens.
US preserves an equivalent number of elements across multiple non-overlapping
local windows, providing smooth support for popular Pyramid-based ViTs, while
SM is designed to yield more transferable visual representations, since US
alone reduces the difficulty of the pixel-recovery pretext task and thereby
hinders semantic learning. We demonstrate that UM-MAE significantly improves
the pre-training efficiency of Pyramid-based ViTs (e.g., it speeds up
pre-training and reduces GPU memory consumption by $\sim 2\times$) while
maintaining competitive fine-tuning performance across downstream tasks. For
example, using the HTC++ detector, a Swin-Large backbone self-supervised with
UM-MAE on ImageNet-1K alone can even outperform its counterpart supervised on
ImageNet-22K. The code is available at https://github.com/implus/UM-MAE.
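Based on the abstract alone (the authors' official implementation lives in the linked GitHub repository), the following is a minimal PyTorch sketch of what the two masking stages could look like. The function and argument names (uniform_masking, secondary_ratio, mask_token) are illustrative assumptions, not taken from the released code.

```python
import torch

def uniform_masking(patches, h, w, secondary_ratio=0.25, mask_token=None):
    """Minimal sketch of Uniform Masking (UM) = Uniform Sampling (US) + Secondary Masking (SM).

    patches: (B, H*W, C) patch embeddings laid out row-major on an H x W grid.
    Returns a compact (B, (H//2)*(W//2), C) sequence that keeps the relative
    spatial layout, so window-based (Pyramid ViT) operators still apply.
    """
    B, N, C = patches.shape
    assert N == h * w and h % 2 == 0 and w % 2 == 0

    # Regroup the row-major sequence into non-overlapping 2x2 cells.
    grid = patches.view(B, h // 2, 2, w // 2, 2, C)
    grid = grid.permute(0, 1, 3, 2, 4, 5).reshape(B, -1, 4, C)   # (B, num_cells, 4, C)

    # Uniform Sampling (US): keep exactly one random patch per 2x2 cell (25% kept overall).
    idx = torch.randint(0, 4, (B, grid.shape[1], 1, 1), device=patches.device)
    sampled = torch.gather(grid, 2, idx.expand(-1, -1, -1, C)).squeeze(2)  # (B, num_cells, C)

    # Secondary Masking (SM): replace a further ~25% of the sampled patches with a
    # shared learnable mask token; they stay in the sequence as learnable tokens.
    if mask_token is not None:  # mask_token: learnable tensor of shape (C,); illustrative
        sm = torch.rand(B, sampled.shape[1], 1, device=patches.device) < secondary_ratio
        sampled = torch.where(sm, mask_token.to(sampled.dtype), sampled)

    return sampled  # compact (H//2) x (W//2) grid, row-major


# Illustrative usage: a 14x14 grid of 768-dim patch embeddings for a batch of 2 images.
x = torch.randn(2, 14 * 14, 768)
mask_token = torch.nn.Parameter(torch.zeros(768))
out = uniform_masking(x, h=14, w=14, secondary_ratio=0.25, mask_token=mask_token)
print(out.shape)  # torch.Size([2, 49, 768])
```

Because exactly one patch survives in every 2x2 cell, the kept tokens form a regular (H/2) x (W/2) grid, which is why window-based operators in PVT/Swin can still run on the compact sequence; the secondary mask tokens then restore some reconstruction difficulty, as the abstract describes.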
Related papers
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
- Toward a Deeper Understanding: RetNet Viewed through Convolution [25.8904146140577]
Vision Transformer (ViT) can learn global dependencies better than a CNN, yet a CNN's inherent locality prior can reduce the heavy training cost that ViT incurs.
This paper investigates the effectiveness of RetNet from a CNN perspective and presents a variant of RetNet tailored to the visual domain.
We propose a novel Gaussian mixture mask (GMM), in which a single mask has only two learnable parameters, and it can be conveniently used in any ViT variant whose attention mechanism allows the use of masks.
arXiv Detail & Related papers (2023-09-11T10:54:22Z)
- ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer [6.473688838974095]
We propose a new type of multiplication-reduced model, dubbed ShiftAddViT, to achieve end-to-end inference speedups on GPUs.
Experiments on various 2D/3D vision tasks consistently validate the effectiveness of our proposed ShiftAddViT.
arXiv Detail & Related papers (2023-06-10T13:53:41Z)
- AdPE: Adversarial Positional Embeddings for Pretraining Vision Transformers via MAE+ [44.856035786948915]
We propose an Adversarial Positional Embedding (AdPE) approach to pretrain vision transformers.
AdPE distorts the local visual structures by perturbing the position encodings.
Experiments demonstrate that our approach can improve the fine-tuning accuracy of MAE.
arXiv Detail & Related papers (2023-03-14T02:42:01Z)
- Rethinking Hierarchicies in Pre-trained Plain Vision Transformer [76.35955924137986]
Self-supervised pre-training vision transformer (ViT) via masked image modeling (MIM) has been proven very effective.
Customized algorithms (e.g., GreenMIM) have to be carefully designed for hierarchical ViTs, instead of using the vanilla and simple MAE that suffices for the plain ViT.
This paper proposes a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training.
arXiv Detail & Related papers (2022-11-03T13:19:23Z)
- SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners [20.846232536796578]
Self-supervised Masked Autoencoders (MAE) have attracted unprecedented attention for their impressive representation learning ability.
This paper extends MAE to a fully supervised setting by adding a supervised classification branch.
The proposed Supervised MAE (SupMAE) only exploits a visible subset of image patches for classification, unlike the standard supervised pre-training where all image patches are used.
arXiv Detail & Related papers (2022-05-28T23:05:03Z)
- Green Hierarchical Vision Transformer for Masked Image Modeling [54.14989750044489]
We present an efficient approach for Masked Image Modeling with hierarchical Vision Transformers (ViTs).
We design a Group Window Attention scheme following the Divide-and-Conquer strategy.
We further improve the grouping strategy via the Dynamic Programming algorithm to minimize the overall cost of the attention on the grouped patches.
arXiv Detail & Related papers (2022-05-26T17:34:42Z)
- ConvMAE: Masked Convolution Meets Masked Autoencoders [65.15953258300958]
Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unleash the potential of ViT.
Our ConvMAE framework demonstrates that a multi-scale hybrid convolution-transformer can learn more discriminative representations via the masked auto-encoding scheme.
Based on our pretrained ConvMAE models, ConvMAE-Base improves ImageNet-1K finetuning accuracy by 1.4% compared with MAE-Base.
arXiv Detail & Related papers (2022-05-08T15:12:19Z)
- Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers [122.01591448013977]
Masked image modeling (MIM) has demonstrated promising results on downstream tasks.
In this paper, we investigate whether there exist other effective ways to "learn by recovering missing contents".
We summarize a few design principles for token-based pre-training of vision transformers.
This design achieves superior performance over MIM in a series of downstream recognition tasks without extra computational cost.
arXiv Detail & Related papers (2022-03-27T14:23:29Z)
- Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs.
Our SPViT can trim 52.0% FLOPs for DeiT-B and get an impressive 0.6% top-1 accuracy gain simultaneously.
arXiv Detail & Related papers (2021-11-23T11:35:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.