AdPE: Adversarial Positional Embeddings for Pretraining Vision
Transformers via MAE+
- URL: http://arxiv.org/abs/2303.07598v1
- Date: Tue, 14 Mar 2023 02:42:01 GMT
- Title: AdPE: Adversarial Positional Embeddings for Pretraining Vision
Transformers via MAE+
- Authors: Xiao Wang, Ying Wang, Ziwei Xuan, Guo-Jun Qi
- Abstract summary: We propose an Adversarial Positional Embedding (AdPE) approach to pretrain vision transformers.
AdPE distorts the local visual structures by perturbing the position encodings.
Experiments demonstrate that our approach can improve the fine-tuning accuracy of MAE.
- Score: 44.856035786948915
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unsupervised learning of vision transformers seeks to pretrain an encoder via
pretext tasks without labels. Among them is Masked Image Modeling (MIM), which
is aligned with the pretraining of language transformers by predicting masked patches
as a pretext task. A criterion in unsupervised pretraining is that the pretext task
needs to be sufficiently hard to prevent the transformer encoder from learning
trivial low-level features that do not generalize well to downstream tasks. For this
purpose, we propose an Adversarial Positional Embedding (AdPE) approach -- it
distorts the local visual structures by perturbing the position encodings so
that the learned transformer cannot simply use the locally correlated patches
to predict the missing ones. We hypothesize that it forces the transformer
encoder to learn more discriminative features in a global context with stronger
generalizability to downstream tasks. We will consider both absolute and
relative positional encodings, where adversarial positions can be imposed both
in the embedding mode and the coordinate mode. We will also present a new MAE+
baseline that brings the performance of the MIM pretraining to a new level with
the AdPE. The experiments demonstrate that our approach can improve the
fine-tuning accuracy of MAE by $0.8\%$ and $0.4\%$ over 1600 epochs of
pretraining ViT-B and ViT-L on ImageNet-1K. For the transfer learning task, it
outperforms the MAE with the ViT-B backbone by $2.6\%$ in mIoU on ADE20K, and
by $3.2\%$ in AP$^{bbox}$ and $1.6\%$ in AP$^{mask}$ on COCO, respectively.
These results are obtained with the AdPE being a pure MIM approach that does
not use any extra models or external datasets for pretraining. The code is
available at https://github.com/maple-research-lab/AdPE.
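As a concrete illustration of the embedding-mode min-max objective described in the abstract, the following is a minimal, hypothetical PyTorch sketch: an inner step perturbs the positional embeddings with a bounded, signed-gradient update that maximizes the masked-reconstruction loss, and an outer step updates the model under the perturbed positions. The ToyMAE module, its interface, and all hyperparameters (eps, adv_lr, the 75% masking ratio) are illustrative stand-ins, not the released AdPE implementation (see the GitHub link above for the official code); the coordinate-mode variant and relative positional encodings are not shown.

```python
# Minimal, illustrative sketch (not the authors' code) of embedding-mode
# adversarial positional embeddings: maximize the masked-reconstruction loss
# w.r.t. a bounded perturbation of the position encodings, then minimize it
# w.r.t. the model weights. ToyMAE is a stand-in for a real MAE encoder/decoder
# and exists only to make the sketch self-contained.
import torch
import torch.nn as nn


class ToyMAE(nn.Module):
    """Stand-in masked autoencoder over 16 patch embeddings of dimension 32."""

    def __init__(self, num_patches=16, dim=32):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(dim, dim)  # predicts the masked patch embeddings

    def forward(self, patches, pos_embed, mask):
        # Zero out masked patches, add the (possibly perturbed) position encodings.
        x = patches.masked_fill(mask.unsqueeze(-1), 0.0) + pos_embed
        pred = self.head(self.encoder(x))
        # Reconstruction loss only on the masked positions.
        return ((pred - patches) ** 2)[mask].mean()


def adpe_step(model, patches, mask, optimizer, eps=0.1, adv_lr=0.05):
    # Inner step: one signed-gradient ascent update on a bounded perturbation
    # of the positional embeddings (increases the reconstruction loss).
    delta = torch.zeros_like(model.pos_embed, requires_grad=True)
    adv_loss = model(patches, model.pos_embed + delta, mask)
    grad = torch.autograd.grad(adv_loss, delta)[0]
    delta = (adv_lr * grad.sign()).clamp(-eps, eps).detach()

    # Outer step: update the model weights under the perturbed positions.
    optimizer.zero_grad()
    loss = model(patches, model.pos_embed + delta, mask)
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    torch.manual_seed(0)
    model = ToyMAE()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    patches = torch.randn(4, 16, 32)       # fake patch embeddings
    mask = torch.rand(4, 16) < 0.75        # 75% masking ratio, as in MAE
    print(adpe_step(model, patches, mask, opt))
```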
Related papers
- Rethinking Hierarchicies in Pre-trained Plain Vision Transformer [76.35955924137986]
Self-supervised pre-training of vision transformers (ViT) via masked image modeling (MIM) has been proven very effective.
Customized algorithms (e.g., GreenMIM) have to be carefully designed for hierarchical ViTs, instead of simply reusing the vanilla MAE intended for the plain ViT.
This paper proposes a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training.
arXiv Detail & Related papers (2022-11-03T13:19:23Z)
- Position Prediction as an Effective Pretraining Strategy [20.925906203643883]
We propose a novel but surprisingly simple alternative to content reconstruction -- predicting patch locations from content, without providing positional information to the model (a toy sketch of this pretext task appears after this list).
Our approach brings improvements over strong supervised training baselines and is comparable to modern unsupervised/self-supervised pretraining methods.
arXiv Detail & Related papers (2022-07-15T17:10:48Z)
- Bootstrapped Masked Autoencoders for Vision BERT Pretraining [142.5285802605117]
BootMAE improves the original masked autoencoders (MAE) with two core designs.
1) a momentum encoder that provides online features as extra BERT prediction targets; 2) a target-aware decoder that reduces the pressure on the encoder to memorize target-specific information during BERT pretraining.
arXiv Detail & Related papers (2022-07-14T17:59:58Z)
- Towards Understanding Why Mask-Reconstruction Pretraining Helps in Downstream Tasks [129.1080795985234]
Mask-reconstruction pretraining (MRP) approaches randomly mask input patches and then reconstruct pixels or semantic features of these masked patches via an auto-encoder.
On downstream tasks, supervised fine-tuning of the pretrained encoder remarkably surpasses conventional supervised learning (SL) trained from scratch.
arXiv Detail & Related papers (2022-06-08T11:49:26Z)
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in MIM.
When running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9$\times$ speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
- Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality [28.245387355693545]
Masked AutoEncoder (MAE) has led the trend in visual self-supervision with its elegant asymmetric encoder-decoder design.
We propose Uniform Masking (UM) to enable MAE pre-training for Pyramid-based ViTs with locality.
arXiv Detail & Related papers (2022-05-20T10:16:30Z)
- PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers [102.7922200135147]
This paper explores a better codebook for BERT pre-training of vision transformers.
By contrast, the discrete tokens in the NLP field are naturally highly semantic.
We demonstrate that the visual tokens generated by the proposed perceptual codebook do exhibit better semantic meanings.
arXiv Detail & Related papers (2021-11-24T18:59:58Z)
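The position-prediction pretext task mentioned in the related-papers list above can be illustrated with a similarly hypothetical sketch: patch embeddings are fed to an encoder without any positional embedding, and a linear head is trained with cross-entropy to recover each patch's grid index from content alone. All module names and sizes here are illustrative assumptions, not the code of that paper.

```python
# Rough, hypothetical sketch of a position-prediction pretext task (in the
# spirit of the "Position Prediction" entry above, not its official code):
# no positional embedding is added, so the encoder must infer each patch's
# location from appearance and context.
import torch
import torch.nn as nn
import torch.nn.functional as F


def position_prediction_loss(encoder, head, patches):
    """patches: (batch, num_patches, dim) patch embeddings in raster order."""
    batch, num_patches, _ = patches.shape
    logits = head(encoder(patches))                 # (B, N, N) position logits
    target = torch.arange(num_patches).expand(batch, -1)
    return F.cross_entropy(logits.reshape(-1, num_patches), target.reshape(-1))


if __name__ == "__main__":
    dim, num_patches = 32, 16
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
        num_layers=2,
    )
    head = nn.Linear(dim, num_patches)              # one logit per grid cell
    patches = torch.randn(8, num_patches, dim)
    print(position_prediction_loss(encoder, head, patches))
```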