Beyond Masking: Demystifying Token-Based Pre-Training for Vision
Transformers
- URL: http://arxiv.org/abs/2203.14313v1
- Date: Sun, 27 Mar 2022 14:23:29 GMT
- Title: Beyond Masking: Demystifying Token-Based Pre-Training for Vision
Transformers
- Authors: Yunjie Tian and Lingxi Xie and Jiemin Fang and Mengnan Shi and Junran
Peng and Xiaopeng Zhang and Jianbin Jiao and Qi Tian and Qixiang Ye
- Abstract summary: Masked image modeling (MIM) has demonstrated promising results on downstream tasks.
In this paper, we investigate whether there exist other effective ways to `learn by recovering missing contents'.
We summarize a few design principles for token-based pre-training of vision transformers.
This design achieves superior performance over MIM in a series of downstream recognition tasks without extra computational cost.
- Score: 122.01591448013977
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The past year has witnessed a rapid development of masked image modeling
(MIM). MIM is mostly built upon vision transformers, and suggests that
self-supervised visual representations can be learned by masking parts of the
input image while requiring the target model to recover the missing contents.
MIM has
demonstrated promising results on downstream tasks, yet we are interested in
whether there exist other effective ways to `learn by recovering missing
contents'. In this paper, we investigate this topic by designing five other
learning objectives that follow the same procedure as MIM but degrade the input
image in different ways. With extensive experiments, we manage to summarize a
few design principles for token-based pre-training of vision transformers. In
particular, the best practice is obtained by keeping the original image style
and enriching spatial masking with spatial misalignment -- this design achieves
superior performance over MIM in a series of downstream recognition tasks
without extra computational cost. The code is available at
https://github.com/sunsmarterjie/beyond_masking.
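To make the `degrade and recover' recipe concrete, here is a minimal, hypothetical sketch contrasting MIM-style spatial masking with a spatial-misalignment corruption that keeps the original image style. The function names and the permutation strategy are illustrative assumptions, not the paper's exact objectives.

```python
import torch

def mask_patches(tokens: torch.Tensor, ratio: float = 0.75) -> torch.Tensor:
    """MIM-style degradation: zero out a random subset of patch tokens.
    tokens: (B, N, D) patch embeddings."""
    B, N, _ = tokens.shape
    keep = (torch.rand(B, N, device=tokens.device) > ratio).float()
    return tokens * keep.unsqueeze(-1)

def misalign_patches(tokens: torch.Tensor) -> torch.Tensor:
    """Spatial misalignment: keep every patch's content (and thus the image
    style) but permute patch positions, so tokens no longer agree with the
    positional embeddings."""
    B, N, D = tokens.shape
    perm = torch.argsort(torch.rand(B, N, device=tokens.device), dim=1)
    return torch.gather(tokens, 1, perm.unsqueeze(-1).expand(B, N, D))

# In both cases the target is the clean sequence:
#   loss = reconstruction_loss(model(degrade(tokens)), tokens)
```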
Related papers
- Membership Inference Attack Against Masked Image Modeling [29.699606401861818]
Masked Image Modeling (MIM) has achieved significant success in the realm of self-supervised learning (SSL) for visual recognition.
In this work, we take a different angle by studying the pre-training data privacy of MIM.
We propose the first membership inference attack against image encoders pre-trained by MIM.
arXiv Detail & Related papers (2024-08-13T11:34:28Z)
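As a hedged illustration of the attack summarized above: membership inference against a pre-trained encoder is commonly scored by reconstruction quality, on the intuition that training members are recovered more faithfully. The loss-threshold formulation below is an assumed, generic variant, not necessarily the paper's exact attack.

```python
import torch

def membership_score(model, tokens: torch.Tensor, mask_fn) -> float:
    """Lower reconstruction error on a masked input suggests the sample
    was seen during pre-training (a classic loss-threshold signal)."""
    with torch.no_grad():
        recon = model(mask_fn(tokens))
        return torch.mean((recon - tokens) ** 2).item()

def is_member(score: float, threshold: float) -> bool:
    # Threshold would be calibrated on known non-members (assumed setup).
    return score < threshold
```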
- CL-MAE: Curriculum-Learned Masked Autoencoders [49.24994655813455]
We propose a curriculum learning approach that updates the masking strategy to continually increase the complexity of the self-supervised reconstruction task.
We train our Curriculum-Learned Masked Autoencoder (CL-MAE) on ImageNet and show that it exhibits superior representation learning capabilities compared to MAE.
arXiv Detail & Related papers (2023-08-31T09:13:30Z)
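A minimal sketch of the curriculum idea, simplified to an annealed masking ratio; CL-MAE itself learns the masking strategy rather than merely scheduling a scalar.

```python
def masking_ratio(epoch: int, total_epochs: int,
                  start: float = 0.25, end: float = 0.90) -> float:
    """Linear curriculum: begin with an easy reconstruction task (few
    patches hidden) and end with a hard one."""
    t = min(max(epoch / max(total_epochs - 1, 1), 0.0), 1.0)
    return start + t * (end - start)

# masking_ratio(0, 100) -> 0.25; masking_ratio(99, 100) -> 0.90
```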
- Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT).
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
arXiv Detail & Related papers (2023-06-12T18:12:19Z)
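A generic sketch of permuted autoregressive prediction as summarized above: each patch may attend only to patches earlier in a randomly sampled order. Names are assumptions; MaPeT's full design is in the paper.

```python
import torch

def permuted_ar_mask(n: int, perm: torch.Tensor) -> torch.Tensor:
    """Attention mask for autoregressive prediction in a random patch order:
    query q may attend to key k iff k precedes q in the permutation
    (perm[r] = index of the patch predicted at step r)."""
    rank = torch.empty(n, dtype=torch.long)
    rank[perm] = torch.arange(n)
    return rank.unsqueeze(1) > rank.unsqueeze(0)  # (n, n) bool, True = attend

perm = torch.randperm(16)            # random prediction order over 16 patches
attend = permuted_ar_mask(16, perm)
```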
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
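A minimal sketch of a Gumbel-Softmax mask generator in the spirit of AutoMAE; module and parameter names are assumptions, and the adversarial loop that interlinks it with the image-modeling branch is omitted.

```python
import torch
import torch.nn.functional as F

class MaskGenerator(torch.nn.Module):
    """Predicts per-patch keep/mask logits; hard Gumbel-Softmax keeps the
    discrete sampling differentiable for adversarial training."""
    def __init__(self, dim: int):
        super().__init__()
        self.head = torch.nn.Linear(dim, 2)    # logits for (keep, mask)

    def forward(self, tokens: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        logits = self.head(tokens)                           # (B, N, 2)
        sample = F.gumbel_softmax(logits, tau=tau, hard=True)
        return sample[..., 1]                                # (B, N), 1 = masked
```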
- PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling [83.67628239775878]
Masked Image Modeling (MIM) has achieved promising progress with the advent of Masked Autoencoders (MAE) and BEiT.
This paper undertakes a fundamental analysis of MIM from the perspective of pixel reconstruction.
We propose a remarkably simple and effective method, PixMIM, that entails two strategies.
arXiv Detail & Related papers (2023-03-04T13:38:51Z)
- Masked Visual Reconstruction in Language Semantic Space [38.43966132249977]
We present the Masked visual Reconstruction In Language semantic Space (RILS) pre-training framework.
RILS transforms vision-only signals into patch-sentence probabilities as semantically meaningful MIM reconstruction targets.
Our method exhibits advanced transferability on downstream classification, detection, and segmentation.
arXiv Detail & Related papers (2023-01-17T15:32:59Z)
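A hedged sketch of mapping patch features to distributions over sentence embeddings, as the summary above describes; the embedding bank, temperature, and normalization are assumptions.

```python
import torch
import torch.nn.functional as F

def patch_sentence_targets(patch_feats: torch.Tensor,
                           sent_embeds: torch.Tensor,
                           tau: float = 0.07) -> torch.Tensor:
    """Soft MIM targets: a probability over K sentence embeddings per patch.
    patch_feats: (B, N, D); sent_embeds: (K, D) -> targets (B, N, K)."""
    p = F.normalize(patch_feats, dim=-1)
    s = F.normalize(sent_embeds, dim=-1)
    return F.softmax(p @ s.T / tau, dim=-1)

# Training then minimizes cross-entropy between the student's predictions
# for masked patches and these semantically meaningful targets.
```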
- Learning Explicit Object-Centric Representations with Vision Transformers [81.38804205212425]
We build on the self-supervision task of masked autoencoding and explore its effectiveness for learning object-centric representations with transformers.
We show that the model efficiently learns to decompose simple scenes as measured by segmentation metrics on several multi-object benchmarks.
arXiv Detail & Related papers (2022-10-25T16:39:49Z)
- CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion [20.121597331207276]
Masked Image Modeling (MIM) has recently been established as a potent pre-training paradigm.
In this paper we seek to learn representations that transfer well to a wide variety of 3D vision and lower-level geometric downstream tasks.
Our experiments show that our pretext task leads to significantly improved performance for monocular 3D vision downstream tasks.
arXiv Detail & Related papers (2022-10-19T16:50:36Z)
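A data-level sketch of cross-view completion: one view of a scene is heavily masked while the second view stays intact, so completing the first view requires reasoning across viewpoints. Shapes and the masking ratio are assumptions.

```python
import torch

def crossview_batch(view1: torch.Tensor, view2: torch.Tensor,
                    ratio: float = 0.9):
    """view1, view2: (B, N, D) patch tokens of the same scene from two
    viewpoints. Returns the visible part of view1, the full view2, the
    mask, and the reconstruction target (the clean view1)."""
    B, N, _ = view1.shape
    dropped = torch.rand(B, N, device=view1.device) < ratio
    visible1 = view1 * (~dropped).unsqueeze(-1).float()
    return visible1, view2, dropped, view1
```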
- Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN [38.87225202482656]
Masked image modeling, an emerging self-supervised pre-training method, has shown impressive success across numerous downstream vision tasks with Vision Transformers.
We propose an Architecture-Agnostic Masked Image Modeling framework (A$^2$MIM), which is compatible with both Transformers and CNNs in a unified way.
arXiv Detail & Related papers (2022-05-27T12:42:02Z)
- The Devil is in the Frequency: Geminated Gestalt Autoencoder for Self-Supervised Visual Pre-Training [13.087987450384036]
We present a new Masked Image Modeling (MIM) method, termed Geminated Gestalt Autoencoder (Ge$^2$-AE), for visual pre-training.
Specifically, we equip our model with geminated decoders in charge of reconstructing image contents from both pixel and frequency space.
arXiv Detail & Related papers (2022-04-18T09:22:55Z)
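A minimal sketch of the frequency-space half of the geminated design above: one decoder would be scored in pixel space, the other against the image's 2D spectrum. The transform and loss form are assumptions, not the paper's exact objective.

```python
import torch

def frequency_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Compare predictions and targets in frequency space via the 2D FFT.
    pred, target: (B, C, H, W) image tensors."""
    pred_f = torch.fft.rfft2(pred, norm="ortho")
    tgt_f = torch.fft.rfft2(target, norm="ortho")
    return (pred_f - tgt_f).abs().mean()

# Geminated objective (sketch): pixel decoder scored with MSE in pixel space,
# frequency decoder scored with frequency_loss on the same reconstruction.
```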
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.