How Transformers Learn Diverse Attention Correlations in Masked Vision Pretraining
- URL: http://arxiv.org/abs/2403.02233v2
- Date: Wed, 5 Jun 2024 00:22:56 GMT
- Title: How Transformers Learn Diverse Attention Correlations in Masked Vision Pretraining
- Authors: Yu Huang, Zixin Wen, Yuejie Chi, Yingbin Liang
- Abstract summary: We provide the first end-to-end theoretical guarantee of learning one-layer transformers in masked reconstruction self-supervised pretraining.
On the conceptual side, we posit a mechanism by which transformers trained with masked vision pretraining objectives produce the empirically observed local and diverse attention patterns.
On the technical side, our end-to-end characterization of training dynamics in softmax-attention models simultaneously accounts for input and position embeddings.
- Score: 66.08606211686339
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked reconstruction, which predicts randomly masked patches from unmasked ones, has emerged as an important approach in self-supervised pretraining. However, the theoretical understanding of masked pretraining is rather limited, especially for the foundational architecture of transformers. In this paper, to the best of our knowledge, we provide the first end-to-end theoretical guarantee of learning one-layer transformers in masked reconstruction self-supervised pretraining. On the conceptual side, we posit a mechanism by which transformers trained with masked vision pretraining objectives produce the empirically observed local and diverse attention patterns, on data distributions with spatial structures that highlight feature-position correlations. On the technical side, our end-to-end characterization of the training dynamics of softmax-attention models simultaneously accounts for input and position embeddings; it is developed through a careful analysis that tracks the interplay between feature-wise and position-wise attention correlations.
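To make the setting concrete, the sketch below shows the kind of model the analysis concerns: a single softmax-attention layer with both input (patch) and position embeddings, trained to reconstruct randomly masked patches from the unmasked ones. This is an illustrative minimal implementation, not the paper's exact construction; all names and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class OneLayerMaskedReconstructor(nn.Module):
    """One softmax-attention layer with input and position embeddings (illustrative)."""

    def __init__(self, num_patches=16, patch_dim=48, d_model=64):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)                        # input embedding
        self.pos = nn.Parameter(0.02 * torch.randn(num_patches, d_model)) # position embedding
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_model, d_model, bias=False)
        self.v = nn.Linear(d_model, d_model, bias=False)
        self.decode = nn.Linear(d_model, patch_dim)                       # map back to pixel space

    def forward(self, patches, mask):
        # patches: (B, N, patch_dim); mask: (B, N) bool, True = patch is hidden
        x = self.embed(patches * (~mask).unsqueeze(-1).float())           # zero out masked content
        x = x + self.pos                                                  # positions survive masking
        scores = self.q(x) @ self.k(x).transpose(-2, -1) / x.shape[-1] ** 0.5
        attn = torch.softmax(scores, dim=-1)
        return self.decode(attn @ self.v(x))                              # (B, N, patch_dim)

model = OneLayerMaskedReconstructor()
patches = torch.randn(8, 16, 48)
mask = torch.rand(8, 16) < 0.75                                           # mask ~75% of patches
recon = model(patches, mask)
loss = ((recon - patches)[mask] ** 2).mean()                              # loss on masked patches only
loss.backward()
```

Note that the position embeddings are added even at masked locations, so the attention scores mix feature-driven and position-driven terms; this is precisely the interplay between feature-wise and position-wise attention correlations that the abstract highlights.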
Related papers
- CorrMAE: Pre-training Correspondence Transformers with Masked Autoencoder [44.94921073819524]
We propose a pre-training method to acquire a generic inliers-consistent representation by reconstructing masked correspondences.
In practice, we introduce CorrMAE, an extension of the masked autoencoder framework tailored to pre-training for correspondence pruning.
arXiv Detail & Related papers (2024-06-09T13:14:00Z)
- On the Generalization Ability of Unsupervised Pretraining [53.06175754026037]
Recent advances in unsupervised learning have shown that unsupervised pre-training, followed by fine-tuning, can improve model generalization.
This paper introduces a novel theoretical framework that illuminates the critical factor influencing the transferability of knowledge acquired during unsupervised pre-training to the subsequent fine-tuning phase.
Our results contribute to a better understanding of the unsupervised pre-training and fine-tuning paradigm, and can shed light on the design of more effective pre-training algorithms.
arXiv Detail & Related papers (2024-03-11T16:23:42Z)
- In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
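As a rough illustration of this kind of setting (a toy sketch under assumed data, not the paper's exact model), one can train a single softmax-attention layer by plain gradient descent on an in-context regression task:

```python
import torch

# Toy in-context task (illustrative only): a prompt holds pairs (x_i, y_i)
# with y_i = <w, x_i> for a fresh task vector w, plus a query x_q; a single
# softmax-attention layer is trained by gradient descent to output <w, x_q>.
d, n_ctx, batch = 8, 32, 16
QK = torch.zeros(d + 1, d + 1, requires_grad=True)     # merged query-key matrix
opt = torch.optim.SGD([QK], lr=0.5)

for step in range(2000):
    w = torch.randn(batch, d)                          # fresh tasks each step
    X = torch.randn(batch, n_ctx, d)
    y = torch.einsum('bd,bnd->bn', w, X)
    ctx = torch.cat([X, y.unsqueeze(-1)], dim=-1)      # context tokens (x_i, y_i)
    xq = torch.randn(batch, d)
    q = torch.cat([xq, torch.zeros(batch, 1)], dim=-1) # query token (x_q, 0)
    scores = (q @ QK).unsqueeze(1) @ ctx.transpose(1, 2)
    attn = torch.softmax(scores, dim=-1)               # (batch, 1, n_ctx)
    pred = (attn @ ctx)[:, 0, -1]                      # attention-weighted y values
    loss = ((pred - torch.einsum('bd,bd->b', w, xq)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```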
arXiv Detail & Related papers (2023-10-08T17:55:33Z)
- Forecast-MAE: Self-supervised Pre-training for Motion Forecasting with Masked Autoencoders [7.133110402648305]
This study explores the application of self-supervised learning to the task of motion forecasting.
Forecast-MAE is an extension of the masked autoencoder framework specifically designed for self-supervised learning of the motion forecasting task.
arXiv Detail & Related papers (2023-08-19T02:27:51Z)
- ExpPoint-MAE: Better interpretability and performance for self-supervised point cloud transformers [7.725095281624494]
We evaluate the effectiveness of Masked Autoencoding as a pretraining scheme, and explore Momentum Contrast as an alternative.
We observe that the transformer learns to attend to semantically meaningful regions, indicating that pretraining leads to a better understanding of the underlying geometry.
arXiv Detail & Related papers (2023-06-19T09:38:21Z)
- Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT).
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
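A loose sketch of permuted prediction over patch tokens follows; the visibility-mask construction is an assumption in the spirit of permutation language modeling, not necessarily MaPeT's exact objective.

```python
import torch

def permutation_visibility(tokens):
    """Sample a random factorization order over patch tokens and build the
    visibility mask: token i may attend to token j iff j precedes i in the
    sampled order (permutation-language-modeling style)."""
    B, N = tokens.shape
    order = torch.argsort(torch.rand(B, N), dim=1)   # a random permutation per sample
    rank = torch.argsort(order, dim=1)               # each token's place in that order
    visible = rank.unsqueeze(2) > rank.unsqueeze(1)  # (B, N, N) boolean attention mask
    return visible

tokens = torch.randint(0, 8192, (4, 196))            # e.g. a 14x14 grid of patch token ids
visible = permutation_visibility(tokens)             # feed as the predictor's attention mask
```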
arXiv Detail & Related papers (2023-06-12T18:12:19Z)
- Position Prediction as an Effective Pretraining Strategy [20.925906203643883]
We propose a novel but surprisingly simple alternative to content reconstruction: predicting locations from content, without providing positional information.
Our approach brings improvements over strong supervised training baselines and is comparable to modern unsupervised/self-supervised pretraining methods.
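A minimal sketch of the idea, with illustrative names and sizes: patch contents are encoded without any positional information, and a head classifies each patch's original location.

```python
import torch
import torch.nn as nn

# No positional encoding goes in; the model must infer each patch's location
# from content alone, as an N-way classification per patch (sizes are illustrative).
num_patches, patch_dim, d_model = 196, 768, 256
embed = nn.Linear(patch_dim, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(d_model, num_patches)                 # logits over the N possible positions

patches = torch.randn(8, num_patches, patch_dim)       # patch contents only
logits = head(encoder(embed(patches)))                 # (B, N, num_patches)
targets = torch.arange(num_patches).expand(8, -1)      # each patch's true location index
loss = nn.functional.cross_entropy(
    logits.reshape(-1, num_patches), targets.reshape(-1))
```

Since a transformer without position embeddings is permutation-equivariant, the input order carries no signal, which is what makes the task non-trivial.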
arXiv Detail & Related papers (2022-07-15T17:10:48Z)
- Spatial Entropy Regularization for Vision Transformers [71.44392961125807]
Vision Transformers (VTs) can contain a semantic segmentation structure which does not spontaneously emerge when training is supervised.
We propose a VT regularization method based on a spatial formulation of the information entropy.
We show that the proposed regularization approach is beneficial with different training scenarios, datasets, downstream tasks and VT architectures.
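As a hedged approximation, a minimal entropy penalty on attention rows looks as follows; the paper's spatial entropy additionally accounts for the 2-D layout of the patches, which this sketch omits.

```python
import torch

def attention_entropy(attn, eps=1e-8):
    """Mean Shannon entropy of the attention rows.
    attn: (B, heads, N_query, N_key), each row sums to 1."""
    return -(attn * (attn + eps).log()).sum(dim=-1).mean()

# Adding lambda * attention_entropy(attn) to the training loss pushes each
# query's attention mass onto a few patches.
attn = torch.softmax(torch.randn(2, 4, 196, 196), dim=-1)
reg_loss = attention_entropy(attn)
```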
arXiv Detail & Related papers (2022-06-09T17:34:39Z)
- RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training [80.44284270879028]
This paper incorporates local feature learning into self-supervised vision transformers via Reconstructive Pre-training (RePre).
Our RePre extends contrastive frameworks by adding a branch for reconstructing raw image pixels in parallel with the existing contrastive objective.
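Schematically (the heads and losses below are placeholders, not RePre's actual architecture, which decodes from multi-level features), the objective combines a contrastive term on pooled features with a pixel-reconstruction term from a parallel branch:

```python
import torch
import torch.nn as nn

class RePreStyle(nn.Module):
    """Contrastive branch plus a parallel pixel-reconstruction branch (schematic)."""

    def __init__(self, encoder, d_model=256, proj_dim=128, patch_dim=768):
        super().__init__()
        self.encoder = encoder                        # any ViT-style backbone
        self.proj = nn.Linear(d_model, proj_dim)      # head for the contrastive term
        self.decoder = nn.Linear(d_model, patch_dim)  # head reconstructing raw pixels

    def forward(self, x):
        feats = self.encoder(x)                       # (B, N, d_model) patch features
        z = self.proj(feats.mean(dim=1))              # pooled feature for the contrastive loss
        recon = self.decoder(feats)                   # per-patch pixel reconstruction
        return z, recon

# total_loss = contrastive_loss(z1, z2) + lambda_rec * mse_loss(recon, raw_patches)
```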
arXiv Detail & Related papers (2022-01-18T10:24:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.