Scaled ReLU Matters for Training Vision Transformers
- URL: http://arxiv.org/abs/2109.03810v1
- Date: Wed, 8 Sep 2021 17:57:58 GMT
- Title: Scaled ReLU Matters for Training Vision Transformers
- Authors: Pichao Wang and Xue Wang and Hao Luo and Jingkai Zhou and Zhipeng Zhou
and Fan Wang and Hao Li and Rong Jin
- Abstract summary: Vision transformers (ViTs) have been an alternative design paradigm to convolutional neural networks (CNNs).
However, the training of ViTs is much harder than that of CNNs, as it is sensitive to training parameters such as the learning rate, optimizer and warmup epochs.
We verify, both theoretically and empirically, that scaled ReLU in the \textit{conv-stem} not only improves training stability but also increases the diversity of patch tokens.
- Score: 45.41439457701873
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision transformers (ViTs) have been an alternative design paradigm to
convolutional neural networks (CNNs). However, the training of ViTs is much
harder than that of CNNs, as it is sensitive to training parameters such as the
learning rate, optimizer and warmup epochs. The reasons for this training difficulty
are empirically analysed in ~\cite{xiao2021early}, and the authors conjecture
that the issue lies with the \textit{patchify-stem} of ViT models and propose
that early convolutions help transformers see better. In this paper, we further
investigate this problem and extend the above conclusion: early convolutions
alone do not ensure stable training; rather, it is the scaled ReLU operation in
the \textit{convolutional stem} (\textit{conv-stem}) that matters. We verify, both
theoretically and empirically, that scaled ReLU in the \textit{conv-stem} not only
improves training stability but also increases the diversity of patch tokens,
thus boosting peak performance by a large margin while adding only a few
parameters and FLOPs. In addition, extensive experiments demonstrate that
previous ViTs are far from well trained, further showing that ViTs have great
potential to be a better substitute for CNNs.
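As a rough illustration of what "scaled ReLU in the conv-stem" looks like in practice, the sketch below builds a convolutional stem in which every convolution is followed by a normalization layer (supplying the scale) and a ReLU before the feature map is flattened into patch tokens. The depth, channel widths, and the choice of BatchNorm as the scaling are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative conv-stem whose activations follow the "scaled ReLU" pattern:
# a learnable scale (here BatchNorm) applied before each ReLU. Channel widths
# and depth are assumptions for this sketch, not the paper's exact stem.
import torch
import torch.nn as nn

class ScaledReLUConvStem(nn.Module):
    def __init__(self, in_chans=3, embed_dim=384):
        super().__init__()
        chans = [in_chans, 48, 96, 192, embed_dim]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [
                nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1, bias=False),
                nn.BatchNorm2d(c_out),  # supplies the scaling
                nn.ReLU(inplace=True),  # scaled ReLU = scale followed by ReLU
            ]
        self.stem = nn.Sequential(*layers)

    def forward(self, x):
        x = self.stem(x)                     # (B, embed_dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)  # patch tokens: (B, N, embed_dim)

tokens = ScaledReLUConvStem()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 384])
```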
Related papers
- Experts Weights Averaging: A New General Training Scheme for Vision
Transformers [57.62386892571636]
We propose a training scheme for Vision Transformers (ViTs) that achieves performance improvement without increasing inference cost.
During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs.
After training, we convert each MoE into an FFN by averaging the experts, transforming the model back into the original ViT for inference.
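The "average the experts" conversion step can be pictured with the short sketch below, which folds a set of identically shaped expert FFNs into a single FFN by taking the element-wise mean of their parameters. The function and module names are hypothetical stand-ins rather than the paper's actual code.

```python
# Hypothetical sketch of folding an MoE layer back into a single FFN after
# training by averaging expert weights; names are illustrative stand-ins.
import copy
import torch
import torch.nn as nn

def average_experts_into_ffn(experts: nn.ModuleList) -> nn.Module:
    """Return one FFN whose parameters are the element-wise mean of all experts.

    Assumes every expert shares the same architecture (e.g. Linear-GELU-Linear).
    """
    merged = copy.deepcopy(experts[0])
    with torch.no_grad():
        for name, param in merged.named_parameters():
            stacked = torch.stack([dict(e.named_parameters())[name] for e in experts])
            param.copy_(stacked.mean(dim=0))
    return merged

# Usage: two toy "experts" with identical shapes.
make_ffn = lambda: nn.Sequential(nn.Linear(384, 1536), nn.GELU(), nn.Linear(1536, 384))
ffn = average_experts_into_ffn(nn.ModuleList([make_ffn(), make_ffn()]))
print(ffn(torch.randn(1, 384)).shape)  # torch.Size([1, 384])
```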
arXiv Detail & Related papers (2023-08-11T12:05:12Z)
- ViT-P: Rethinking Data-efficient Vision Transformers from Locality [9.515925867530262]
We make vision transformers as data-efficient as convolutional neural networks by introducing multi-focal attention bias.
Inspired by the attention distance in a well-trained ViT, we constrain the self-attention of ViT to have multi-scale localized receptive field.
On CIFAR-100, our ViT-P Base model achieves state-of-the-art accuracy (83.16%) when trained from scratch.
arXiv Detail & Related papers (2022-03-04T14:49:48Z)
- Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training [29.20567759071523]
Vision Transformers (ViTs) are developing rapidly and starting to challenge the domination of convolutional neural networks (CNNs) in computer vision.
This paper introduces CNNs' inductive biases back into ViTs while preserving their network architectures, aiming for a higher performance upper bound.
Experiments on CIFAR-10/100 and ImageNet-1k with limited training data have shown encouraging results.
arXiv Detail & Related papers (2021-12-07T07:56:50Z)
- An Empirical Study of Training Self-Supervised Visual Transformers [70.27107708555185]
We study the effects of several fundamental components for training self-supervised Visual Transformers.
We reveal that apparently good results can in fact be partial failures, and that they can be improved when training is made more stable.
arXiv Detail & Related papers (2021-04-05T17:59:40Z)
- On the Adversarial Robustness of Visual Transformers [129.29523847765952]
This work provides the first and comprehensive study on the robustness of vision transformers (ViTs) against adversarial perturbations.
Tested on various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness than convolutional neural networks (CNNs).
arXiv Detail & Related papers (2021-03-29T14:48:24Z)
- DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity.
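The re-attention idea can be sketched roughly as a learnable head-to-head mixing of the softmax attention maps, as below; the normalization details and the class name are illustrative assumptions rather than DeepViT's exact implementation.

```python
# Rough sketch of re-attention-style head mixing: per-head attention maps are
# linearly recombined with a learnable head-to-head matrix theta before being
# applied to the values. Normalization and naming are assumptions, not
# DeepViT's exact implementation.
import torch
import torch.nn as nn

class ReAttentionMix(nn.Module):
    def __init__(self, num_heads: int):
        super().__init__()
        self.theta = nn.Parameter(torch.eye(num_heads))  # start as identity mixing

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # attn: (B, H, N, N) softmax-normalized attention maps
        mixed = torch.einsum("gh,bhij->bgij", self.theta, attn)
        return mixed / mixed.sum(dim=-1, keepdim=True).clamp(min=1e-6)  # re-normalize rows

attn = torch.softmax(torch.randn(2, 6, 197, 197), dim=-1)
print(ReAttentionMix(num_heads=6)(attn).shape)  # torch.Size([2, 6, 197, 197])
```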
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
- Understanding the Difficulty of Training Transformers [120.99980924577787]
We show that unbalanced gradients are not the root cause of the instability of training.
We propose Admin to stabilize training in the early stage and unleash the model's full potential in the late stage.
arXiv Detail & Related papers (2020-04-17T13:59:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.