Improve Vision Transformers Training by Suppressing Over-smoothing
- URL: http://arxiv.org/abs/2104.12753v1
- Date: Mon, 26 Apr 2021 17:43:04 GMT
- Title: Improve Vision Transformers Training by Suppressing Over-smoothing
- Authors: Chengyue Gong, Dilin Wang, Meng Li, Vikas Chandra, Qiang Liu
- Abstract summary: Introducing the transformer structure into computer vision tasks holds the promise of yielding a better speed-accuracy trade-off than traditional convolution networks.
However, directly training vanilla transformers on vision tasks has been shown to yield unstable and sub-optimal results.
Recent works propose to modify transformer structures by incorporating convolutional layers to improve performance on vision tasks.
- Score: 28.171262066145612
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Introducing the transformer structure into computer vision tasks holds the promise of yielding a better speed-accuracy trade-off than traditional convolution networks. However, directly training vanilla transformers on vision tasks has been shown to yield unstable and sub-optimal results. As a result, recent works propose to modify transformer structures by incorporating convolutional layers to improve performance on vision tasks. This work investigates how to stabilize the training of vision transformers without special structure modification. We observe that the instability of transformer training on vision tasks can be attributed to the over-smoothing problem: the self-attention layers tend to map the different patches of the input image to similar latent representations, causing loss of information and degeneration of performance, especially when the number of layers is large. We then propose a number of techniques to alleviate this problem, including additional loss functions that encourage diversity, prevent loss of information, and discriminate between patches via an additional patch classification loss for CutMix. We show that our proposed techniques stabilize training and allow us to train wider and deeper vision transformers, achieving 85.0% top-1 accuracy on the ImageNet validation set without introducing extra teachers or additional convolution layers. Our code will be made publicly available at https://github.com/ChengyueGongR/PatchVisionTransformer .
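The two auxiliary objectives described in the abstract can be illustrated with a short sketch: a patch-diversity penalty that discourages the attention layers from collapsing different patch tokens into near-identical representations, and a patch-level classification loss that supervises each patch with the label of the image it was cut from under CutMix. The following is a minimal PyTorch sketch under assumed tensor shapes; the function names (`patch_diversity_loss`, `patch_cutmix_loss`) and the exact loss forms are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (not the paper's official code) of two auxiliary losses suggested
# by the abstract: a patch-diversity penalty and a patch-level CutMix classification
# loss. Tensor shapes and names below are assumptions for illustration only.
import torch
import torch.nn.functional as F


def patch_diversity_loss(tokens: torch.Tensor) -> torch.Tensor:
    """Penalize high average pairwise cosine similarity between patch tokens.

    tokens: (batch, num_patches, dim) patch representations from a transformer layer.
    """
    t = F.normalize(tokens, dim=-1)                    # unit-normalize each patch token
    sim = t @ t.transpose(1, 2)                        # (batch, P, P) cosine similarities
    p = sim.size(1)
    off_diag = sim - torch.eye(p, device=sim.device)   # zero out self-similarity on the diagonal
    return off_diag.mean()                             # lower value = more diverse patch tokens


def patch_cutmix_loss(patch_logits: torch.Tensor, patch_labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on each patch token against the class of the image it was cut from.

    patch_logits: (batch, num_patches, num_classes) from a lightweight per-patch head.
    patch_labels: (batch, num_patches) CutMix source label of each patch.
    """
    return F.cross_entropy(patch_logits.flatten(0, 1), patch_labels.flatten())


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 384)                  # e.g. 14x14 patches, embedding dim 384
    logits = torch.randn(2, 196, 1000)
    labels = torch.randint(0, 1000, (2, 196))
    print(patch_diversity_loss(tokens).item(), patch_cutmix_loss(logits, labels).item())
```

In training, such terms would be added to the standard classification loss with small weights; the weighting and the layer from which the patch tokens are taken are hyperparameters not specified here.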
Related papers
- Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z) - Three things everyone should know about Vision Transformers [67.30250766591405]
transformer architectures have rapidly gained traction in computer vision.
We offer three insights based on simple and easy to implement variants of vision transformers.
We evaluate the impact of these design choices using the ImageNet-1k dataset, and confirm our findings on the ImageNet-v2 test set.
arXiv Detail & Related papers (2022-03-18T08:23:03Z) - AdaViT: Adaptive Vision Transformers for Efficient Image Recognition [78.07924262215181]
We introduce AdaViT, an adaptive framework that learns to derive usage policies on which patches, self-attention heads and transformer blocks to use.
Our method obtains more than a 2x improvement in efficiency over state-of-the-art vision transformers with only a 0.8% drop in accuracy.
arXiv Detail & Related papers (2021-11-30T18:57:02Z) - Blending Anti-Aliasing into Vision Transformer [57.88274087198552]
discontinuous patch-wise tokenization process implicitly introduces jagged artifacts into attention maps.
Aliasing effect occurs when discrete patterns are used to produce high frequency or continuous information, resulting in the indistinguishable distortions.
We propose a plug-and-play Aliasing-Reduction Module(ARM) to alleviate the aforementioned issue.
arXiv Detail & Related papers (2021-10-28T14:30:02Z) - Exploring and Improving Mobile Level Vision Transformers [81.7741384218121]
We study the vision transformer structure in the mobile level in this paper, and find a dramatic performance drop.
We propose a novel irregular patch embedding module and adaptive patch fusion module to improve the performance.
arXiv Detail & Related papers (2021-08-30T06:42:49Z) - Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
Less attention vIsion Transformer builds upon the fact that convolutions, fully-connected layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z) - CvT: Introducing Convolutions to Vision Transformers [44.74550305869089]
Convolutional vision Transformer (CvT) improves Vision Transformer (ViT) in performance and efficiency.
The new architecture introduces convolutions into ViT to yield the best of both designs.
arXiv Detail & Related papers (2021-03-29T17:58:22Z)