Rethinking the Design Principles of Robust Vision Transformer
- URL: http://arxiv.org/abs/2105.07926v1
- Date: Mon, 17 May 2021 15:04:15 GMT
- Title: Rethinking the Design Principles of Robust Vision Transformer
- Authors: Xiaofeng Mao, Gege Qi, Yuefeng Chen, Xiaodan Li, Shaokai Ye, Yuan He,
Hui Xue
- Abstract summary: Recent work on Vision Transformers (ViT) has shown that self-attention-based networks surpass traditional convolutional neural networks (CNNs) in most vision tasks.
In this paper, we rethink the design principles of ViTs from the perspective of robustness.
By combining the robust design components, we propose the Robust Vision Transformer (RVT).
- Score: 28.538786330184642
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances on Vision Transformers (ViT) have shown that
self-attention-based networks, which exploit their ability to model long-range
dependencies, surpass traditional convolutional neural networks (CNNs) in most
vision tasks. To further broaden their applicability to computer vision, many
improved variants have been proposed that redesign the Transformer architecture
to incorporate the strengths of CNNs, i.e., locality and translation invariance,
for better performance. However, these methods consider only the standard
accuracy or computational cost of the model. In this paper, we rethink the
design principles of ViTs from the perspective of robustness. We find that some
design components greatly harm the robustness and generalization ability of
ViTs, while others are beneficial. By combining the robust design components, we
propose the Robust Vision Transformer (RVT), a new vision transformer with
superior performance and strong robustness. We further propose two new
plug-and-play techniques called position-aware attention rescaling and
patch-wise augmentation to train our RVT. The experimental results on ImageNet
and six robustness benchmarks show the advanced robustness and generalization
ability of RVT compared with previous Transformers and state-of-the-art CNNs.
Our RVT-S* also achieves Top-1 rank on multiple robustness leaderboards
including ImageNet-C and ImageNet-Sketch. The code will be available at
https://github.com/vtddggg/Robust-Vision-Transformer.
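The abstract names position-aware attention rescaling only at a high level. As one possible reading, the minimal sketch below rescales a standard multi-head self-attention map with a learnable per-head factor indexed by query/key patch positions; the module name, shapes, and the renormalization step are illustrative assumptions, not the paper's exact formulation (see the released code above for the actual method).

```python
# Hypothetical sketch of position-aware attention rescaling.
# Assumption: the softmax attention map is rescaled by a learnable weight that
# depends only on the (query patch, key patch) position pair; the formulation
# in the RVT paper may differ.
import torch
import torch.nn as nn


class PositionAwareAttention(nn.Module):
    def __init__(self, dim, num_heads=8, num_patches=196):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # One learnable rescaling factor per (head, query patch, key patch) pair.
        self.pos_scale = nn.Parameter(torch.ones(num_heads, num_patches, num_patches))

    def forward(self, x):
        B, N, C = x.shape  # batch, patches, embedding dim
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)

        attn = (q @ k.transpose(-2, -1)) * self.scale            # attention logits
        attn = attn.softmax(dim=-1) * self.pos_scale[:, :N, :N]  # position-aware rescaling
        attn = attn / attn.sum(dim=-1, keepdim=True)             # renormalize rows

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    layer = PositionAwareAttention(dim=384, num_heads=6, num_patches=196)
    tokens = torch.randn(2, 196, 384)
    print(layer(tokens).shape)  # torch.Size([2, 196, 384])
```

Because the rescaling factor is tied to patch positions rather than to content, such a module could inject positional information directly into the attention weights; whether this matches the paper's plug-and-play technique should be checked against the authors' repository.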
Related papers
- Interpret Vision Transformers as ConvNets with Dynamic Convolutions [70.59235381143831]
We interpret vision Transformers as ConvNets with dynamic convolutions, which enables us to characterize existing Transformers and dynamic ConvNets in a unified framework.
Our interpretation can also guide the network design as researchers now can consider vision Transformers from the design space of ConvNets.
arXiv Detail & Related papers (2023-09-19T16:00:49Z) - How to Train Vision Transformer on Small-scale Datasets? [4.56717163175988]
In contrast to convolutional neural networks, Vision Transformer lacks inherent inductive biases.
We show that self-supervised inductive biases can be learned directly from small-scale datasets.
This allows these models to be trained without large-scale pre-training or changes to the model architecture or loss functions.
arXiv Detail & Related papers (2022-10-13T17:59:19Z) - EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision
Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z) - Deeper Insights into ViTs Robustness towards Common Corruptions [82.79764218627558]
We investigate how CNN-like architectural designs and CNN-based data augmentation strategies affect ViTs' robustness to common corruptions.
We demonstrate that overlapping patch embedding and a convolutional Feed-Forward Network (FFN) boost robustness.
We also introduce a novel conditional method enabling input-varied augmentations from two angles.
arXiv Detail & Related papers (2022-04-26T08:22:34Z) - A ConvNet for the 2020s [94.89735578018099]
Vision Transformers (ViTs) quickly superseded ConvNets as the state-of-the-art image classification model.
It is the hierarchical Transformers that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone.
In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve.
arXiv Detail & Related papers (2022-01-10T18:59:10Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z) - LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference [25.63398340113755]
We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime.
We introduce the attention bias, a new way to integrate positional information in vision transformers.
Overall, LeViT significantly outperforms existing convnets and vision transformers with respect to the speed/accuracy tradeoff.
arXiv Detail & Related papers (2021-04-02T16:29:57Z) - Rethinking Spatial Dimensions of Vision Transformers [34.13899937264952]
Vision Transformer (ViT) extends the application range of transformers from language processing to computer vision tasks.
We investigate the role of the spatial dimension conversion and its effectiveness on the transformer-based architecture.
We propose a novel Pooling-based Vision Transformer (PiT) built upon the original ViT model.
arXiv Detail & Related papers (2021-03-30T12:51:28Z) - CvT: Introducing Convolutions to Vision Transformers [44.74550305869089]
The Convolutional vision Transformer (CvT) improves on the Vision Transformer (ViT) in performance and efficiency.
The new architecture introduces convolutions into ViT to yield the best of both designs.
arXiv Detail & Related papers (2021-03-29T17:58:22Z) - On the Adversarial Robustness of Visual Transformers [129.29523847765952]
This work provides the first and comprehensive study on the robustness of vision transformers (ViTs) against adversarial perturbations.
Across various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness than convolutional neural networks (CNNs).
arXiv Detail & Related papers (2021-03-29T14:48:24Z)