On the Adversarial Robustness of Visual Transformers
- URL: http://arxiv.org/abs/2103.15670v1
- Date: Mon, 29 Mar 2021 14:48:24 GMT
- Title: On the Adversarial Robustness of Visual Transformers
- Authors: Rulin Shao, Zhouxing Shi, Jinfeng Yi, Pin-Yu Chen, Cho-Jui Hsieh
- Abstract summary: This work provides the first and comprehensive study on the robustness of vision transformers (ViTs) against adversarial perturbations.
Tested on various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness when compared with convolutional neural networks (CNNs).
- Score: 129.29523847765952
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Following the success in advancing natural language processing and
understanding, transformers are expected to bring revolutionary changes to
computer vision. This work provides the first and comprehensive study on the
robustness of vision transformers (ViTs) against adversarial perturbations.
Tested on various white-box and transfer attack settings, we find that ViTs
possess better adversarial robustness when compared with convolutional neural
networks (CNNs). We summarize the following main observations contributing to
the improved robustness of ViTs:
1) Features learned by ViTs contain less low-level information and are more
generalizable, which contributes to superior robustness against adversarial
perturbations.
2) Introducing convolutional or tokens-to-token blocks for learning low-level
features in ViTs can improve classification accuracy but at the cost of
adversarial robustness.
3) Increasing the proportion of transformers in the model structure (when the
model consists of both transformer and CNN blocks) leads to better robustness.
But for a pure transformer model, simply increasing the size or adding layers
cannot guarantee a similar effect.
4) Pre-training on larger datasets does not significantly improve adversarial
robustness though it is critical for training ViTs.
5) Adversarial training is also applicable to ViT for training robust models.
Furthermore, feature visualization and frequency analysis are conducted for
explanation. The results show that ViTs are less sensitive to high-frequency
perturbations than CNNs and there is a high correlation between how well the
model learns low-level features and its robustness against different
frequency-based perturbations.
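As a concrete illustration of the white-box setting described above, the following is a minimal sketch of measuring robust accuracy under an L_inf PGD attack. It is not the paper's exact protocol: the attack budget (8/255), step schedule, the timm model names in the usage comment, and the assumption that inputs lie in [0, 1] with normalization folded into the model's forward pass are all illustrative choices.

```python
# Minimal PGD (L_inf) robustness check in PyTorch -- illustrative only.
# Assumes inputs in [0, 1] and that any normalization happens inside model.forward.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Craft adversarial examples with projected gradient descent in an eps L_inf ball."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()   # ascend the loss
        x_adv = x + (x_adv - x).clamp(-eps, eps)       # project back into the eps ball
        x_adv = x_adv.clamp(0, 1)                      # keep a valid image
    return x_adv.detach()

def robust_accuracy(model, loader, device="cuda", eps=8 / 255):
    """Fraction of examples still classified correctly after the attack."""
    model.eval().to(device)
    correct, total = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = pgd_attack(model, x, y, eps=eps)
        with torch.no_grad():
            correct += (model(x_adv).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total

# Hypothetical comparison of a ViT and a CNN under the same budget (model names assumed):
# import timm
# vit = timm.create_model("vit_base_patch16_224", pretrained=True)
# cnn = timm.create_model("resnet50", pretrained=True)
# print(robust_accuracy(vit, test_loader), robust_accuracy(cnn, test_loader))
```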
Related papers
- Exploring Self-Supervised Vision Transformers for Deepfake Detection: A Comparative Analysis [38.074487843137064]
This paper investigates the effectiveness of self-supervised pre-trained vision transformers (ViTs) compared to supervised pre-trained ViTs and convolutional neural networks (ConvNets) for detecting facial deepfake images and videos.
It examines their potential for improved generalization and explainability, especially with limited training data.
By leveraging SSL ViTs for deepfake detection with modest data and partial fine-tuning, we find that they adapt comparably well to deepfake detection and offer explainability via the attention mechanism.
arXiv Detail & Related papers (2024-05-01T07:16:49Z) - When Adversarial Training Meets Vision Transformers: Recipes from Training to Architecture [32.260596998171835]
Adversarial training is still required for ViTs to defend against adversarial attacks.
We find that pre-training and the SGD optimizer are necessary for ViTs' adversarial training.
Our code is available at https://github.com/mo666666/When-Adversarial-Training-Meets-Vision-Transformers.
arXiv Detail & Related papers (2022-10-14T05:37:20Z) - Deeper Insights into ViTs Robustness towards Common Corruptions [82.79764218627558]
We investigate how CNN-like architectural designs and CNN-based data augmentation strategies impact ViTs' robustness towards common corruptions.
We demonstrate that overlapping patch embedding and convolutional Feed-Forward Network (FFN) designs improve robustness.
We also introduce a novel conditional method enabling input-varied augmentations from two angles.
arXiv Detail & Related papers (2022-04-26T08:22:34Z) - Improving Vision Transformers by Revisiting High-frequency Components [106.7140968644414]
We show that Vision Transformer (ViT) models are less effective in capturing the high-frequency components of images than CNN models.
To compensate, we propose HAT, which directly augments high-frequency components of images via adversarial training.
We show that HAT can consistently boost the performance of various ViT models; a frequency-filtering sketch related to this idea follows this list.
arXiv Detail & Related papers (2022-04-03T05:16:51Z) - When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations [111.44860506703307]
Vision Transformers (ViTs) and MLPs signal further efforts to replace hand-wired features or inductive biases with general-purpose neural architectures.
This paper investigates ViTs and MLP-Mixers from the lens of loss geometry, intending to improve the models' data efficiency at training and generalization at inference.
We show that the improved robustness is attributable to sparser active neurons in the first few layers.
The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations.
arXiv Detail & Related papers (2021-06-03T02:08:03Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z) - DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when scaled to be deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity.
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
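To accompany the frequency analysis in the main abstract and the high-frequency augmentation idea in the HAT entry above, here is a small sketch of how one might probe sensitivity to band-limited noise: Gaussian noise is filtered in the 2-D Fourier domain so that only its low- or high-frequency content survives, then added to clean images. The radius threshold, noise scale, and function name are illustrative assumptions, not anything specified by these papers.

```python
# Band-limited Gaussian noise via an FFT mask -- illustrative only.
import torch

def frequency_filtered_noise(shape, radius=16, high_pass=True, scale=0.1, device="cpu"):
    """Gaussian noise kept only outside (high-pass) or inside (low-pass) a
    centered radius in the shifted 2-D Fourier spectrum."""
    noise = torch.randn(shape, device=device) * scale
    spec = torch.fft.fftshift(torch.fft.fft2(noise), dim=(-2, -1))
    h, w = shape[-2], shape[-1]
    yy, xx = torch.meshgrid(
        torch.arange(h, device=device) - h // 2,
        torch.arange(w, device=device) - w // 2,
        indexing="ij",
    )
    dist = (xx.float() ** 2 + yy.float() ** 2).sqrt()
    mask = dist > radius if high_pass else dist <= radius
    spec = spec * mask.float()  # zero out the unwanted frequency band
    return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real

# Hypothetical usage: add the same low- and high-frequency noise to images x in [0, 1]
# and compare how much a ViT's and a CNN's accuracy drop on each version.
# x_high = (x + frequency_filtered_noise(x.shape, high_pass=True, device=x.device)).clamp(0, 1)
# x_low  = (x + frequency_filtered_noise(x.shape, high_pass=False, device=x.device)).clamp(0, 1)
```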