A Light Recipe to Train Robust Vision Transformers
- URL: http://arxiv.org/abs/2209.07399v1
- Date: Thu, 15 Sep 2022 16:00:04 GMT
- Title: A Light Recipe to Train Robust Vision Transformers
- Authors: Edoardo Debenedetti, Vikash Sehwag, Prateek Mittal
- Abstract summary: We show that Vision Transformers (ViTs) can serve as an underlying architecture for improving the robustness of machine learning models against evasion attacks.
We achieve this objective using a custom adversarial training recipe, discovered through rigorous ablation studies on a subset of the ImageNet dataset.
We show that our recipe generalizes to different classes of ViT architectures and large-scale models on full ImageNet-1k.
- Score: 34.51642006926379
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we ask whether Vision Transformers (ViTs) can serve as an
underlying architecture for improving the adversarial robustness of machine
learning models against evasion attacks. While earlier works have focused on
improving Convolutional Neural Networks, we show that ViTs, too, are highly
suitable for adversarial training and can achieve competitive performance. We
achieve this objective using a custom adversarial training recipe, discovered
through rigorous ablation studies on a subset of the ImageNet dataset. The
canonical training recipe for ViTs recommends strong data augmentation, in part
to compensate for the lack of vision inductive bias of attention modules, when
compared to convolutions. We show that this recipe achieves suboptimal
performance when used for adversarial training. In contrast, we find that
omitting all heavy data augmentation and adding a bag of tricks
($\varepsilon$-warmup and larger weight decay) significantly boosts the
performance of robust ViTs. We show that our recipe generalizes to different
classes of ViT architectures and large-scale models on full ImageNet-1k.
Additionally, investigating the reasons for the robustness of our models, we
show that it is easier to generate strong attacks during training when using
our recipe and that this leads to better robustness at test time. Finally, we
further study one consequence of adversarial training by proposing a way to
quantify the semantic nature of adversarial perturbations and highlight its
correlation with the robustness of the model. Overall, we recommend that the
community avoid directly translating canonical ViT training recipes to robust
training and instead rethink common training choices in the context of
adversarial training.
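The recipe's bag of tricks is simple to express in code. Below is a minimal PyTorch sketch of adversarial training with linear $\varepsilon$-warmup and a larger-than-usual weight decay; the attack step count, schedule length, and hyperparameter values are illustrative assumptions, not the authors' exact settings.

```python
import torch
import torch.nn.functional as F

def epsilon_schedule(step, warmup_steps, eps_max=8 / 255):
    """Linearly warm up the perturbation budget from 0 to eps_max."""
    return eps_max * min(1.0, step / warmup_steps)

def pgd_attack(model, x, y, eps, alpha, n_steps=2):
    """L-infinity PGD for training-time adversarial examples
    (pixel-range clamping omitted for brevity)."""
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(n_steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    return (x + delta).detach()

def train_step(model, optimizer, x, y, step, warmup_steps):
    eps = epsilon_schedule(step, warmup_steps)
    x_adv = pgd_attack(model, x, y, eps, alpha=eps / 2) if eps > 0 else x
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()

# Per the recipe: no heavy augmentation (only basic crop/flip) and a
# larger-than-usual weight decay; the value 0.5 is illustrative.
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.5)
```

Warming up $\varepsilon$ lets early training proceed on nearly clean data before the full perturbation budget is applied.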
Related papers
- SpecFormer: Guarding Vision Transformer Robustness via Maximum Singular Value Penalization [39.09638432514626]
Vision Transformers (ViTs) are increasingly used in computer vision due to their high performance, but their vulnerability to adversarial attacks is a concern.
This study introduces SpecFormer, tailored to fortify ViTs against adversarial attacks, with theoretical underpinnings.
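The summary leaves the mechanism abstract, but a maximum-singular-value penalty is commonly implemented with power iteration; the sketch below shows that generic construction and is not necessarily SpecFormer's exact formulation.

```python
import torch
import torch.nn.functional as F

def max_singular_value(w, n_iters=5):
    """Estimate the largest singular value of a weight tensor by power iteration."""
    w2d = w.reshape(w.shape[0], -1)
    v = torch.randn(w2d.shape[1], device=w.device)
    for _ in range(n_iters):
        u = F.normalize(w2d @ v, dim=0)
        v = F.normalize(w2d.t() @ u, dim=0)
    # Treat u and v as constants so only w receives gradient (standard trick).
    return torch.dot(u.detach(), w2d @ v.detach())

def spectral_penalty(model, coeff=1e-4):
    """Sum the estimated max singular values of all weight matrices;
    add the result to the task loss during training."""
    penalty = sum(max_singular_value(p) for p in model.parameters() if p.dim() >= 2)
    return coeff * penalty

# loss = F.cross_entropy(model(x), y) + spectral_penalty(model)
```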
arXiv Detail & Related papers (2024-01-02T14:27:24Z)
- MIMIR: Masked Image Modeling for Mutual Information-based Adversarial Robustness [31.603115393528746]
Building robust Vision Transformers (ViTs) is highly dependent on dedicated Adversarial Training (AT) strategies.
We provide a novel theoretical Mutual Information (MI) analysis of ViTs' autoencoder-based self-supervised pre-training.
We propose a masked autoencoder-based pre-training method, MIMIR, that employs an MI penalty to facilitate the adversarial training of ViTs.
arXiv Detail & Related papers (2023-12-08T10:50:02Z)
- Experts Weights Averaging: A New General Training Scheme for Vision Transformers [57.62386892571636]
We propose a training scheme for Vision Transformers (ViTs) that achieves performance improvement without increasing inference cost.
During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs.
After training, we convert each MoE into an FFN by averaging the experts, transforming the model back into the original ViT for inference.
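The convert-back step has a simple realization: element-wise averaging of identically shaped expert weights. A hypothetical sketch under the assumption that all experts share one two-layer FFN shape (the paper's exact scheme may differ):

```python
import copy
import torch
import torch.nn as nn

def average_experts(experts):
    """Collapse identically shaped expert FFNs into one FFN whose parameters
    are the element-wise mean of the experts' parameters."""
    merged = copy.deepcopy(experts[0])
    with torch.no_grad():
        for name, param in merged.named_parameters():
            stacked = torch.stack([dict(e.named_parameters())[name] for e in experts])
            param.copy_(stacked.mean(dim=0))
    return merged

# Example: four experts with the two-layer FFN shape of a ViT-S block.
experts = [nn.Sequential(nn.Linear(384, 1536), nn.GELU(), nn.Linear(1536, 384))
           for _ in range(4)]
ffn = average_experts(experts)  # drop-in FFN for inference, replacing the MoE
```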
arXiv Detail & Related papers (2023-08-11T12:05:12Z)
- Revisiting Adversarial Training for ImageNet: Architectures, Training and Generalization across Threat Models [52.86163536826919]
We revisit adversarial training on ImageNet, comparing ViTs and ConvNeXts.
Our modified ConvNeXt, ConvNeXt + ConvStem, yields the most robust generalization across different ranges of model parameters.
Our ViT + ConvStem yields the best generalization to unseen threat models.
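A ConvStem replaces the ViT's single patchify convolution with a short stack of strided 3x3 convolutions before the transformer blocks. The sketch below shows the general pattern; the channel widths and normalization choices are illustrative, not the paper's exact configuration.

```python
import torch.nn as nn

class ConvStem(nn.Module):
    """Replace a ViT's single 16x16 patchify conv with four stride-2 3x3 convs;
    a 224x224 input is downsampled to the same 14x14 token grid."""
    def __init__(self, embed_dim=384):
        super().__init__()
        layers, in_ch = [], 3
        for out_ch in (48, 96, 192, embed_dim):
            layers += [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                       nn.BatchNorm2d(out_ch),
                       nn.ReLU(inplace=True)]
            in_ch = out_ch
        self.stem = nn.Sequential(*layers)

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.stem(x)                     # (B, embed_dim, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, embed_dim) tokens
```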
arXiv Detail & Related papers (2023-03-03T11:53:01Z)
- When Adversarial Training Meets Vision Transformers: Recipes from Training to Architecture [32.260596998171835]
Adversarial training is still required for ViTs to defend against adversarial attacks.
We find that pre-training and the SGD optimizer are necessary for ViTs' adversarial training.
Our code is available at https://github.com/mo666666/When-Adversarial-Training-Meets-Vision-Transformers.
arXiv Detail & Related papers (2022-10-14T05:37:20Z)
- Towards Efficient Adversarial Training on Vision Transformers [41.6396577241957]
Adversarial training is one of the most effective ways to obtain robust CNNs.
We propose an efficient Attention Guided Adversarial Training mechanism.
With only 65% of the fast adversarial training time, we match the state-of-the-art results on the challenging ImageNet benchmark.
arXiv Detail & Related papers (2022-07-21T14:23:50Z)
- Distributed Adversarial Training to Robustify Deep Neural Networks at Scale [100.19539096465101]
Current deep neural networks (DNNs) are vulnerable to adversarial attacks, where adversarial perturbations to the inputs can change or manipulate the classification decision.
To defend against such attacks, an effective approach known as adversarial training (AT) has been shown to mitigate them.
We propose a large-batch adversarial training framework implemented over multiple machines.
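The summary gives no implementation detail, but the general pattern of data-parallel adversarial training is easy to sketch: each worker crafts adversarial examples for its own shard, and gradient averaging across workers yields the large effective batch. A minimal PyTorch DistributedDataParallel sketch of that idea, not the paper's actual framework:

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

def adv_train_epoch(model, loader, device, eps=8 / 255, alpha=2 / 255, pgd_steps=3):
    """Each rank attacks its local mini-batch; DDP all-reduces gradients, so the
    effective batch size is (local batch) x (world size)."""
    ddp = DDP(model.to(device))
    opt = torch.optim.SGD(ddp.parameters(), lr=0.1, momentum=0.9)
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        # Craft L-inf PGD examples against the local replica (replicas are
        # identical, so bypassing DDP sync during the attack is safe).
        core = ddp.module
        delta = torch.zeros_like(x, requires_grad=True)
        for _ in range(pgd_steps):
            loss = F.cross_entropy(core(x + delta), y)
            g, = torch.autograd.grad(loss, delta)
            delta = (delta + alpha * g.sign()).clamp(-eps, eps)
            delta = delta.detach().requires_grad_(True)
        opt.zero_grad()
        F.cross_entropy(ddp(x + delta.detach()), y).backward()  # grads all-reduced here
        opt.step()

# Assumes dist.init_process_group(...) was called and one process runs per GPU.
```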
arXiv Detail & Related papers (2022-06-13T15:39:43Z)
- Deeper Insights into ViTs Robustness towards Common Corruptions [82.79764218627558]
We investigate how CNN-like architectural designs and CNN-based data augmentation strategies impact ViTs' robustness to common corruptions.
We demonstrate that overlapping patch embedding and convolutional Feed-Forward Networks (FFNs) improve robustness.
We also introduce a novel conditional method enabling input-varied augmentations from two angles.
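Overlapping patch embedding is typically realized as a single convolution whose kernel exceeds its stride, so neighbouring tokens share pixels. A sketch of that standard construction (the kernel, stride, and width are illustrative):

```python
import torch.nn as nn

class OverlappingPatchEmbed(nn.Module):
    """Tokenize with a conv whose kernel (7) exceeds its stride (4), so
    adjacent patches share pixels, unlike ViT's non-overlapping 16x16 patchify."""
    def __init__(self, in_ch=3, embed_dim=384, kernel=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=kernel,
                              stride=stride, padding=kernel // 2)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, embed_dim, H/4, W/4)
        return x.flatten(2).transpose(1, 2)  # (B, HW/16, embed_dim) tokens
```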
arXiv Detail & Related papers (2022-04-26T08:22:34Z)
- How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers [74.06040005144382]
Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications.
We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget.
We train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.
arXiv Detail & Related papers (2021-06-18T17:58:20Z)
- On the Adversarial Robustness of Visual Transformers [129.29523847765952]
This work provides the first comprehensive study on the robustness of vision transformers (ViTs) against adversarial perturbations.
Tested under various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness when compared with convolutional neural networks (CNNs).
arXiv Detail & Related papers (2021-03-29T14:48:24Z)