A Light Recipe to Train Robust Vision Transformers
- URL: http://arxiv.org/abs/2209.07399v1
- Date: Thu, 15 Sep 2022 16:00:04 GMT
- Title: A Light Recipe to Train Robust Vision Transformers
- Authors: Edoardo Debenedetti, Vikash Sehwag, Prateek Mittal
- Abstract summary: We show that Vision Transformers (ViTs) can serve as an underlying architecture for improving the robustness of machine learning models against evasion attacks.
We achieve this objective using a custom adversarial training recipe, discovered through rigorous ablation studies on a subset of the ImageNet dataset.
We show that our recipe generalizes to different classes of ViT architectures and large-scale models on full ImageNet-1k.
- Score: 34.51642006926379
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we ask whether Vision Transformers (ViTs) can serve as an
underlying architecture for improving the adversarial robustness of machine
learning models against evasion attacks. While earlier works have focused on
improving Convolutional Neural Networks, we show that ViTs, too, are highly
suitable for adversarial training and can achieve competitive performance. We
achieve this objective using a custom adversarial training recipe, discovered
through rigorous ablation studies on a subset of the ImageNet dataset. The
canonical training recipe for ViTs recommends strong data augmentation, in part
to compensate for the lack of vision inductive bias of attention modules, when
compared to convolutions. We show that this recipe achieves suboptimal
performance when used for adversarial training. In contrast, we find that
omitting all heavy data augmentation and adding a bag of tricks
($\varepsilon$-warmup and larger weight decay) significantly boosts the
performance of robust ViTs. We show that our recipe generalizes to different
classes of ViT architectures and large-scale models on full ImageNet-1k.
Additionally, investigating the reasons for the robustness of our models, we
show that it is easier to generate strong attacks during training when using
our recipe and that this leads to better robustness at test time. Finally, we
further study one consequence of adversarial training by proposing a way to
quantify the semantic nature of adversarial perturbations and highlight its
correlation with the robustness of the model. Overall, we recommend that the
community avoid directly translating canonical ViT training recipes to robust
training and instead rethink common training choices in the context of
adversarial training.
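The recipe's bag of tricks is simple to express in code. Below is a minimal PyTorch sketch of adversarial training with linear $\varepsilon$-warmup and a larger-than-usual weight decay; the attack step count, schedule length, and hyperparameter values are illustrative assumptions, not the authors' exact settings.

```python
import torch
import torch.nn.functional as F

def epsilon_schedule(step, warmup_steps, eps_max=8 / 255):
    """Linearly warm up the perturbation budget from 0 to eps_max."""
    return eps_max * min(1.0, step / warmup_steps)

def pgd_attack(model, x, y, eps, alpha, n_steps=2):
    """L-infinity PGD for training-time adversarial examples
    (pixel-range clamping omitted for brevity)."""
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(n_steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    return (x + delta).detach()

def train_step(model, optimizer, x, y, step, warmup_steps):
    eps = epsilon_schedule(step, warmup_steps)
    x_adv = pgd_attack(model, x, y, eps, alpha=eps / 2) if eps > 0 else x
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()

# Per the recipe: no heavy augmentation (only basic crop/flip) and a
# larger-than-usual weight decay; the value 0.5 is illustrative.
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.5)
```

Warming up $\varepsilon$ lets early training proceed on nearly clean data before the full perturbation budget is applied.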
Related papers
- SpecFormer: Guarding Vision Transformer Robustness via Maximum Singular Value Penalization [39.09638432514626]
Vision Transformers (ViTs) are increasingly used in computer vision due to their high performance, but their vulnerability to adversarial attacks is a concern.
This study introduces SpecFormer, tailored to fortify ViTs against adversarial attacks, with theoretical underpinnings.
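The summary leaves the mechanism abstract, but a maximum-singular-value penalty is commonly implemented with power iteration; the sketch below shows that generic construction and is not necessarily SpecFormer's exact formulation.

```python
import torch
import torch.nn.functional as F

def max_singular_value(w, n_iters=5):
    """Estimate the largest singular value of a weight tensor by power iteration."""
    w2d = w.reshape(w.shape[0], -1)
    v = torch.randn(w2d.shape[1], device=w.device)
    for _ in range(n_iters):
        u = F.normalize(w2d @ v, dim=0)
        v = F.normalize(w2d.t() @ u, dim=0)
    # Treat u and v as constants so only w receives gradient (standard trick).
    return torch.dot(u.detach(), w2d @ v.detach())

def spectral_penalty(model, coeff=1e-4):
    """Sum the estimated max singular values of all weight matrices;
    add the result to the task loss during training."""
    penalty = sum(max_singular_value(p) for p in model.parameters() if p.dim() >= 2)
    return coeff * penalty

# loss = F.cross_entropy(model(x), y) + spectral_penalty(model)
```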
arXiv Detail & Related papers (2024-01-02T14:27:24Z)
- MIMIR: Masked Image Modeling for Mutual Information-based Adversarial Robustness [31.603115393528746]
Building robust Vision Transformers (ViTs) is highly dependent on dedicated Adversarial Training (AT) strategies.
We provide a novel theoretical Mutual Information (MI) analysis of ViTs' autoencoder-based self-supervised pre-training.
We propose a masked autoencoder-based pre-training method, MIMIR, that employs an MI penalty to facilitate the adversarial training of ViTs.
arXiv Detail & Related papers (2023-12-08T10:50:02Z)
- Experts Weights Averaging: A New General Training Scheme for Vision Transformers [57.62386892571636]
We propose a training scheme for Vision Transformers (ViTs) that achieves performance improvement without increasing inference cost.
During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs.
After training, we convert each MoE into an FFN by averaging the experts, transforming the model back into the original ViT for inference.
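The convert-back step has a simple realization: element-wise averaging of identically shaped expert weights. A hypothetical sketch under the assumption that all experts share one two-layer FFN shape (the paper's exact scheme may differ):

```python
import copy
import torch
import torch.nn as nn

def average_experts(experts):
    """Collapse identically shaped expert FFNs into one FFN whose parameters
    are the element-wise mean of the experts' parameters."""
    merged = copy.deepcopy(experts[0])
    with torch.no_grad():
        for name, param in merged.named_parameters():
            stacked = torch.stack([dict(e.named_parameters())[name] for e in experts])
            param.copy_(stacked.mean(dim=0))
    return merged

# Example: four experts with the two-layer FFN shape of a ViT-S block.
experts = [nn.Sequential(nn.Linear(384, 1536), nn.GELU(), nn.Linear(1536, 384))
           for _ in range(4)]
ffn = average_experts(experts)  # drop-in FFN for inference, replacing the MoE
```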
arXiv Detail & Related papers (2023-08-11T12:05:12Z)
- Revisiting Adversarial Training for ImageNet: Architectures, Training and Generalization across Threat Models [52.86163536826919]
We revisit adversarial training on ImageNet, comparing ViTs and ConvNeXts.
Our modified ConvNeXt, ConvNeXt + ConvStem, yields the most robust generalization across different ranges of model parameters.
Our ViT + ConvStem yields the best generalization to unseen threat models.
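A ConvStem replaces the ViT's single patchify convolution with a short stack of strided 3x3 convolutions before the transformer blocks. The sketch below shows the general pattern; the channel widths and normalization choices are illustrative, not the paper's exact configuration.

```python
import torch.nn as nn

class ConvStem(nn.Module):
    """Replace a ViT's single 16x16 patchify conv with four stride-2 3x3 convs;
    a 224x224 input is downsampled to the same 14x14 token grid."""
    def __init__(self, embed_dim=384):
        super().__init__()
        layers, in_ch = [], 3
        for out_ch in (48, 96, 192, embed_dim):
            layers += [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                       nn.BatchNorm2d(out_ch),
                       nn.ReLU(inplace=True)]
            in_ch = out_ch
        self.stem = nn.Sequential(*layers)

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.stem(x)                     # (B, embed_dim, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, embed_dim) tokens
```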
arXiv Detail & Related papers (2023-03-03T11:53:01Z)
- When Adversarial Training Meets Vision Transformers: Recipes from Training to Architecture [32.260596998171835]
Adversarial training is still required for ViTs to defend against adversarial attacks.
We find that pre-training and the SGD optimizer are necessary for ViTs' adversarial training.
Our code is available at https://github.com/mo666666/When-Adversarial-Training-Meets-Vision-Transformers.
arXiv Detail & Related papers (2022-10-14T05:37:20Z)
- Towards Efficient Adversarial Training on Vision Transformers [41.6396577241957]
Adversarial training is one of the most effective ways to obtain robust CNNs.
We propose an efficient Attention Guided Adversarial Training mechanism.
With only 65% of the fast adversarial training time, we match the state-of-the-art results on the challenging ImageNet benchmark.
arXiv Detail & Related papers (2022-07-21T14:23:50Z)
- Distributed Adversarial Training to Robustify Deep Neural Networks at Scale [100.19539096465101]
Current deep neural networks (DNNs) are vulnerable to adversarial attacks, where adversarial perturbations to the inputs can change or manipulate the classification decision.
To defend against such attacks, an effective approach known as adversarial training (AT) has been shown to mitigate them.
We propose a large-batch adversarial training framework implemented over multiple machines.
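The summary gives no implementation detail, but the general pattern of data-parallel adversarial training is easy to sketch: each worker crafts adversarial examples for its own shard, and gradient averaging across workers yields the large effective batch. A minimal PyTorch DistributedDataParallel sketch of that idea, not the paper's actual framework:

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

def adv_train_epoch(model, loader, device, eps=8 / 255, alpha=2 / 255, pgd_steps=3):
    """Each rank attacks its local mini-batch; DDP all-reduces gradients, so the
    effective batch size is (local batch) x (world size)."""
    ddp = DDP(model.to(device))
    opt = torch.optim.SGD(ddp.parameters(), lr=0.1, momentum=0.9)
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        # Craft L-inf PGD examples against the local replica (replicas are
        # identical, so bypassing DDP sync during the attack is safe).
        core = ddp.module
        delta = torch.zeros_like(x, requires_grad=True)
        for _ in range(pgd_steps):
            loss = F.cross_entropy(core(x + delta), y)
            g, = torch.autograd.grad(loss, delta)
            delta = (delta + alpha * g.sign()).clamp(-eps, eps)
            delta = delta.detach().requires_grad_(True)
        opt.zero_grad()
        F.cross_entropy(ddp(x + delta.detach()), y).backward()  # grads all-reduced here
        opt.step()

# Assumes dist.init_process_group(...) was called and one process runs per GPU.
```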
arXiv Detail & Related papers (2022-06-13T15:39:43Z)
- Deeper Insights into ViTs Robustness towards Common Corruptions [82.79764218627558]
We investigate how CNN-like architectural designs and CNN-based data augmentation strategies impact ViTs' robustness to common corruptions.
We demonstrate that overlapping patch embedding and convolutional Feed-Forward Networks (FFNs) improve robustness.
We also introduce a novel conditional method enabling input-varied augmentations from two angles.
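Overlapping patch embedding is typically realized as a single convolution whose kernel exceeds its stride, so neighbouring tokens share pixels. A sketch of that standard construction (the kernel, stride, and width are illustrative):

```python
import torch.nn as nn

class OverlappingPatchEmbed(nn.Module):
    """Tokenize with a conv whose kernel (7) exceeds its stride (4), so
    adjacent patches share pixels, unlike ViT's non-overlapping 16x16 patchify."""
    def __init__(self, in_ch=3, embed_dim=384, kernel=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=kernel,
                              stride=stride, padding=kernel // 2)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, embed_dim, H/4, W/4)
        return x.flatten(2).transpose(1, 2)  # (B, HW/16, embed_dim) tokens
```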
arXiv Detail & Related papers (2022-04-26T08:22:34Z)
- How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers [74.06040005144382]
Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications.
We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget.
We train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.
arXiv Detail & Related papers (2021-06-18T17:58:20Z)
- On the Adversarial Robustness of Visual Transformers [129.29523847765952]
This work provides the first comprehensive study on the robustness of vision transformers (ViTs) against adversarial perturbations.
Tested under various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness when compared with convolutional neural networks (CNNs).
arXiv Detail & Related papers (2021-03-29T14:48:24Z)