When Adversarial Training Meets Vision Transformers: Recipes from
Training to Architecture
- URL: http://arxiv.org/abs/2210.07540v1
- Date: Fri, 14 Oct 2022 05:37:20 GMT
- Title: When Adversarial Training Meets Vision Transformers: Recipes from
Training to Architecture
- Authors: Yichuan Mo, Dongxian Wu, Yifei Wang, Yiwen Guo, Yisen Wang
- Abstract summary: Adrial training is still required for ViTs to defend against such adversarial attacks.
We find that pre-training and SGD are necessary for ViTs' adversarial training.
Our code is available at https://versa.com/mo666666/When-Adrial-Training-Meets-Vision-Transformers.
- Score: 32.260596998171835
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) have recently achieved competitive performance in
broad vision tasks. Unfortunately, on popular threat models, naturally trained
ViTs are shown to provide no more adversarial robustness than convolutional
neural networks (CNNs). Adversarial training is still required for ViTs to
defend against such adversarial attacks. In this paper, we provide the first
and comprehensive study on the adversarial training recipe of ViTs via
extensive evaluation of various training techniques across benchmark datasets.
We find that pre-training and SGD optimizer are necessary for ViTs' adversarial
training. Further considering ViT as a new type of model architecture, we
investigate its adversarial robustness from the perspective of its unique
architectural components. We find, when randomly masking gradients from some
attention blocks or masking perturbations on some patches during adversarial
training, the adversarial robustness of ViTs can be remarkably improved, which
may potentially open up a line of work to explore the architectural information
inside the newly designed models like ViTs. Our code is available at
https://github.com/mo666666/When-Adversarial-Training-Meets-Vision-Transformers.
Related papers
- MIMIR: Masked Image Modeling for Mutual Information-based Adversarial Robustness [31.603115393528746]
Building robust Vision Transformers (ViTs) is highly dependent on dedicated Adversarial Training (AT) strategies.
We provide a novel theoretical Mutual Information (MI) analysis in its autoencoder-based self-supervised pre-training.
We propose a masked autoencoder-based pre-training method, MIMIR, that employs an MI penalty to facilitate the adversarial training of ViTs.
arXiv Detail & Related papers (2023-12-08T10:50:02Z) - Experts Weights Averaging: A New General Training Scheme for Vision
Transformers [57.62386892571636]
We propose a training scheme for Vision Transformers (ViTs) that achieves performance improvement without increasing inference cost.
During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs.
After training, we convert each MoE into an FFN by averaging the experts, transforming the model back into original ViT for inference.
arXiv Detail & Related papers (2023-08-11T12:05:12Z) - What do Vision Transformers Learn? A Visual Exploration [68.50771218442776]
Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin.
arXiv Detail & Related papers (2022-12-13T16:55:12Z) - A Light Recipe to Train Robust Vision Transformers [34.51642006926379]
We show that Vision Transformers (ViTs) can serve as an underlying architecture for improving the robustness of machine learning models against evasion attacks.
We achieve this objective using a custom adversarial training recipe, discovered using rigorous ablation studies on a subset of the ImageNet dataset.
We show that our recipe generalizes to different classes of ViT architectures and large-scale models on full ImageNet-1k.
arXiv Detail & Related papers (2022-09-15T16:00:04Z) - Towards Efficient Adversarial Training on Vision Transformers [41.6396577241957]
Adversarial training is one of the most effective ways to accomplish robust CNNs.
We propose an efficient Attention Guided Adversarial Training mechanism.
With only 65% of the fast adversarial training time, we match the state-of-the-art results on the challenging ImageNet benchmark.
arXiv Detail & Related papers (2022-07-21T14:23:50Z) - DeiT III: Revenge of the ViT [56.46810490275699]
A Vision Transformer (ViT) is a simple neural architecture amenable to serve several computer vision tasks.
Recent works show that ViTs benefit from self-supervised pre-training, in particular BerT-like pre-training like BeiT.
arXiv Detail & Related papers (2022-04-14T17:13:44Z) - Evaluating Vision Transformer Methods for Deep Reinforcement Learning
from Pixels [7.426118390008397]
We evaluate Vision Transformers (ViT) training methods for image-based reinforcement learning control tasks.
We compare these results to a leading convolutional-network architecture method, RAD.
We find that the CNN architectures trained using RAD still generally provide superior performance.
arXiv Detail & Related papers (2022-04-11T07:10:58Z) - Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z) - On Improving Adversarial Transferability of Vision Transformers [97.17154635766578]
Vision transformers (ViTs) process input images as sequences of patches via self-attention.
We study the adversarial feature space of ViT models and their transferability.
We introduce two novel strategies specific to the architecture of ViT models.
arXiv Detail & Related papers (2021-06-08T08:20:38Z) - On the Adversarial Robustness of Visual Transformers [129.29523847765952]
This work provides the first and comprehensive study on the robustness of vision transformers (ViTs) against adversarial perturbations.
Tested on various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness when compared with convolutional neural networks (CNNs)
arXiv Detail & Related papers (2021-03-29T14:48:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.