On the Robustness of Vision Transformers to Adversarial Examples
- URL: http://arxiv.org/abs/2104.02610v2
- Date: Sat, 5 Jun 2021 00:31:29 GMT
- Title: On the Robustness of Vision Transformers to Adversarial Examples
- Authors: Kaleel Mahmood, Rigel Mahmood, Marten van Dijk
- Abstract summary: We study the robustness of Vision Transformers to adversarial examples.
We show that adversarial examples do not readily transfer between CNNs and transformers.
Under a black-box adversary, we show that an ensemble can achieve unprecedented robustness without sacrificing clean accuracy.
- Score: 7.627299398469961
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in attention-based networks have shown that Vision
Transformers can achieve state-of-the-art or near state-of-the-art results on
many image classification tasks. This puts transformers in the unique position
of being a promising alternative to traditional convolutional neural networks
(CNNs). While CNNs have been carefully studied with respect to adversarial
attacks, the same cannot be said of Vision Transformers. In this paper, we
study the robustness of Vision Transformers to adversarial examples. Our
analysis of transformer security is divided into three parts. First, we test
the transformer under standard white-box and black-box attacks. Second, we
study the transferability of adversarial examples between CNNs and
transformers. We show that adversarial examples do not readily transfer between
CNNs and transformers. Based on this finding, we analyze the security of a
simple ensemble defense of CNNs and transformers. By creating a new attack, the
self-attention blended gradient attack, we show that such an ensemble is not
secure under a white-box adversary. However, under a black-box adversary, we
show that an ensemble can achieve unprecedented robustness without sacrificing
clean accuracy. Our analysis for this work is done using six types of white-box
attacks and two types of black-box attacks. Our study encompasses multiple
Vision Transformers, Big Transfer Models and CNN architectures trained on
CIFAR-10, CIFAR-100 and ImageNet.
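
The abstract describes two measurements that are easy to picture in code: crafting white-box adversarial examples on a CNN and checking how often they transfer to a Vision Transformer, and a simple two-model ensemble that exploits low transferability. The sketch below is not the paper's code: PGD stands in for the paper's six white-box attacks, the `timm` model names, epsilon budget, and agreement-based ensemble rule are all illustrative assumptions.

```python
# A minimal sketch (not the paper's code) of the two measurements the abstract
# describes: (1) craft white-box adversarial examples on a CNN and see how
# often they transfer to a Vision Transformer, and (2) a toy two-model
# ensemble that exploits low transferability by requiring both members to
# agree. Assumptions: PGD stands in for the paper's six white-box attacks,
# the timm model names are illustrative, and inputs are 224x224 images
# scaled to [0, 1] (real timm models expect normalized inputs).
import torch
import torch.nn.functional as F
import timm

device = "cuda" if torch.cuda.is_available() else "cpu"
cnn = timm.create_model("resnet50", pretrained=True).to(device).eval()
vit = timm.create_model("vit_base_patch16_224", pretrained=True).to(device).eval()

def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """L-infinity PGD, a standard stand-in for the paper's white-box attacks."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)  # project back into the eps-ball
        x_adv = x_adv.clamp(0, 1)                 # stay a valid image
    return x_adv.detach()

def transfer_rate(src_model, tgt_model, x, y):
    """Fraction of examples crafted on src_model that also fool tgt_model."""
    x_adv = pgd_attack(src_model, x, y)
    with torch.no_grad():
        fooled = tgt_model(x_adv).argmax(dim=1) != y
    return fooled.float().mean().item()

def ensemble_predict(x):
    """Accept a prediction only when the CNN and the ViT agree; one
    perturbation must now fool both models at once to go undetected."""
    with torch.no_grad():
        p_cnn = cnn(x).argmax(dim=1)
        p_vit = vit(x).argmax(dim=1)
    return p_cnn, (p_cnn == p_vit)  # label, and a mask of accepted samples

# Usage, given a batch (x, y) of labeled images:
#   print("CNN -> ViT transfer:", transfer_rate(cnn, vit, x, y))
#   labels, accepted = ensemble_predict(x)
```

If transferability is low, a perturbation crafted on one member rarely fools the other, so the agreement check rejects most transferred attacks; the paper's self-attention blended gradient attack shows that a white-box adversary who optimizes against both members jointly still defeats this kind of ensemble.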
Related papers
- Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles [65.54857068975068]
In this paper, we argue that the additional bulk of modern hierarchical vision transformers is unnecessary.
By pretraining with a strong visual pretext task (MAE), we can strip out all the bells-and-whistles from a state-of-the-art multi-stage vision transformer.
We create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models.
arXiv Detail & Related papers (2023-06-01T17:59:58Z) - On the Surprising Effectiveness of Transformers in Low-Labeled Video
Recognition [18.557920268145818]
Video vision transformers have been shown to be competitive with convolution-based methods (CNNs) broadly across multiple vision tasks.
Our work empirically explores the low data regime for video classification and discovers that, surprisingly, transformers perform extremely well in the low-labeled video setting.
We even show that, using just the labeled data, transformers significantly outperform complex semi-supervised CNN methods that additionally leverage large-scale unlabeled data.
arXiv Detail & Related papers (2022-09-15T17:12:30Z) - An Impartial Take to the CNN vs Transformer Robustness Contest [89.97450887997925]
Recent state-of-the-art CNNs can be as robust and reliable as, and sometimes even more so than, the current state-of-the-art Transformers.
Although it is tempting to declare the definitive superiority of one family of architectures over the other, both seem to enjoy similarly extraordinary performance on a variety of tasks.
arXiv Detail & Related papers (2022-07-22T21:34:37Z) - Can CNNs Be More Robust Than Transformers? [29.615791409258804]
Vision Transformers are shaking the decade-long dominance of Convolutional Neural Networks (CNNs) in image recognition.
Recent research finds that Transformers are inherently more robust than CNNs, regardless of different training setups.
It is believed that this superiority of Transformers should largely be credited to their self-attention-like architectures.
arXiv Detail & Related papers (2022-06-07T17:17:07Z) - Semi-Supervised Vision Transformers [76.83020291497895]
We study the training of Vision Transformers for semi-supervised image classification.
We find that Vision Transformers perform poorly in a semi-supervised ImageNet setting.
CNNs achieve superior results in the small labeled-data regime.
arXiv Detail & Related papers (2021-11-22T09:28:13Z) - Are Transformers More Robust Than CNNs? [17.47001041042089]
We provide the first fair & in-depth comparisons between Transformers and CNNs.
CNNs can easily be as robust as Transformers on defending against adversarial attacks.
Our ablations suggest that this stronger generalization largely benefits from Transformers' self-attention-like architectures.
arXiv Detail & Related papers (2021-11-10T00:18:59Z) - Can Vision Transformers Perform Convolution? [78.42076260340869]
We constructively prove that a single ViT layer with image patches as input can perform any convolution operation.
We provide a lower bound on the number of heads for Vision Transformers to express CNNs.
arXiv Detail & Related papers (2021-11-02T03:30:17Z) - Investigating Transfer Learning Capabilities of Vision Transformers and
CNNs by Fine-Tuning a Single Trainable Block [0.0]
Transformer-based architectures are surpassing the state of the art set by CNN architectures in accuracy, but are computationally very expensive to train from scratch.
We study their transfer learning capabilities and compare them with CNNs to understand which architecture is better when applied to real-world problems with small data.
We find that transformer-based architectures not only achieve higher accuracy than CNNs, but some transformers achieve this feat with around 4 times fewer parameters.
arXiv Detail & Related papers (2021-10-11T13:43:03Z) - Towards Transferable Adversarial Attacks on Vision Transformers [110.55845478440807]
Vision transformers (ViTs) have demonstrated impressive performance on a series of computer vision tasks, yet they still suffer from adversarial examples.
We introduce a dual attack framework, which contains a Pay No Attention (PNA) attack and a PatchOut attack, to improve the transferability of adversarial samples across different ViTs.
arXiv Detail & Related papers (2021-09-09T11:28:25Z) - Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT achieves 3.8% higher top-1 accuracy than the vanilla ViT.
arXiv Detail & Related papers (2021-08-03T18:04:31Z)