Are Vision Transformers Robust to Patch Perturbations?
- URL: http://arxiv.org/abs/2111.10659v1
- Date: Sat, 20 Nov 2021 19:00:51 GMT
- Title: Are Vision Transformers Robust to Patch Perturbations?
- Authors: Jindong Gu, Volker Tresp, Yao Qin
- Abstract summary: We study the robustness of vision transformers to patch-wise perturbations.
We reveal that ViT's stronger robustness to naturally corrupted patches and its higher vulnerability to adversarial patches are both caused by the attention mechanism.
- Score: 18.491213370656855
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in Vision Transformers (ViT) have demonstrated
impressive performance in image classification, making ViT a promising
alternative to Convolutional Neural Networks (CNNs). Unlike CNNs, ViT represents
an input image as a sequence of image patches. This patch-wise input
representation raises an interesting question: How does ViT perform
when individual input image patches are perturbed with natural corruptions or
adversarial perturbations, compared to CNNs? In this work, we study the
robustness of vision transformers to patch-wise perturbations. Surprisingly, we
find that vision transformers are more robust to naturally corrupted patches
than CNNs, whereas they are more vulnerable to adversarial patches.
Furthermore, we conduct extensive qualitative and quantitative experiments to
understand this robustness to patch perturbations. We reveal that ViT's
stronger robustness to naturally corrupted patches and its higher
vulnerability to adversarial patches are both caused by the attention
mechanism. Specifically, attention helps improve the robustness of vision
transformers by effectively ignoring naturally corrupted patches. However,
when vision transformers are attacked by an adversary, the attention
mechanism is easily fooled into focusing on the adversarially perturbed
patches, causing misclassification.
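To make the contrast concrete, here is a minimal PyTorch sketch of the two patch perturbations the abstract compares. The model, patch size, and hyperparameters are illustrative assumptions (e.g., a 224x224 ViT such as `timm.create_model('vit_base_patch16_224', pretrained=True)`), not the paper's exact protocol.

```python
import torch
import torch.nn.functional as F

PATCH = 16  # assumed ViT patch size

def corrupt_patch(image, row, col, sigma=0.5):
    """Natural corruption: add Gaussian noise to one patch of a (1,3,H,W) image."""
    noisy = image.clone()
    ys, xs = row * PATCH, col * PATCH
    noisy[:, :, ys:ys+PATCH, xs:xs+PATCH] += sigma * torch.randn(1, 3, PATCH, PATCH)
    return noisy.clamp(0, 1)

def adversarial_patch(model, image, label, row, col, steps=100, lr=0.05):
    """Adversarial patch: optimize one patch's pixels (full [0,1] range allowed)
    to maximize the classification loss, leaving all other patches untouched."""
    ys, xs = row * PATCH, col * PATCH
    patch = image[:, :, ys:ys+PATCH, xs:xs+PATCH].clone().requires_grad_(True)
    for _ in range(steps):
        x = image.clone()
        x[:, :, ys:ys+PATCH, xs:xs+PATCH] = patch
        loss = F.cross_entropy(model(x), label)
        grad, = torch.autograd.grad(loss, patch)
        patch = (patch + lr * grad.sign()).clamp(0, 1).detach().requires_grad_(True)
    adv = image.clone()
    adv[:, :, ys:ys+PATCH, xs:xs+PATCH] = patch.detach()
    return adv
```

Comparing `model(...)` predictions under the two functions across many patch locations is enough to observe the asymmetry reported above: a ViT typically shrugs off `corrupt_patch` more often than a comparable CNN, yet flips its prediction more readily under `adversarial_patch`.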
Related papers
- Attention Deficit is Ordered! Fooling Deformable Vision Transformers with Collaborative Adversarial Patches [3.4673556247932225]
Deformable vision transformers significantly reduce the complexity of attention modeling.
Recent work has demonstrated adversarial attacks against conventional vision transformers.
We develop new collaborative attacks where a source patch manipulates attention to point to a target patch, which contains the adversarial noise to fool the model.
arXiv Detail & Related papers (2023-11-21T17:55:46Z)
- Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing [64.7892681641764]
We train vision transformers (ViTs) and convolutional neural networks (CNNs) using Patch Mixing, an augmentation that inserts patches from other images into a training image.
We find that ViTs neither improve nor degrade when trained using Patch Mixing.
We conclude that this training method is a way of simulating in CNNs the abilities that ViTs already possess.
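As a concrete reference point, a minimal sketch of patch-granular mixing in PyTorch follows. The CutMix-style label mixing by replaced-patch fraction, the patch size, and the mixing ratio are our assumptions; the paper's exact recipe may differ.

```python
import torch

def patch_mixing(images, labels_onehot, patch=16, ratio=0.3):
    """Replace a random subset of patches in each (B,C,H,W) image with patches
    from a shuffled copy of the batch, mixing one-hot labels by the fraction replaced."""
    b, c, h, w = images.shape
    gh, gw = h // patch, w // patch
    perm = torch.randperm(b)

    def to_grid(x):
        # Reshape a batch into per-image patch grids: (B, gh*gw, C, patch, patch).
        x = x.view(b, c, gh, patch, gw, patch).permute(0, 2, 4, 1, 3, 5)
        return x.reshape(b, gh * gw, c, patch, patch)

    grid, donor = to_grid(images).clone(), to_grid(images[perm])
    n_mix = int(ratio * gh * gw)
    for i in range(b):
        idx = torch.randperm(gh * gw)[:n_mix]
        grid[i, idx] = donor[i, idx]  # paste donor patches in place
    mixed = grid.reshape(b, gh, gw, c, patch, patch).permute(0, 3, 1, 4, 2, 5)
    lam = n_mix / (gh * gw)
    return mixed.reshape(b, c, h, w), (1 - lam) * labels_onehot + lam * labels_onehot[perm]
```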
arXiv Detail & Related papers (2023-06-30T17:59:53Z)
- Patch-Fool: Are Vision Transformers Always Robust Against Adversarial Perturbations? [21.32962679185015]
Vision transformers (ViTs) have recently set off a new wave in neural architecture design thanks to their record-breaking performance in vision tasks.
Recent works show that ViTs are more robust against adversarial attacks than convolutional neural networks (CNNs).
We propose a dedicated attack framework, dubbed Patch-Fool, that fools the self-attention mechanism by attacking its basic component, i.e., a single patch.
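A rough sense of the recipe can be had from a simplified sketch. The actual Patch-Fool selects its target patch from attention maps and uses attention-aware losses; the version below substitutes a gradient-saliency patch choice and a plain cross-entropy objective, so it is an approximation under stated assumptions rather than the paper's method.

```python
import torch
import torch.nn.functional as F

def patch_fool_sketch(model, image, label, patch=16, steps=250, lr=0.02):
    """Single-patch attack: pick the patch with the largest loss-gradient energy
    (a stand-in for Patch-Fool's attention-based selection), then give the
    attacker unrestricted control over only that patch's pixels."""
    b, c, h, w = image.shape
    gh, gw = h // patch, w // patch
    # 1) Choose the most influential patch via gradient saliency.
    x = image.clone().requires_grad_(True)
    grad, = torch.autograd.grad(F.cross_entropy(model(x), label), x)
    energy = grad.abs().sum(1).view(b, gh, patch, gw, patch).sum((2, 4))
    flat = energy.view(b, -1).argmax(1)
    row, col = (flat // gw).item(), (flat % gw).item()
    ys, xs = row * patch, col * patch
    # 2) Optimize only that patch, clamped to the valid pixel range.
    p = image[:, :, ys:ys+patch, xs:xs+patch].clone().requires_grad_(True)
    for _ in range(steps):
        x = image.clone()
        x[:, :, ys:ys+patch, xs:xs+patch] = p
        g, = torch.autograd.grad(F.cross_entropy(model(x), label), p)
        p = (p + lr * g.sign()).clamp(0, 1).detach().requires_grad_(True)
    adv = image.clone()
    adv[:, :, ys:ys+patch, xs:xs+patch] = p.detach()
    return adv, (row, col)
```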
arXiv Detail & Related papers (2022-03-16T04:45:59Z)
- Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding [4.961852023598131]
Vision transformers (ViTs) have recently demonstrated state-of-the-art performance in a variety of vision tasks, replacing convolutional neural networks (CNNs).
This paper studies the behavior and robustness of ViT and proposes PreLayerNorm in the patch embedding to improve robustness.
arXiv Detail & Related papers (2021-11-16T12:32:03Z)
- Can Vision Transformers Perform Convolution? [78.42076260340869]
We prove that a single ViT layer with image patches as the input can perform any convolution operation constructively.
We provide a lower bound on the number of heads for Vision Transformers to express CNNs.
arXiv Detail & Related papers (2021-11-02T03:30:17Z)
- Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation [29.08732248577141]
We investigate the robustness of vision transformers (ViTs) through the lens of their special patch-based architectural structure.
We find that ViTs are surprisingly insensitive to patch-based transformations, even when the transformation largely destroys the original semantics.
We show that patch-based negative augmentation consistently improves the robustness of ViTs across a wide set of ImageNet-based robustness benchmarks.
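One plausible instantiation of the idea, sketched in PyTorch below: a patch-shuffle transform that destroys global semantics, and a loss term that pushes predictions on shuffled images toward the uniform distribution. The uniform-target negative loss and its weight are our assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def patch_shuffle(images, patch=16):
    """Destroy global semantics by randomly permuting each image's patches."""
    b, c, h, w = images.shape
    gh, gw = h // patch, w // patch
    grid = images.view(b, c, gh, patch, gw, patch).permute(0, 2, 4, 1, 3, 5)
    grid = grid.reshape(b, gh * gw, c, patch, patch)
    idx = torch.argsort(torch.rand(b, gh * gw), dim=1)  # one permutation per image
    grid = grid[torch.arange(b).unsqueeze(1), idx]
    grid = grid.reshape(b, gh, gw, c, patch, patch).permute(0, 3, 1, 4, 2, 5)
    return grid.reshape(b, c, h, w)

def negative_augmentation_loss(model, images, labels, weight=0.3):
    """Cross-entropy on clean images plus a term discouraging confident
    predictions of any class on patch-shuffled (semantics-free) images."""
    clean_loss = F.cross_entropy(model(images), labels)
    log_probs = F.log_softmax(model(patch_shuffle(images)), dim=1)
    uniform = torch.full_like(log_probs, 1.0 / log_probs.size(1))
    return clean_loss + weight * F.kl_div(log_probs, uniform, reduction='batchmean')
```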
arXiv Detail & Related papers (2021-10-15T04:53:18Z)
- Certified Patch Robustness via Smoothed Vision Transformers [77.30663719482924]
We show how using vision transformers enables significantly better certified patch robustness.
These improvements stem from the inherent ability of the vision transformer to gracefully handle largely masked images.
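The smoothing here is derandomized smoothing with column ablations: the classifier sees many copies of the image in which everything outside a narrow vertical strip is masked, and the final prediction is a vote over the ablations. A minimal sketch, with the band width and wrap-around strip as assumptions:

```python
import torch

@torch.no_grad()
def column_smoothed_votes(model, image, num_classes, band=19):
    """Classify every column ablation of a (1,C,H,W) image: keep only a
    `band`-pixel-wide vertical strip (wrapping around the right edge),
    zero out the rest, and tally one vote per ablation."""
    w = image.shape[-1]
    votes = torch.zeros(num_classes)
    for x0 in range(w):
        cols = torch.arange(x0, x0 + band) % w
        masked = torch.zeros_like(image)
        masked[..., cols] = image[..., cols]
        votes[model(masked).argmax(1)] += 1
    return votes
```

Since an adversarial patch of width m can intersect at most m + band - 1 of these ablations, a sufficiently large vote margin between the top two classes certifies the prediction; ViTs tolerate the heavily masked inputs far better than CNNs, which is where the reported gains come from.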
arXiv Detail & Related papers (2021-10-11T17:44:05Z)
- Towards Transferable Adversarial Attacks on Vision Transformers [110.55845478440807]
Vision transformers (ViTs) have demonstrated impressive performance on a series of computer vision tasks, yet they still suffer from adversarial examples.
We introduce a dual attack framework, which contains a Pay No Attention (PNA) attack and a PatchOut attack, to improve the transferability of adversarial samples across different ViTs.
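PatchOut lends itself to a short sketch: it is essentially an iterative sign-gradient attack in which each update touches only a random subset of patches, acting like dropout on the perturbation so it overfits the source ViT less. PNA is omitted here since it requires modifying backpropagation through the attention maps. The hyperparameters and keep-ratio below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def patchout_attack(model, image, label, patch=16, eps=16/255, alpha=2/255,
                    steps=10, keep=0.5):
    """PatchOut-style attack: per iteration, restrict the sign-gradient update
    to a random ~`keep` fraction of the patch grid."""
    b, c, h, w = image.shape
    gh, gw = h // patch, w // patch
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model((image + delta).clamp(0, 1)), label)
        grad, = torch.autograd.grad(loss, delta)
        # Random patch mask: keep the update on roughly `keep` of the patches.
        mask = (torch.rand(b, 1, gh, gw, device=image.device) < keep).float()
        mask = mask.repeat_interleave(patch, 2).repeat_interleave(patch, 3)
        delta = (delta + alpha * grad.sign() * mask).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    return (image + delta.detach()).clamp(0, 1)
```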
arXiv Detail & Related papers (2021-09-09T11:28:25Z)
- ConvNets vs. Transformers: Whose Visual Representations are More Transferable? [49.62201738334348]
We investigate the transfer learning ability of ConvNets and vision transformers in 15 single-task and multi-task performance evaluations.
We observe consistent advantages of Transformer-based backbones on 13 downstream tasks.
arXiv Detail & Related papers (2021-08-11T16:20:38Z)
- On the Adversarial Robustness of Visual Transformers [129.29523847765952]
This work provides the first and comprehensive study on the robustness of vision transformers (ViTs) against adversarial perturbations.
Tested on various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness than convolutional neural networks (CNNs).
arXiv Detail & Related papers (2021-03-29T14:48:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.