Patch-Fool: Are Vision Transformers Always Robust Against Adversarial
Perturbations?
- URL: http://arxiv.org/abs/2203.08392v1
- Date: Wed, 16 Mar 2022 04:45:59 GMT
- Title: Patch-Fool: Are Vision Transformers Always Robust Against Adversarial
Perturbations?
- Authors: Yonggan Fu, Shunyao Zhang, Shang Wu, Cheng Wan, Yingyan Lin
- Abstract summary: Vision transformers (ViTs) have recently set off a new wave in neural architecture design thanks to their record-breaking performance in vision tasks.
Recent works show that ViTs are more robust against adversarial attacks than convolutional neural networks (CNNs).
We propose a dedicated attack framework, dubbed Patch-Fool, that fools the self-attention mechanism by attacking its basic component.
- Score: 21.32962679185015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers (ViTs) have recently set off a new wave in neural
architecture design thanks to their record-breaking performance in various
vision tasks. In parallel, to fulfill the goal of deploying ViTs into
real-world vision applications, their robustness against potential malicious
attacks has gained increasing attention. In particular, recent works show that
ViTs are more robust against adversarial attacks as compared with convolutional
neural networks (CNNs), and conjecture that this is because ViTs focus more on
capturing global interactions among different input/feature patches, leading to
their improved robustness to local perturbations imposed by adversarial
attacks. In this work, we ask an intriguing question: "Under what kinds of
perturbations do ViTs become more vulnerable learners compared to CNNs?" Driven
by this question, we first conduct comprehensive experiments on the robustness of both ViTs and
CNNs under various existing adversarial attacks to understand the underlying
reasons for their robustness. Based on the drawn
insights, we then propose a dedicated attack framework, dubbed Patch-Fool, that
fools the self-attention mechanism by attacking its basic component (i.e., a
single patch) with a series of attention-aware optimization techniques.
Interestingly, our Patch-Fool framework shows for the first time that ViTs are
not necessarily more robust than CNNs against adversarial perturbations. In
particular, we find that ViTs are consistently more vulnerable than CNNs to our
Patch-Fool attack across extensive experiments. Moreover, observations from
Sparse Patch-Fool and Mild Patch-Fool, two variants of Patch-Fool, indicate
that the perturbation density and the perturbation strength on each patch are
the key factors influencing the robustness ranking between ViTs and CNNs.
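To make the attack recipe concrete, below is a minimal PyTorch sketch of a single-patch, attention-aware adversarial attack. It is an illustration under stated assumptions, not the authors' Patch-Fool implementation: the `model` module (assumed to map a batch of images in [0, 1] to class logits), the optional per-patch `patch_scores` (e.g., attention-rollout scores used to pick the attacked patch), and the step count and step size are all hypothetical choices.

```python
# Hedged sketch of a single-patch, attention-aware attack on a ViT classifier.
# Assumptions: `model` is a PyTorch module returning logits for (B, 3, H, W)
# images in [0, 1]; `label` is a LongTensor of shape (B,); `patch_scores` is an
# optional 1-D tensor of per-patch saliency/attention scores (hypothetical).
import torch
import torch.nn.functional as F

def single_patch_attack(model, image, label, patch_size=16,
                        steps=100, step_size=0.05, patch_scores=None):
    """Perturb one patch of `image` to maximize the classification loss."""
    model.eval()
    _, _, h, w = image.shape
    n_side = h // patch_size

    # Attention-aware patch selection: take the highest-scoring patch if
    # scores are given, otherwise fall back to the center patch.
    if patch_scores is not None:
        idx = int(patch_scores.argmax())
    else:
        idx = (n_side // 2) * n_side + n_side // 2
    row, col = divmod(idx, n_side)

    # Binary mask that confines the perturbation to the chosen patch.
    mask = torch.zeros_like(image)
    mask[:, :, row * patch_size:(row + 1) * patch_size,
         col * patch_size:(col + 1) * patch_size] = 1.0

    # Sign-gradient ascent on the classification loss, restricted to the patch
    # and kept within the valid [0, 1] pixel range.
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        logits = model((image + delta * mask).clamp(0.0, 1.0))
        loss = F.cross_entropy(logits, label)
        loss.backward()
        with torch.no_grad():
            delta += step_size * delta.grad.sign()
            delta.grad.zero_()
    return (image + delta.detach() * mask).clamp(0.0, 1.0)
```

Under this framing, a sparse variant would additionally cap how many pixels inside the patch may be perturbed, and a mild variant would bound the perturbation magnitude, which is where the abstract's perturbation density and strength factors would enter the picture.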
Related papers
- Query-Efficient Hard-Label Black-Box Attack against Vision Transformers [9.086983253339069]
Vision transformers (ViTs) face security risks from adversarial attacks similar to those of deep convolutional neural networks (CNNs).
This article explores the vulnerability of ViTs against adversarial attacks under a black-box scenario.
We propose a novel query-efficient hard-label adversarial attack method called AdvViT.
arXiv Detail & Related papers (2024-06-29T10:09:12Z)
- Defending Backdoor Attacks on Vision Transformer via Patch Processing [18.50522247164383]
Vision Transformers (ViTs) have a radically different architecture with significantly less inductive bias than Convolutional Neural Networks.
This paper investigates a representative causative attack, i.e., backdoor attacks.
We propose an effective method for ViTs to defend both patch-based and blending-based trigger backdoor attacks via patch processing.
arXiv Detail & Related papers (2022-06-24T17:29:47Z)
- Deeper Insights into ViTs Robustness towards Common Corruptions [82.79764218627558]
We investigate how CNN-like architectural designs and CNN-based data augmentation strategies impact ViTs' robustness towards common corruptions.
We demonstrate that overlapping patch embedding and a convolutional Feed-Forward Network (FFN) improve robustness.
We also introduce a novel conditional method enabling input-varied augmentations from two angles.
arXiv Detail & Related papers (2022-04-26T08:22:34Z)
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z)
- Are Vision Transformers Robust to Patch Perturbations? [18.491213370656855]
We study the robustness of vision transformers to patch-wise perturbations.
We reveal that ViTs' stronger robustness to naturally corrupted patches and higher vulnerability to adversarial patches are both caused by the attention mechanism.
arXiv Detail & Related papers (2021-11-20T19:00:51Z)
- Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs [71.44985408214431]
Convolutional Neural Networks (CNNs) have become the de facto gold standard in computer vision applications.
New model architectures have been proposed challenging the status quo.
arXiv Detail & Related papers (2021-10-06T14:18:47Z)
- Towards Transferable Adversarial Attacks on Vision Transformers [110.55845478440807]
Vision transformers (ViTs) have demonstrated impressive performance on a series of computer vision tasks, yet they still suffer from adversarial examples.
We introduce a dual attack framework, which contains a Pay No Attention (PNA) attack and a PatchOut attack, to improve the transferability of adversarial samples across different ViTs.
arXiv Detail & Related papers (2021-09-09T11:28:25Z)
- On Improving Adversarial Transferability of Vision Transformers [97.17154635766578]
Vision transformers (ViTs) process input images as sequences of patches via self-attention.
We study the adversarial feature space of ViT models and their transferability.
We introduce two novel strategies specific to the architecture of ViT models.
arXiv Detail & Related papers (2021-06-08T08:20:38Z)
- Reveal of Vision Transformers Robustness against Adversarial Attacks [13.985121520800215]
This work studies the robustness of ViT variants against different $L_p$-based adversarial attacks in comparison with CNNs.
We provide an analysis that reveals that vanilla ViT or hybrid-ViT are more robust than CNNs.
arXiv Detail & Related papers (2021-06-07T15:59:49Z)
- On the Adversarial Robustness of Visual Transformers [129.29523847765952]
This work provides the first and comprehensive study on the robustness of vision transformers (ViTs) against adversarial perturbations.
Tested on various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness than convolutional neural networks (CNNs).
arXiv Detail & Related papers (2021-03-29T14:48:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.