Reveal of Vision Transformers Robustness against Adversarial Attacks
- URL: http://arxiv.org/abs/2106.03734v1
- Date: Mon, 7 Jun 2021 15:59:49 GMT
- Title: Reveal of Vision Transformers Robustness against Adversarial Attacks
- Authors: Ahmed Aldahdooh, Wassim Hamidouche, Olivier Deforges
- Abstract summary: This work studies the robustness of ViT variants against different $L_p$-based adversarial attacks in comparison with CNNs.
We provide an analysis revealing that vanilla ViTs and hybrid-ViTs are more robust than CNNs.
- Score: 13.985121520800215
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Attention-based networks have achieved state-of-the-art performance in many
computer vision tasks, such as image classification. Unlike a Convolutional
Neural Network (CNN), the major part of the vanilla Vision Transformer (ViT)
is the attention block, which brings the power of mimicking the global
context of the input image. This power comes at the cost of data hunger:
the larger the training data, the better the performance. To overcome this
limitation, many ViT-based networks, or hybrid-ViTs, have been proposed to
include local context during training. The robustness of ViTs and their
variants against adversarial attacks has not been widely investigated in the
literature. Some robustness attributes were revealed in a few previous works,
and hence more insightful robustness attributes remain unrevealed. This work
studies the robustness of ViT variants 1) against different $L_p$-based
adversarial attacks in comparison with CNNs and 2) under Adversarial Examples
(AEs) after applying preprocessing defense methods. To that end, we run a set
of experiments on 1000 images from ImageNet-1k and then provide an analysis
revealing that vanilla ViTs and hybrid-ViTs are more robust than CNNs. For
instance, we found that 1) vanilla ViTs and hybrid-ViTs are more robust than
CNNs under $L_0$-, $L_1$-, $L_2$-, $L_\infty$-based, and Color Channel
Perturbations (CCP) attacks; 2) vanilla ViTs do not respond to preprocessing
defenses that mainly reduce the high-frequency components, while hybrid-ViTs
are more responsive to such defenses; and 3) CCP can be used as a
preprocessing defense, and larger ViT variants are found to be more
responsive than other models. Furthermore, feature maps, attention maps, and
Grad-CAM visualizations, jointly with image quality measures and the
perturbations' energy spectrum, are provided for a deeper understanding of
attention-based models.
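As a rough illustration of the evaluation protocol above, the sketch below crafts $L_\infty$-bounded PGD adversarial examples for a ViT and a CNN and then checks whether a low-pass (Gaussian blur) preprocessing defense recovers accuracy. The model choices, the random stand-in batch, epsilon, and step sizes are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: L_inf PGD attack plus a high-frequency-suppressing
# preprocessing defense, assuming timm models and inputs in [0, 1]
# (input normalization omitted for brevity).
import torch
import torch.nn.functional as F
import timm
import torchvision.transforms.functional as TF

def pgd_linf(model, x, y, eps=4/255, alpha=1/255, steps=10):
    """Projected gradient descent under an L_inf budget."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Ascend the loss, then project back into the eps-ball and [0, 1].
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + torch.clamp(x_adv - x, -eps, eps)
        x_adv = torch.clamp(x_adv, 0.0, 1.0)
    return x_adv.detach()

@torch.no_grad()
def accuracy(model, x, y):
    return (model(x).argmax(dim=1) == y).float().mean().item()

# Hypothetical victim models; the paper compares several ViT/CNN variants.
vit = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
cnn = timm.create_model("resnet50", pretrained=True).eval()

x = torch.rand(8, 3, 224, 224)  # stand-in batch; the paper uses 1000 ImageNet-1k images

for name, model in [("ViT", vit), ("CNN", cnn)]:
    y = model(x).argmax(dim=1)  # pseudo-labels so the sketch runs end-to-end
    x_adv = pgd_linf(model, x, y)
    # Gaussian blur as a stand-in for defenses that suppress the
    # high-frequency components of the perturbation.
    x_def = TF.gaussian_blur(x_adv, kernel_size=5)
    print(name,
          "clean:", accuracy(model, x, y),
          "adv:", accuracy(model, x_adv, y),
          "adv+blur:", accuracy(model, x_def, y))
```

Per the findings above, the blur step would be expected to change a hybrid-ViT's accuracy noticeably while leaving a vanilla ViT's largely unchanged.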
Related papers
- Query-Efficient Hard-Label Black-Box Attack against Vision Transformers [9.086983253339069]
Vision transformers (ViTs) face similar security risks from adversarial attacks as deep convolutional neural networks (CNNs).
This article explores the vulnerability of ViTs against adversarial attacks under a black-box scenario.
We propose a novel query-efficient hard-label adversarial attack method called AdvViT.
arXiv Detail & Related papers (2024-06-29T10:09:12Z)
- Self-Ensembling Vision Transformer (SEViT) for Robust Medical Image Classification [4.843654097048771]
Vision Transformers (ViTs) are competing to replace Convolutional Neural Networks (CNNs) for various computer vision tasks in medical imaging.
Recent works have shown that ViTs are also susceptible to adversarial attacks and suffer significant performance degradation under attack.
We propose a novel self-ensembling method to enhance the robustness of ViT in the presence of adversarial attacks.
arXiv Detail & Related papers (2022-08-04T19:02:24Z)
- Deeper Insights into ViTs Robustness towards Common Corruptions [82.79764218627558]
We investigate how CNN-like architectural designs and CNN-based data augmentation strategies impact ViTs' robustness towards common corruptions.
We demonstrate that overlapping patch embedding and a convolutional Feed-Forward Network (FFN) boost robustness.
We also introduce a novel conditional method enabling input-varied augmentations from two angles.
arXiv Detail & Related papers (2022-04-26T08:22:34Z)
- Patch-Fool: Are Vision Transformers Always Robust Against Adversarial Perturbations? [21.32962679185015]
Vision transformers (ViTs) have recently set off a new wave in neural architecture design thanks to their record-breaking performance in vision tasks.
Recent works show that ViTs are more robust against adversarial attacks compared with convolutional neural networks (CNNs).
We propose a dedicated attack framework, dubbed Patch-Fool, that fools the self-attention mechanism by attacking its basic component.
arXiv Detail & Related papers (2022-03-16T04:45:59Z)
- Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z)
- How to augment your ViTs? Consistency loss and StyleAug, a random style transfer augmentation [4.3012765978447565]
The Vision Transformer (ViT) architecture has recently achieved competitive performance across a variety of computer vision tasks.
One of the motivations behind ViTs is their weaker inductive biases compared to convolutional neural networks (CNNs).
arXiv Detail & Related papers (2021-12-16T23:56:04Z)
- Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding [4.961852023598131]
Vision transformers (ViTs) have recently demonstrated state-of-the-art performance in a variety of vision tasks, replacing convolutional neural networks (CNNs).
This paper studies the behavior and robustness of ViT.
arXiv Detail & Related papers (2021-11-16T12:32:03Z)
- Towards Transferable Adversarial Attacks on Vision Transformers [110.55845478440807]
Vision transformers (ViTs) have demonstrated impressive performance on a series of computer vision tasks, yet they still suffer from adversarial examples.
We introduce a dual attack framework, which contains a Pay No Attention (PNA) attack and a PatchOut attack, to improve the transferability of adversarial samples across different ViTs.
arXiv Detail & Related papers (2021-09-09T11:28:25Z)
- On Improving Adversarial Transferability of Vision Transformers [97.17154635766578]
Vision transformers (ViTs) process input images as sequences of patches via self-attention.
We study the adversarial feature space of ViT models and their transferability.
We introduce two novel strategies specific to the architecture of ViT models.
arXiv Detail & Related papers (2021-06-08T08:20:38Z)
- On the Adversarial Robustness of Visual Transformers [129.29523847765952]
This work provides the first comprehensive study on the robustness of vision transformers (ViTs) against adversarial perturbations.
Tested on various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness compared with convolutional neural networks (CNNs).
arXiv Detail & Related papers (2021-03-29T14:48:24Z)
- DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when scaled deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity.
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
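Since the Re-attention idea in the entry above is easy to miss in prose, here is a minimal sketch of one plausible reading of it: a learnable $H \times H$ matrix mixes the per-head attention maps before they weight the values, restoring diversity in deep layers. Tensor shapes, the normalization choice, and the initialization of the mixing matrix are assumptions based on the abstract, not a verified reimplementation.

```python
# Rough sketch of a Re-attention layer, assuming a learnable head-mixing
# matrix Theta applied to the softmaxed attention maps.
import torch
import torch.nn as nn

class ReAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        # Learnable head-mixing matrix Theta (H x H), near-identity init (assumed).
        self.theta = nn.Parameter(
            torch.eye(num_heads) + 0.01 * torch.randn(num_heads, num_heads))
        self.norm = nn.BatchNorm2d(num_heads)  # normalizes the mixed attention maps
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, N, C)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)   # each: (B, H, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)            # (B, H, N, N)
        # Re-attention: regenerate the attention maps by mixing across heads.
        attn = torch.einsum("hg,bgnm->bhnm", self.theta, attn)
        attn = self.norm(attn)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

x = torch.randn(2, 197, 384)                   # (batch, tokens, dim) for a ViT-S-like layer
print(ReAttention(dim=384, num_heads=6)(x).shape)  # torch.Size([2, 197, 384])
```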