On Improving Adversarial Transferability of Vision Transformers
- URL: http://arxiv.org/abs/2106.04169v1
- Date: Tue, 8 Jun 2021 08:20:38 GMT
- Title: On Improving Adversarial Transferability of Vision Transformers
- Authors: Muzammal Naseer, Kanchana Ranasinghe, Salman Khan, Fahad Shahbaz Khan,
Fatih Porikli
- Abstract summary: Vision transformers (ViTs) process input images as sequences of patches via self-attention.
We study the adversarial feature space of ViT models and their transferability.
We introduce two novel strategies specific to the architecture of ViT models.
- Score: 97.17154635766578
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers (ViTs) process input images as sequences of patches via
self-attention; a radically different architecture than convolutional neural
networks (CNNs). This makes it interesting to study the adversarial feature
space of ViT models and their transferability. In particular, we observe that
adversarial patterns found via conventional adversarial attacks show very low
black-box transferability even for large ViT models. However, we show that this
phenomenon is only due to sub-optimal attack procedures that do not leverage the true representation potential of ViTs. A deep ViT is composed of multiple blocks with a consistent architecture comprising self-attention and feed-forward layers, where each block is capable of independently producing a class token. Formulating an attack using only the last class token (the conventional approach) does not directly leverage the discriminative
information stored in the earlier tokens, leading to poor adversarial
transferability of ViTs. Using the compositional nature of ViT models, we
enhance the transferability of existing attacks by introducing two novel
strategies specific to the architecture of ViT models. (i) Self-Ensemble: We
propose a method to find multiple discriminative pathways by dissecting a
single ViT model into an ensemble of networks. This allows class-specific information at each ViT block to be utilized explicitly. (ii) Token Refinement: We then
propose to refine the tokens to further enhance the discriminative capacity at
each block of ViT. Our token refinement systematically combines the class
tokens with structural information preserved within the patch tokens. An
adversarial attack, when applied to such refined tokens within the ensemble of
classifiers found in a single vision transformer, has significantly higher
transferability.
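The Self-Ensemble strategy lends itself to a compact sketch. Below is a minimal PyTorch illustration, assuming a timm-style ViT exposing `patch_embed`, `cls_token`, `pos_embed`, `blocks`, `norm`, and `head`; reading out each block's class token through the shared head is what turns one ViT into an ensemble, and a single-step attack then maximizes the loss summed over all members. This is an illustration of the strategy, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def self_ensemble_logits(vit, x):
    """Collect class-token logits from every block of a timm-style ViT.

    Assumes `vit` exposes patch_embed, cls_token, pos_embed, blocks,
    norm and head (the usual timm layout); adapt names as needed.
    """
    tokens = vit.patch_embed(x)                         # (B, N, D) patch tokens
    cls = vit.cls_token.expand(x.shape[0], -1, -1)      # (B, 1, D) class token
    tokens = torch.cat([cls, tokens], dim=1) + vit.pos_embed
    logits = []
    for block in vit.blocks:
        tokens = block(tokens)
        # Each block's class token is read out by the shared head,
        # turning one ViT into an ensemble of classifiers.
        logits.append(vit.head(vit.norm(tokens[:, 0])))
    return logits

def self_ensemble_fgsm(vit, x, y, eps=8 / 255):
    """Single-step attack maximizing the loss summed over all blocks.
    Assumes inputs in [0, 1]."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = sum(F.cross_entropy(l, y) for l in self_ensemble_logits(vit, x_adv))
    loss.backward()
    with torch.no_grad():
        x_adv = (x_adv + eps * x_adv.grad.sign()).clamp(0, 1)
    return x_adv.detach()
```

Token Refinement would slot in just before the shared head: per the abstract, a refinement module at each block fuses the class token with structural information from the patch tokens before classification.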
Related papers
- Attacking Transformers with Feature Diversity Adversarial Perturbation [19.597912600568026]
We present a label-free white-box attack approach for ViT-based models that exhibits strong transferability to various black-box models.
Our inspiration comes from the feature collapse phenomenon in ViTs, where the critical attention mechanism depends overly on the low-frequency component of features.
arXiv Detail & Related papers (2024-03-10T00:55:58Z) - Multi-Attribute Vision Transformers are Efficient and Robust Learners [4.53923275658276]
- Multi-Attribute Vision Transformers are Efficient and Robust Learners [4.53923275658276]
Vision Transformers (ViTs) have emerged as a compelling alternative to Convolutional Neural Networks (CNNs).
We present a straightforward yet effective strategy for training a single ViT network on various attributes as distinct tasks.
We assess the resilience of multi-attribute ViTs against adversarial attacks and compare their performance against ViTs designed for single attributes.
arXiv Detail & Related papers (2024-02-12T21:31:13Z) - Self-Ensembling Vision Transformer (SEViT) for Robust Medical Image
Classification [4.843654097048771]
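A minimal sketch of the single-trunk, multi-attribute setup described above: one shared ViT backbone with a separate linear head per attribute, each attribute trained as a distinct task. The attribute names and the `backbone` interface are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiAttributeViT(nn.Module):
    """One shared ViT trunk, one classification head per attribute."""

    def __init__(self, backbone, feat_dim, attr_classes):
        super().__init__()
        self.backbone = backbone  # any ViT returning a (B, feat_dim) embedding
        self.heads = nn.ModuleDict({
            name: nn.Linear(feat_dim, n) for name, n in attr_classes.items()
        })

    def forward(self, x):
        z = self.backbone(x)  # shared representation for all attributes
        return {name: head(z) for name, head in self.heads.items()}

# Hypothetical usage, treating each attribute as its own task:
# model = MultiAttributeViT(vit_trunk, 768, {"age": 8, "gender": 2})
# out = model(images)
# loss = sum(F.cross_entropy(out[k], labels[k]) for k in out)
```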
- Self-Ensembling Vision Transformer (SEViT) for Robust Medical Image Classification [4.843654097048771]
Vision Transformers (ViT) are competing to replace Convolutional Neural Networks (CNN) for various computer vision tasks in medical imaging.
Recent works have shown that ViTs are also susceptible to adversarial attacks and suffer significant performance degradation under attack.
We propose a novel self-ensembling method to enhance the robustness of ViT in the presence of adversarial attacks.
arXiv Detail & Related papers (2022-08-04T19:02:24Z) - Improving the Transferability of Adversarial Examples with Restructure
Embedded Patches [4.476012751070559]
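A rough sketch of such self-ensembling, assuming intermediate MLP heads on each block's class token whose predictions are averaged with the final classifier; SEViT's actual choice of blocks, head architecture, and fusion rule may differ.

```python
import torch
import torch.nn as nn

class SelfEnsembleViT(nn.Module):
    """ViT trunk plus per-block MLP heads whose predictions are averaged."""

    def __init__(self, vit, feat_dim, num_classes):
        super().__init__()
        self.vit = vit  # timm-style ViT: patch_embed, blocks, norm, head
        self.aux_heads = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.GELU(),
                          nn.Linear(feat_dim, num_classes))
            for _ in vit.blocks
        )

    def forward(self, x):
        t = self.vit.patch_embed(x)
        cls = self.vit.cls_token.expand(x.shape[0], -1, -1)
        t = torch.cat([cls, t], dim=1) + self.vit.pos_embed
        preds = []
        for block, head in zip(self.vit.blocks, self.aux_heads):
            t = block(t)
            preds.append(head(t[:, 0]))                       # intermediate class token
        preds.append(self.vit.head(self.vit.norm(t[:, 0])))   # final head
        # Ensemble by averaging softmax probabilities across all members.
        return torch.stack([p.softmax(-1) for p in preds]).mean(0)
```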
- Improving the Transferability of Adversarial Examples with Restructure Embedded Patches [4.476012751070559]
We attack the unique self-attention mechanism in ViTs by restructuring the embedded patches of the input.
Our method generates adversarial examples on white-box ViTs with higher transferability and higher image quality.
arXiv Detail & Related papers (2022-04-27T03:22:55Z) - Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
- Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN with ViTs significantly surpasses other few-shot learning frameworks built on ViTs and is the first to achieve higher performance than state-of-the-art CNN counterparts.
arXiv Detail & Related papers (2022-03-14T12:53:27Z) - Self-slimmed Vision Transformer [52.67243496139175]
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z) - Towards Transferable Adversarial Attacks on Vision Transformers [110.55845478440807]
- Towards Transferable Adversarial Attacks on Vision Transformers [110.55845478440807]
Vision transformers (ViTs) have demonstrated impressive performance on a series of computer vision tasks, yet they still suffer from adversarial examples.
We introduce a dual attack framework, which contains a Pay No Attention (PNA) attack and a PatchOut attack, to improve the transferability of adversarial samples across different ViTs.
arXiv Detail & Related papers (2021-09-09T11:28:25Z) - Intriguing Properties of Vision Transformers [114.28522466830374]
- Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study the properties of ViTs via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to the flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z) - On the Adversarial Robustness of Visual Transformers [129.29523847765952]
This work provides the first comprehensive study of the robustness of vision transformers (ViTs) against adversarial perturbations.
Tested under various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness than convolutional neural networks (CNNs).
arXiv Detail & Related papers (2021-03-29T14:48:24Z)