Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs
- URL: http://arxiv.org/abs/2110.02797v1
- Date: Wed, 6 Oct 2021 14:18:47 GMT
- Title: Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs
- Authors: Philipp Benz, Soomin Ham, Chaoning Zhang, Adil Karjauv, In So Kweon
- Abstract summary: Convolutional Neural Networks (CNNs) have become the de facto gold standard in computer vision applications.
New model architectures have recently been proposed that challenge the status quo.
- Score: 71.44985408214431
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolutional Neural Networks (CNNs) have become the de facto gold standard
in computer vision applications in the past few years. Recently, however, new
model architectures have been proposed that challenge the status quo. The Vision
Transformer (ViT) relies solely on attention modules, while the MLP-Mixer
architecture substitutes the self-attention modules with Multi-Layer
Perceptrons (MLPs). Despite their great success, CNNs have been widely known to
be vulnerable to adversarial attacks, causing serious concerns for
security-sensitive applications. Thus, it is critical for the community to know
whether the newly proposed ViT and MLP-Mixer are also vulnerable to adversarial
attacks. To this end, we empirically evaluate their adversarial robustness
under several adversarial attack setups and benchmark them against the widely
used CNNs. Overall, we find that the two architectures, especially ViT, are
more robust than their CNN counterparts. Using a toy example, we also provide
empirical evidence that the lower adversarial robustness of CNNs can be
partially attributed to their shift-invariant property. Our frequency analysis
suggests that the most robust ViT architectures tend to rely more on
low-frequency features compared with CNNs. Additionally, we report the
intriguing finding that MLP-Mixer is extremely vulnerable to universal
adversarial perturbations.
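To make the evaluation protocol concrete, below is a minimal sketch of the kind of white-box robustness benchmark the abstract describes: an untargeted L-infinity PGD attack plus a clean-vs-robust accuracy comparison. This is an illustrative sketch, not the authors' exact setup; the timm model names, the epsilon budget, and the step counts are all assumptions.

```python
# Minimal sketch of an L-infinity PGD robustness benchmark (assumed setup,
# not the paper's exact protocol). Inputs are assumed to lie in [0, 1], with
# any normalization folded into the model.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=4/255, alpha=1/255, steps=10):
    """Untargeted L-inf PGD with a random start inside the eps-ball."""
    x = x.detach()
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()       # gradient-sign ascent step
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # project back onto the eps-ball
            x_adv = x_adv.clamp(0, 1)                 # stay in the valid pixel range
    return x_adv

@torch.no_grad()
def accuracy(model, x, y):
    return (model(x).argmax(dim=1) == y).float().mean().item()

def robustness_benchmark(model, loader, eps=4/255):
    """Return (clean accuracy, robust accuracy) over a data loader."""
    model.eval()
    clean = robust = n = 0
    for x, y in loader:
        x_adv = pgd_attack(model, x, y, eps=eps)
        clean += accuracy(model, x, y) * len(y)
        robust += accuracy(model, x_adv, y) * len(y)
        n += len(y)
    return clean / n, robust / n

# Hypothetical usage, comparing the three architecture families via timm
# (model names are assumptions for illustration):
# import timm
# for name in ["vit_base_patch16_224", "mixer_b16_224", "resnet50"]:
#     model = timm.create_model(name, pretrained=True)
#     print(name, robustness_benchmark(model, val_loader))
```

A simple probe for the abstract's low-frequency observation is sketched after the related-papers list below.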
Related papers
- Query-Efficient Hard-Label Black-Box Attack against Vision Transformers [9.086983253339069]
Vision transformers (ViTs) face security risks from adversarial attacks similar to those of deep convolutional neural networks (CNNs).
This article explores the vulnerability of ViTs against adversarial attacks under a black-box scenario.
We propose a novel query-efficient hard-label adversarial attack method called AdvViT.
arXiv Detail & Related papers (2024-06-29T10:09:12Z)
- Evaluating Adversarial Robustness in the Spatial Frequency Domain [13.200404022208858]
Convolutional Neural Networks (CNNs) have dominated the majority of computer vision tasks.
CNNs' vulnerability to adversarial attacks has raised concerns about deploying these models to safety-critical applications.
This paper presents an empirical study exploring the vulnerability of CNN models in the frequency domain; a minimal low-pass probe in this spirit is sketched after this list.
arXiv Detail & Related papers (2024-05-10T09:20:47Z)
- Robust Mixture-of-Expert Training for Convolutional Neural Networks [141.3531209949845]
Sparsely-gated Mixture-of-Experts (MoE) models have demonstrated great promise for enabling high-accuracy and ultra-efficient model inference.
We propose a new router-expert alternating adversarial training framework for MoE, termed AdvMoE.
We find that AdvMoE achieves a 1%-4% adversarial robustness improvement over the original dense CNN, and enjoys the efficiency merit of sparsity-gated MoE.
arXiv Detail & Related papers (2023-08-19T20:58:21Z)
- Self-Ensembling Vision Transformer (SEViT) for Robust Medical Image Classification [4.843654097048771]
Vision Transformers (ViT) are competing to replace Convolutional Neural Networks (CNN) for various computer vision tasks in medical imaging.
Recent works have shown that ViTs are also susceptible to adversarial attacks and suffer significant performance degradation under attack.
We propose a novel self-ensembling method to enhance the robustness of ViT in the presence of adversarial attacks.
arXiv Detail & Related papers (2022-08-04T19:02:24Z)
- An Impartial Take to the CNN vs Transformer Robustness Contest [89.97450887997925]
Recent state-of-the-art CNNs can be as robust and reliable as, and sometimes more so than, the current state-of-the-art Transformers.
Although it is tempting to declare the definitive superiority of one family of architectures over the other, both seem to achieve similarly strong performance across a variety of tasks.
arXiv Detail & Related papers (2022-07-22T21:34:37Z)
- Patch-Fool: Are Vision Transformers Always Robust Against Adversarial Perturbations? [21.32962679185015]
Vision transformers (ViTs) have recently set off a new wave in neural architecture design thanks to their record-breaking performance in vision tasks.
Recent works show that ViTs are more robust against adversarial attacks compared with convolutional neural networks (CNNs).
We propose a dedicated attack framework, dubbed Patch-Fool, that fools the self-attention mechanism by attacking its basic component.
arXiv Detail & Related papers (2022-03-16T04:45:59Z)
- Neural Architecture Dilation for Adversarial Robustness [56.18555072877193]
A shortcoming of convolutional neural networks is that they are vulnerable to adversarial attacks.
This paper aims to improve the adversarial robustness of backbone CNNs that already have satisfactory accuracy.
With minimal computational overhead, the dilated architecture is expected to preserve the standard performance of the backbone CNN.
arXiv Detail & Related papers (2021-08-16T03:58:00Z)
- On the Adversarial Robustness of Visual Transformers [129.29523847765952]
This work provides the first comprehensive study of the robustness of vision transformers (ViTs) against adversarial perturbations.
Testing various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness than convolutional neural networks (CNNs).
arXiv Detail & Related papers (2021-03-29T14:48:24Z)
- Extreme Value Preserving Networks [65.2037926048262]
Recent evidence shows that convolutional neural networks (CNNs) are biased towards textures, which makes them non-robust to adversarial perturbations over textures.
This paper aims to leverage good properties of SIFT to renovate CNN architectures towards better accuracy and robustness.
arXiv Detail & Related papers (2020-11-17T02:06:52Z)
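The frequency analysis mentioned in the main abstract above (and studied in the Spatial Frequency Domain paper in this list) can be approximated with a simple Fourier low-pass probe: filter out high spatial frequencies and check how much of the model's accuracy survives. This is a minimal sketch under assumptions, not either paper's exact method; the cutoff radii are arbitrary illustrative values, and it reuses the accuracy() helper from the PGD sketch above.

```python
# Minimal sketch of a low-pass frequency-reliance probe (illustrative
# assumptions: square images, inputs in [0, 1], arbitrary cutoff radii).
import torch

def low_pass(x, radius):
    """Keep only spatial frequencies within `radius` of the spectrum center."""
    f = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    h, w = x.shape[-2:]
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    mask = ((yy - h // 2) ** 2 + (xx - w // 2) ** 2) <= radius ** 2
    f = f * mask.to(x.device)                        # zero out high frequencies
    x_lp = torch.fft.ifft2(torch.fft.ifftshift(f, dim=(-2, -1))).real
    return x_lp.clamp(0, 1)

# Hypothetical probe: a model that relies mostly on low-frequency features
# should lose little accuracy as the cutoff radius shrinks.
# for radius in [112, 56, 28, 14]:
#     print(radius, accuracy(model, low_pass(x, radius), y))
```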
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences arising from its use.