Are Transformers More Robust Than CNNs?
- URL: http://arxiv.org/abs/2111.05464v1
- Date: Wed, 10 Nov 2021 00:18:59 GMT
- Title: Are Transformers More Robust Than CNNs?
- Authors: Yutong Bai, Jieru Mei, Alan Yuille, Cihang Xie
- Abstract summary: We provide the first fair & in-depth comparisons between Transformers and CNNs.
CNNs can easily be as robust as Transformers at defending against adversarial attacks.
Our ablations suggest such stronger generalization largely stems from Transformers' self-attention-like architectures.
- Score: 17.47001041042089
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have emerged as a powerful tool for visual recognition. In
addition to demonstrating competitive performance on a broad range of visual
benchmarks, recent works also argue that Transformers are much more robust
than Convolutional Neural Networks (CNNs). Surprisingly, however, we find
these conclusions are drawn from unfair experimental settings, where
Transformers and CNNs are compared at different scales and trained under
distinct frameworks. In this paper, we aim to provide the first fair &
in-depth comparisons between Transformers and CNNs, focusing on robustness
evaluations. With our unified training setup, we first challenge the previous
belief that Transformers outshine CNNs when measuring adversarial robustness.
More surprisingly, we find CNNs can easily be as robust as Transformers at
defending against adversarial attacks, provided they properly adopt
Transformers' training recipes. Regarding generalization on
out-of-distribution samples, we show that pre-training on (external)
large-scale datasets is not a fundamental requirement for Transformers to
achieve better performance than CNNs. Moreover, our ablations suggest that
such stronger generalization largely stems from Transformers'
self-attention-like architectures per se, rather than from other training
setups. We hope this work can help the community better understand and
benchmark the robustness of Transformers and CNNs. The code and models are
publicly available at https://github.com/ytongbai/ViTs-vs-CNNs.
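As a concrete illustration of the unified evaluation the abstract calls for, here is a minimal sketch (assuming PyTorch and timm are available; this is not the authors' released code, which lives at the GitHub link above) of running one and the same PGD attack against a CNN and a Transformer of comparable scale. The model names, epsilon budget, and step count are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a "unified" adversarial-robustness comparison:
# the identical L-inf PGD attack is applied to a CNN (ResNet-50) and a
# Transformer (DeiT-S) of roughly comparable parameter count.
# Hyperparameters below are illustrative, not the paper's exact setup.
import torch
import timm

def pgd_attack(model, images, labels, eps=4/255, alpha=1/255, steps=10):
    """Untargeted L-inf PGD: ascend the loss, projecting back into the eps-ball."""
    adv = images.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = torch.nn.functional.cross_entropy(model(adv), labels)
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + alpha * grad.sign()
        adv = images + torch.clamp(adv - images, -eps, eps)  # project into eps-ball
        adv = torch.clamp(adv, 0.0, 1.0)                     # stay in valid pixel range
    return adv.detach()

@torch.no_grad()
def accuracy(model, images, labels):
    return (model(images).argmax(dim=1) == labels).float().mean().item()

models = {
    "resnet50": timm.create_model("resnet50", pretrained=True).eval(),
    "deit_small": timm.create_model("deit_small_patch16_224", pretrained=True).eval(),
}
# Random tensors stand in for a real (properly normalized) ImageNet batch.
images = torch.rand(8, 3, 224, 224)
labels = torch.randint(0, 1000, (8,))
for name, model in models.items():
    adv = pgd_attack(model, images, labels)
    print(f"{name}: clean={accuracy(model, images, labels):.3f} "
          f"robust={accuracy(model, adv, labels):.3f}")
```

The methodological point sits in the loop: both model families face the same attack, data, and preprocessing, so any remaining robustness gap can be attributed to architecture and training recipe rather than to the evaluation setup.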
Related papers
- The Counterattack of CNNs in Self-Supervised Learning: Larger Kernel Size might be All You Need [103.31261028244782]
Vision Transformers have been rising rapidly in computer vision thanks to their outstanding scaling trends, and are gradually replacing convolutional neural networks (CNNs).
Recent works on self-supervised learning (SSL) introduce siamese pre-training tasks.
This has led many to believe that Transformers, or self-attention modules, are inherently more suitable than CNNs in the context of SSL.
arXiv Detail & Related papers (2023-12-09T22:23:57Z) - On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition [18.557920268145818]
Video vision transformers have been shown to be competitive with convolution-based methods (CNNs) broadly across multiple vision tasks.
Our work empirically explores the low data regime for video classification and discovers that, surprisingly, transformers perform extremely well in the low-labeled video setting.
We even show that transformers using only the labeled data significantly outperform complex semi-supervised CNN methods that additionally leverage large-scale unlabeled data.
arXiv Detail & Related papers (2022-09-15T17:12:30Z) - An Impartial Take to the CNN vs Transformer Robustness Contest [89.97450887997925]
Recent state-of-the-art CNNs can be as robust and reliable as, and sometimes even more so than, the current state-of-the-art Transformers.
Although it is tempting to declare the definitive superiority of one family of architectures over the other, both seem to enjoy similarly extraordinary performance on a variety of tasks.
arXiv Detail & Related papers (2022-07-22T21:34:37Z) - Can CNNs Be More Robust Than Transformers? [29.615791409258804]
Vision Transformers are shaking the decade-long dominance of Convolutional Neural Networks (CNNs) in image recognition.
Recent research finds that Transformers are inherently more robust than CNNs, regardless of different training setups.
It is believed that such superiority of Transformers should largely be credited to their self-attention-like architectures per se.
arXiv Detail & Related papers (2022-06-07T17:17:07Z) - Semi-Supervised Vision Transformers [76.83020291497895]
We study the training of Vision Transformers for semi-supervised image classification.
We find Vision Transformers perform poorly in a semi-supervised ImageNet setting.
CNNs achieve superior results in the small-labeled-data regime.
arXiv Detail & Related papers (2021-11-22T09:28:13Z) - Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block [0.0]
Transformer-based architectures are surpassing the state of the art set by CNN architectures in accuracy, but are computationally very expensive to train from scratch.
We study their transfer learning capabilities and compare them with CNNs to understand which architecture is better when applied to real-world problems with small data.
We find that transformer-based architectures not only achieve higher accuracy than CNNs, but some do so with around 4 times fewer parameters.
arXiv Detail & Related papers (2021-10-11T13:43:03Z) - Going deeper with Image Transformers [102.61950708108022]
We build and optimize deeper transformer networks for image classification.
We make two changes to the transformer architecture that significantly improve the accuracy of deep transformers.
Our best model establishes the new state of the art on ImageNet with Reassessed Labels and ImageNet-V2 / matched frequency.
arXiv Detail & Related papers (2021-03-31T17:37:32Z) - On the Robustness of Vision Transformers to Adversarial Examples [7.627299398469961]
We study the robustness of Vision Transformers to adversarial examples.
We show that adversarial examples do not readily transfer between CNNs and transformers.
Under a black-box adversary, we show that an ensemble can achieve unprecedented robustness without sacrificing clean accuracy.
arXiv Detail & Related papers (2021-03-31T00:29:12Z) - On the Adversarial Robustness of Visual Transformers [129.29523847765952]
This work provides the first and comprehensive study on the robustness of vision transformers (ViTs) against adversarial perturbations.
Tested on various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness when compared with convolutional neural networks (CNNs).
arXiv Detail & Related papers (2021-03-29T14:48:24Z) - Face Transformer for Recognition [67.02323570055894]
We investigate the performance of Transformer models in face recognition.
The models are trained on a large scale face recognition database MS-Celeb-1M.
We demonstrate that Transformer models achieve comparable performance to CNNs with a similar number of parameters and MACs.
arXiv Detail & Related papers (2021-03-27T03:53:29Z)