Can CNNs Be More Robust Than Transformers?
- URL: http://arxiv.org/abs/2206.03452v1
- Date: Tue, 7 Jun 2022 17:17:07 GMT
- Title: Can CNNs Be More Robust Than Transformers?
- Authors: Zeyu Wang, Yutong Bai, Yuyin Zhou, Cihang Xie
- Abstract summary: The recent success of Vision Transformers is shaking the decade-long dominance of Convolutional Neural Networks (CNNs) in image recognition.
Recent research finds that Transformers are inherently more robust than CNNs, regardless of the training setup.
It is believed that this superiority of Transformers should largely be credited to their self-attention-like architectures per se.
- Score: 29.615791409258804
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent success of Vision Transformers is shaking the decade-long
dominance of Convolutional Neural Networks (CNNs) in image recognition.
Specifically, in terms of robustness on out-of-distribution samples, recent
research finds that Transformers are inherently more robust than CNNs,
regardless of different training setups. Moreover, it is believed that such
superiority of Transformers should largely be credited to their
self-attention-like architectures per se. In this paper, we question that
belief by closely examining the design of Transformers. Our findings lead to
three highly effective architecture designs for boosting robustness, yet simple
enough to be implemented in several lines of code, namely a) patchifying input
images, b) enlarging kernel size, and c) reducing activation layers and
normalization layers. Bringing these components together, we are able to build
pure CNN architectures without any attention-like operations that are as robust
as, or even more robust than, Transformers. We hope this work can help the
community better understand the design of robust neural architectures. The code
is publicly available at https://github.com/UCSC-VLAA/RobustCNN.
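The sketch below is a minimal PyTorch illustration, not the authors' implementation, of the three designs listed above: a patchified stem (a), a large depthwise kernel (b), and a block that keeps only one normalization layer and one activation (c). Channel widths, kernel sizes, and the block layout are illustrative assumptions; the official architectures live in the linked repository.

```python
# Minimal sketch of the three robustness-oriented designs (illustrative only,
# not the paper's exact code).
import torch
import torch.nn as nn


class PatchifyStem(nn.Module):
    """(a) Patchify the input: a non-overlapping p x p strided convolution."""

    def __init__(self, in_chans=3, dim=96, patch_size=8):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        return self.proj(x)


class LargeKernelBlock(nn.Module):
    """(b) Enlarged depthwise kernel; (c) a single norm and a single activation."""

    def __init__(self, dim=96, kernel_size=11):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.norm = nn.BatchNorm2d(dim)   # the only normalization layer in the block
        self.pw1 = nn.Conv2d(dim, 4 * dim, kernel_size=1)
        self.act = nn.GELU()              # the only activation in the block
        self.pw2 = nn.Conv2d(4 * dim, dim, kernel_size=1)

    def forward(self, x):
        return x + self.pw2(self.act(self.pw1(self.norm(self.dwconv(x)))))


if __name__ == "__main__":
    net = nn.Sequential(PatchifyStem(), *[LargeKernelBlock() for _ in range(4)])
    print(net(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 96, 28, 28])
```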
Related papers
- OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation [70.17681136234202]
We reexamine the design distinctions and test the limits of what a sparse CNN can achieve.
We propose two key components, i.e., adaptive receptive fields (spatially) and adaptive relation, to bridge the gap.
This exploration led to the creation of Omni-Adaptive 3D CNNs (OA-CNNs), a family of networks that integrates a lightweight module.
arXiv Detail & Related papers (2024-03-21T14:06:38Z) - The Counterattack of CNNs in Self-Supervised Learning: Larger Kernel
Size might be All You Need [103.31261028244782]
Vision Transformers have been rising rapidly in computer vision thanks to their outstanding scaling behavior, and are gradually replacing convolutional neural networks (CNNs).
Recent works on self-supervised learning (SSL) introduce siamese pre-training tasks.
People have come to believe that Transformers or self-attention modules are inherently more suitable than CNNs in the context of SSL.
arXiv Detail & Related papers (2023-12-09T22:23:57Z) - Interpret Vision Transformers as ConvNets with Dynamic Convolutions [70.59235381143831]
We interpret Vision Transformers as ConvNets with dynamic convolutions, which enables us to characterize existing Transformers and dynamic ConvNets in a unified framework (a toy sketch of this view appears at the end of this list).
Our interpretation can also guide the network design as researchers now can consider vision Transformers from the design space of ConvNets.
arXiv Detail & Related papers (2023-09-19T16:00:49Z) - An Impartial Take to the CNN vs Transformer Robustness Contest [89.97450887997925]
Recent state-of-the-art CNNs can be as robust and reliable as, and sometimes even more so than, the current state-of-the-art Transformers.
Although it is tempting to declare the definitive superiority of one family of architectures over the other, they seem to enjoy similarly extraordinary performance on a variety of tasks.
arXiv Detail & Related papers (2022-07-22T21:34:37Z) - Semi-Supervised Vision Transformers [76.83020291497895]
We study the training of Vision Transformers for semi-supervised image classification.
We find that Vision Transformers perform poorly in a semi-supervised ImageNet setting.
CNNs achieve superior results in the small labeled-data regime.
arXiv Detail & Related papers (2021-11-22T09:28:13Z) - Are Transformers More Robust Than CNNs? [17.47001041042089]
We provide the first fair & in-depth comparisons between Transformers and CNNs.
CNNs can easily be as robust as Transformers in defending against adversarial attacks.
Our ablations suggest that this stronger generalization largely benefits from the Transformer's self-attention-like architecture.
arXiv Detail & Related papers (2021-11-10T00:18:59Z) - Investigating Transfer Learning Capabilities of Vision Transformers and
CNNs by Fine-Tuning a Single Trainable Block [0.0]
Transformer-based architectures are surpassing the state of the art set by CNN architectures in accuracy, but are computationally very expensive to train from scratch.
We study their transfer learning capabilities and compare them with CNNs to understand which architecture is better when applied to real-world problems with small data.
We find that Transformer-based architectures not only achieve higher accuracy than CNNs, but some do so with roughly 4 times fewer parameters.
arXiv Detail & Related papers (2021-10-11T13:43:03Z) - Going deeper with Image Transformers [102.61950708108022]
We build and optimize deeper transformer networks for image classification.
We make two changes to the Transformer architecture that significantly improve the accuracy of deep Transformers.
Our best model establishes a new state of the art on ImageNet with Reassessed Labels and on ImageNet-V2 (matched frequency).
arXiv Detail & Related papers (2021-03-31T17:37:32Z) - On the Robustness of Vision Transformers to Adversarial Examples [7.627299398469961]
We study the robustness of Vision Transformers to adversarial examples.
We show that adversarial examples do not readily transfer between CNNs and transformers.
Under a black-box adversary, we show that an ensemble can achieve unprecedented robustness without sacrificing clean accuracy.
arXiv Detail & Related papers (2021-03-31T00:29:12Z) - Incorporating Convolution Designs into Visual Transformers [24.562955955312187]
We propose a new Convolution-enhanced image Transformer (CeiT), which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.
Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data or extra CNN teachers.
arXiv Detail & Related papers (2021-03-22T13:16:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.