Can CNNs Be More Robust Than Transformers?
- URL: http://arxiv.org/abs/2206.03452v1
- Date: Tue, 7 Jun 2022 17:17:07 GMT
- Title: Can CNNs Be More Robust Than Transformers?
- Authors: Zeyu Wang, Yutong Bai, Yuyin Zhou, Cihang Xie
- Abstract summary: The recent success of Vision Transformers is shaking the decade-long dominance of Convolutional Neural Networks (CNNs) in image recognition.
Recent research finds that Transformers are inherently more robust than CNNs, regardless of the training setup.
It is believed that this superiority of Transformers should largely be credited to their self-attention-like architectures per se.
- Score: 29.615791409258804
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent success of Vision Transformers is shaking the decade-long
dominance of Convolutional Neural Networks (CNNs) in image recognition.
Specifically, in terms of robustness on out-of-distribution samples, recent
research finds that Transformers are inherently more robust than CNNs,
regardless of different training setups. Moreover, it is believed that such
superiority of Transformers should largely be credited to their
self-attention-like architectures per se. In this paper, we question that
belief by closely examining the design of Transformers. Our findings lead to
three highly effective architecture designs for boosting robustness, yet simple
enough to be implemented in several lines of code, namely a) patchifying input
images, b) enlarging kernel size, and c) reducing activation layers and
normalization layers. Bringing these components together, we are able to build
pure CNN architectures without any attention-like operations that are as robust
as, or even more robust than, Transformers. We hope this work can help the
community better understand the design of robust neural architectures. The code
is publicly available at https://github.com/UCSC-VLAA/RobustCNN.
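The sketch below is a minimal PyTorch illustration, not the authors' implementation, of the three designs listed above: a patchified stem (a), a large depthwise kernel (b), and a block that keeps only one normalization layer and one activation (c). Channel widths, kernel sizes, and the block layout are illustrative assumptions; the official architectures live in the linked repository.

```python
# Minimal sketch of the three robustness-oriented designs (illustrative only,
# not the paper's exact code).
import torch
import torch.nn as nn


class PatchifyStem(nn.Module):
    """(a) Patchify the input: a non-overlapping p x p strided convolution."""

    def __init__(self, in_chans=3, dim=96, patch_size=8):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        return self.proj(x)


class LargeKernelBlock(nn.Module):
    """(b) Enlarged depthwise kernel; (c) a single norm and a single activation."""

    def __init__(self, dim=96, kernel_size=11):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.norm = nn.BatchNorm2d(dim)   # the only normalization layer in the block
        self.pw1 = nn.Conv2d(dim, 4 * dim, kernel_size=1)
        self.act = nn.GELU()              # the only activation in the block
        self.pw2 = nn.Conv2d(4 * dim, dim, kernel_size=1)

    def forward(self, x):
        return x + self.pw2(self.act(self.pw1(self.norm(self.dwconv(x)))))


if __name__ == "__main__":
    net = nn.Sequential(PatchifyStem(), *[LargeKernelBlock() for _ in range(4)])
    print(net(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 96, 28, 28])
```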
Related papers
- OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation [70.17681136234202]
We reexamine the design distinctions and test the limits of what a sparse CNN can achieve.
We propose two key components, i.e., adaptive receptive fields (spatially) and adaptive relation, to bridge the gap.
This exploration led to the creation of Omni-Adaptive 3D CNNs (OA-CNNs), a family of networks that integrates a lightweight module.
arXiv Detail & Related papers (2024-03-21T14:06:38Z) - The Counterattack of CNNs in Self-Supervised Learning: Larger Kernel
Size might be All You Need [103.31261028244782]
Vision Transformers have been rising rapidly in computer vision thanks to their outstanding scaling behavior, and are gradually replacing convolutional neural networks (CNNs).
Recent works on self-supervised learning (SSL) introduce siamese pre-training tasks.
People have come to believe that Transformers or self-attention modules are inherently more suitable than CNNs in the context of SSL.
arXiv Detail & Related papers (2023-12-09T22:23:57Z) - Interpret Vision Transformers as ConvNets with Dynamic Convolutions [70.59235381143831]
We interpret Vision Transformers as ConvNets with dynamic convolutions, which enables us to characterize existing Transformers and dynamic ConvNets in a unified framework (a toy sketch of this view appears at the end of this list).
Our interpretation can also guide the network design as researchers now can consider vision Transformers from the design space of ConvNets.
arXiv Detail & Related papers (2023-09-19T16:00:49Z) - An Impartial Take to the CNN vs Transformer Robustness Contest [89.97450887997925]
Recent state-of-the-art CNNs can be as robust and reliable as, and sometimes even more so than, the current state-of-the-art Transformers.
Although it is tempting to declare the definitive superiority of one family of architectures over the other, they seem to enjoy similarly extraordinary performance on a variety of tasks.
arXiv Detail & Related papers (2022-07-22T21:34:37Z) - Semi-Supervised Vision Transformers [76.83020291497895]
We study the training of Vision Transformers for semi-supervised image classification.
We find that Vision Transformers perform poorly in a semi-supervised ImageNet setting.
CNNs achieve superior results in the small labeled-data regime.
arXiv Detail & Related papers (2021-11-22T09:28:13Z) - Are Transformers More Robust Than CNNs? [17.47001041042089]
We provide the first fair & in-depth comparisons between Transformers and CNNs.
CNNs can easily be as robust as Transformers in defending against adversarial attacks.
Our ablations suggest that this stronger generalization largely benefits from the Transformer's self-attention-like architecture.
arXiv Detail & Related papers (2021-11-10T00:18:59Z) - Investigating Transfer Learning Capabilities of Vision Transformers and
CNNs by Fine-Tuning a Single Trainable Block [0.0]
Transformer-based architectures are surpassing the state of the art set by CNN architectures in accuracy, but are computationally very expensive to train from scratch.
We study their transfer learning capabilities and compare them with CNNs to understand which architecture is better when applied to real-world problems with small data.
We find that Transformer-based architectures not only achieve higher accuracy than CNNs, but some do so with roughly 4 times fewer parameters.
arXiv Detail & Related papers (2021-10-11T13:43:03Z) - Going deeper with Image Transformers [102.61950708108022]
We build and optimize deeper transformer networks for image classification.
We make two changes to the Transformer architecture that significantly improve the accuracy of deep Transformers.
Our best model establishes a new state of the art on ImageNet with Reassessed Labels and on ImageNet-V2 (matched frequency).
arXiv Detail & Related papers (2021-03-31T17:37:32Z) - On the Robustness of Vision Transformers to Adversarial Examples [7.627299398469961]
We study the robustness of Vision Transformers to adversarial examples.
We show that adversarial examples do not readily transfer between CNNs and transformers.
Under a black-box adversary, we show that an ensemble can achieve unprecedented robustness without sacrificing clean accuracy.
arXiv Detail & Related papers (2021-03-31T00:29:12Z) - Incorporating Convolution Designs into Visual Transformers [24.562955955312187]
We propose a new Convolution-enhanced image Transformer (CeiT), which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.
Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data or extra CNN teachers.
arXiv Detail & Related papers (2021-03-22T13:16:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.