Understanding Robustness of Transformers for Image Classification
- URL: http://arxiv.org/abs/2103.14586v1
- Date: Fri, 26 Mar 2021 16:47:55 GMT
- Title: Understanding Robustness of Transformers for Image Classification
- Authors: Srinadh Bhojanapalli, Ayan Chakrabarti, Daniel Glasner, Daliang Li,
Thomas Unterthiner, Andreas Veit
- Abstract summary: Transformer-based architectures such as the Vision Transformer (ViT) have matched or even surpassed ResNets for image classification.
Details of the Transformer architecture, such as its use of non-overlapping patches, lead one to wonder whether these networks are as robust as their convolutional counterparts.
We find that, when pre-trained with sufficient data, ViT models are at least as robust as their ResNet counterparts on a broad range of perturbations.
- Score: 34.51672491103555
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep Convolutional Neural Networks (CNNs) have long been the architecture of
choice for computer vision tasks. Recently, Transformer-based architectures
like Vision Transformer (ViT) have matched or even surpassed ResNets for image
classification. However, details of the Transformer architecture -- such as the
use of non-overlapping patches -- lead one to wonder whether these networks are
as robust. In this paper, we perform an extensive study of a variety of
different measures of robustness of ViT models and compare the findings to
ResNet baselines. We investigate robustness to input perturbations as well as
robustness to model perturbations. We find that when pre-trained with a
sufficient amount of data, ViT models are at least as robust as the ResNet
counterparts on a broad range of perturbations. We also find that Transformers
are robust to the removal of almost any single layer, and that while
activations from later layers are highly correlated with each other, they
nevertheless play an important role in classification.
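The model-perturbation experiment described in the abstract (removing a single layer and checking how much accuracy drops) can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch example rather than the authors' code: the toy model TinyViT, the helper lesion_accuracies, and all sizes and module choices are assumptions made for illustration. It builds a ViT-style classifier from non-overlapping patch embeddings and a stack of encoder blocks, then evaluates accuracy with each block skipped in turn.

```python
# Minimal sketch (not the authors' code) of the two ingredients described above:
# non-overlapping patch embedding and single-layer removal ("lesion") evaluation.
# All dimensions and module choices are illustrative assumptions.
import torch
import torch.nn as nn


class TinyViT(nn.Module):
    """Toy ViT-style classifier: non-overlapping patches + a stack of encoder blocks."""

    def __init__(self, image_size=32, patch_size=8, dim=64, depth=6, heads=4, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Non-overlapping patches: a conv with kernel_size == stride == patch_size.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=4 * dim,
                                       batch_first=True, norm_first=True)
            for _ in range(depth)
        )
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x, skip_block=None):
        # x: (B, 3, H, W) -> patch tokens of shape (B, N, dim)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)
        tokens = torch.cat([self.cls_token.expand(x.size(0), -1, -1), tokens], dim=1)
        tokens = tokens + self.pos_embed
        for i, block in enumerate(self.blocks):
            if i == skip_block:  # model perturbation: drop a single layer at test time
                continue
            tokens = block(tokens)
        return self.head(tokens[:, 0])  # classify from the [CLS] token


@torch.no_grad()
def lesion_accuracies(model, loader, device="cpu"):
    """Accuracy with each encoder block removed in turn (key None = intact model)."""
    results = {}
    for skip in [None] + list(range(len(model.blocks))):
        correct = total = 0
        for images, labels in loader:
            logits = model(images.to(device), skip_block=skip)
            correct += (logits.argmax(dim=-1).cpu() == labels).sum().item()
            total += labels.size(0)
        results[skip] = correct / total
    return results
```

Comparing results[None] (the intact model) with results[i] for each block index i gives a rough picture of how much any single layer contributes, which is the spirit of the layer-removal study summarized above.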
Related papers
- Investigating the Robustness and Properties of Detection Transformers
(DETR) Toward Difficult Images [1.5727605363545245]
Transformer-based object detectors such as DETR have shown strong performance across machine vision tasks.
The critical question is how well this architecture handles different image nuisances.
We study this question by measuring the performance of DETR in different experiments and benchmarking the network.
arXiv Detail & Related papers (2023-10-12T23:38:52Z)
- Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network found by VTCAS introduces desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
arXiv Detail & Related papers (2022-03-20T02:59:51Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers built on self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
- A Comprehensive Study of Vision Transformers on Dense Prediction Tasks [10.013443811899466]
Convolutional Neural Networks (CNNs) have been the standard choice in vision tasks.
Recent studies have shown that Vision Transformers (VTs) achieve comparable performance in challenging tasks such as object detection and semantic segmentation.
This poses several questions about their generalizability, robustness, reliability, and texture bias when used to extract features for complex tasks.
arXiv Detail & Related papers (2022-01-21T13:18:16Z)
- ConvNets vs. Transformers: Whose Visual Representations are More Transferable? [49.62201738334348]
We investigate the transfer learning ability of ConvNets and vision transformers in 15 single-task and multi-task performance evaluations.
We observe consistent advantages of Transformer-based backbones on 13 downstream tasks.
arXiv Detail & Related papers (2021-08-11T16:20:38Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds on the observation that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences; an illustrative sketch of this correspondence appears below.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
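As a rough illustration of this claim (the notation below is assumed, not taken from the LIT paper), each operation produces an output token as a weighted combination of projected patch embeddings, and the three differ mainly in how the mixing weights are chosen:

```latex
% Illustrative sketch only; notation is assumed rather than quoted from the paper.
\begin{align}
  y_i &= \sum_{j} \operatorname{softmax}_j\!\left(\frac{(W_Q x_i)^\top (W_K x_j)}{\sqrt{d}}\right) W_V x_j
      && \text{self-attention: input-dependent weights over all patches} \\
  y_i &= \sum_{j \in \mathcal{N}(i)} W_{j-i}\, x_j
      && \text{convolution: fixed weights over a local neighborhood} \\
  y_i &= W x_i
      && \text{token-wise fully-connected layer: each patch mixes only with itself}
\end{align}
```

Under this reading, swapping one mixing mechanism for another in parts of the network changes how the weights are computed, not the overall form of the computation.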
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long-range dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
- Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.