Intriguing Properties of Vision Transformers
- URL: http://arxiv.org/abs/2105.10497v1
- Date: Fri, 21 May 2021 17:59:18 GMT
- Title: Intriguing Properties of Vision Transformers
- Authors: Muzammal Naseer, Kanchana Ranasinghe, Salman Khan, Munawar Hayat,
Fahad Shahbaz Khan, Ming-Hsuan Yang
- Abstract summary: Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study this question via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to the flexible and dynamic receptive fields made possible by the self-attention mechanism.
- Score: 114.28522466830374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers (ViT) have demonstrated impressive performance across
various machine vision problems. These models are based on multi-head
self-attention mechanisms that can flexibly attend to a sequence of image
patches to encode contextual cues. An important question is how such
flexibility in attending image-wide context conditioned on a given patch can
facilitate handling nuisances in natural images, e.g., severe occlusions, domain
shifts, spatial permutations, and adversarial and natural perturbations. We
systematically study this question via an extensive set of experiments
encompassing three ViT families and comparisons with a high-performing
convolutional neural network (CNN). We show and analyze the following
intriguing properties of ViT: (a) Transformers are highly robust to severe
occlusions, perturbations and domain shifts, e.g., retain as high as 60% top-1
accuracy on ImageNet even after randomly occluding 80% of the image content.
(b) The robust performance to occlusions is not due to a bias towards local
textures, and ViTs are significantly less biased towards textures compared to
CNNs. When properly trained to encode shape-based features, ViTs demonstrate
shape recognition capability comparable to that of the human visual system,
previously unmatched in the literature. (c) Using ViTs to encode shape
representation leads to an interesting consequence of accurate semantic
segmentation without pixel-level supervision. (d) Off-the-shelf features from a
single ViT model can be combined to create a feature ensemble, leading to high
accuracy rates across a range of classification datasets in both traditional
and few-shot learning paradigms. We show that the effective features of ViTs are
due to flexible and dynamic receptive fields made possible by the self-attention
mechanism.
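The occlusion claim in (a) can be probed with a few lines of code. Below is a minimal, hypothetical sketch (not the authors' released code), in the spirit of the paper's random-occlusion (PatchDrop-style) experiments: zero out roughly 80% of the 16x16 patches of each image and compare predictions against the clean input. It assumes `torch` and `timm` are installed and that pretrained weights can be downloaded; the model name and drop ratio are illustrative choices.

```python
import torch
import timm


def occlude_random_patches(images: torch.Tensor, patch_size: int = 16,
                           drop_ratio: float = 0.8) -> torch.Tensor:
    """Zero out a random subset of non-overlapping patches in each image."""
    b, _, h, w = images.shape
    gh, gw = h // patch_size, w // patch_size
    num_drop = int(drop_ratio * gh * gw)
    occluded = images.clone()
    for i in range(b):
        for idx in torch.randperm(gh * gw)[:num_drop].tolist():
            r, c = divmod(idx, gw)
            occluded[i, :, r * patch_size:(r + 1) * patch_size,
                     c * patch_size:(c + 1) * patch_size] = 0.0
    return occluded


model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
images = torch.randn(4, 3, 224, 224)  # stand-in for a preprocessed ImageNet batch
with torch.no_grad():
    preds_clean = model(images).argmax(dim=-1)
    preds_occluded = model(occlude_random_patches(images)).argmax(dim=-1)
print("agreement under 80% occlusion:",
      (preds_clean == preds_occluded).float().mean().item())
```

Similarly, the "feature ensemble" idea in (d) amounts to pooling off-the-shelf features from different depths of a single ViT. The sketch below is a hypothetical illustration, not the paper's exact recipe: it collects the class token from several blocks of a pretrained ViT via forward hooks and concatenates them into one descriptor that a downstream (e.g. few-shot) classifier could consume. The block indices and classifier choice are assumptions.

```python
import torch
import timm

model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
collected = {}


def make_hook(name):
    def hook(module, inputs, output):
        collected[name] = output[:, 0]  # class token of this block's output (B, D)
    return hook


block_ids = (3, 7, 11)  # a few intermediate blocks plus the final one (assumed)
for idx in block_ids:
    model.blocks[idx].register_forward_hook(make_hook(idx))

images = torch.randn(2, 3, 224, 224)  # stand-in for a preprocessed batch
with torch.no_grad():
    model(images)
ensemble = torch.cat([collected[i] for i in block_ids], dim=-1)
print(ensemble.shape)  # torch.Size([2, 3 * 768]) for vit_base (embed_dim = 768)
```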
Related papers
- A Comparative Survey of Vision Transformers for Feature Extraction in Texture Analysis [9.687982148528187]
Convolutional Neural Networks (CNNs) are currently among the best texture analysis approaches.
Vision Transformers (ViTs) have been surpassing the performance of CNNs on tasks such as object recognition.
This work explores various pre-trained ViT architectures when transferred to tasks that rely on textures.
arXiv Detail & Related papers (2024-06-10T09:48:13Z) - DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention-based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit, dedicated to covering the variation in the optimal number of tokens each position should attend to.
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z) - ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for
Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z) - A Comprehensive Study of Vision Transformers on Dense Prediction Tasks [10.013443811899466]
Convolutional Neural Networks (CNNs) have been the standard choice in vision tasks.
Recent studies have shown that Vision Transformers (VTs) achieve comparable performance in challenging tasks such as object detection and semantic segmentation.
This poses several questions about their generalizability, robustness, reliability, and texture bias when used to extract features for complex tasks.
arXiv Detail & Related papers (2022-01-21T13:18:16Z) - Do Vision Transformers See Like Convolutional Neural Networks? [45.69780772718875]
Recent work has shown that Vision Transformer (ViT) models can achieve comparable or even superior performance to CNNs on image classification tasks.
Are they acting like convolutional networks, or learning entirely different visual representations?
We find striking differences between the two architectures, such as ViT having more uniform representations across all layers.
arXiv Detail & Related papers (2021-08-19T17:27:03Z) - ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [76.16156833138038]
We propose a novel Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
In each transformer layer, ViTAE has a convolution block in parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network.
arXiv Detail & Related papers (2021-06-07T05:31:06Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z) - Diverse Image Inpainting with Bidirectional and Autoregressive
Transformers [55.21000775547243]
We propose BAT-Fill, an image inpainting framework with a novel bidirectional autoregressive transformer (BAT).
BAT-Fill inherits the merits of transformers and CNNs in a two-stage manner, which allows it to generate high-resolution content without being constrained by the quadratic complexity of attention in transformers.
arXiv Detail & Related papers (2021-04-26T03:52:27Z)