Do Vision Transformers See Like Convolutional Neural Networks?
- URL: http://arxiv.org/abs/2108.08810v1
- Date: Thu, 19 Aug 2021 17:27:03 GMT
- Title: Do Vision Transformers See Like Convolutional Neural Networks?
- Authors: Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang,
Alexey Dosovitskiy
- Abstract summary: Recent work has shown that (Vision) Transformer models (ViT) can achieve performance comparable or even superior to CNNs on image classification tasks.
Are they acting like convolutional networks, or learning entirely different visual representations?
We find striking differences between the two architectures, such as ViT having more uniform representations across all layers.
- Score: 45.69780772718875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolutional neural networks (CNNs) have so far been the de-facto model for
visual data. Recent work has shown that (Vision) Transformer models (ViT) can
achieve comparable or even superior performance on image classification tasks.
This raises a central question: how are Vision Transformers solving these
tasks? Are they acting like convolutional networks, or learning entirely
different visual representations? Analyzing the internal representation
structure of ViTs and CNNs on image classification benchmarks, we find striking
differences between the two architectures, such as ViT having more uniform
representations across all layers. We explore how these differences arise,
finding crucial roles played by self-attention, which enables early aggregation
of global information, and ViT residual connections, which strongly propagate
features from lower to higher layers. We study the ramifications for spatial
localization, demonstrating ViTs successfully preserve input spatial
information, with noticeable effects from different classification methods.
Finally, we study the effect of (pretraining) dataset scale on intermediate
features and transfer learning, and conclude with a discussion on connections
to new architectures such as the MLP-Mixer.
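The representation analysis above compares layer activations of ViTs and CNNs; the paper relies on centered kernel alignment (CKA) for these comparisons. As a rough illustration, the sketch below computes linear CKA between two activation matrices; the shapes, layer roles, and random data are illustrative assumptions, not the paper's actual models or experiments.

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_examples, n_features).

    The matrices must cover the same examples (same first dimension) but may
    have different feature dimensionalities.
    """
    # Center each feature over the batch.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(x.T @ y) ** 2
    norm = np.linalg.norm(x.T @ x) * np.linalg.norm(y.T @ y)
    return float(cross / norm)

# Illustrative usage on random "activations" for the same 512 inputs
# (hypothetical widths; real usage would pool/flatten actual layer outputs).
rng = np.random.default_rng(0)
vit_block = rng.normal(size=(512, 768))
resnet_stage = rng.normal(size=(512, 1024))
print(linear_cka(vit_block, resnet_stage))
```

Computing this score for every pair of layers in the two models yields the layer-by-layer similarity maps behind findings such as the more uniform representation structure of ViTs.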
Related papers
- DuoFormer: Leveraging Hierarchical Visual Representations by Local and Global Attention [1.5624421399300303]
We propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs).
Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations.
These representations are then adapted for transformer input through an innovative patch tokenization.
arXiv Detail & Related papers (2024-07-18T22:15:35Z)
- What do Vision Transformers Learn? A Visual Exploration [68.50771218442776]
Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin.
arXiv Detail & Related papers (2022-12-13T16:55:12Z)
- Vision Transformers provably learn spatial structure [34.61885883486938]
Vision Transformers (ViTs) have achieved performance comparable or superior to Convolutional Neural Networks (CNNs) in computer vision.
Although ViTs do not embed CNNs' inductive bias of spatial locality, recent works have shown that, while minimizing their training loss, ViTs specifically learn spatially localized patterns.
arXiv Detail & Related papers (2022-10-13T19:53:56Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global-attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- Vision Transformer for Contrastive Clustering [48.476602271481674]
Vision Transformer (ViT) has shown its advantages over the convolutional neural network (CNN).
This paper presents an end-to-end deep image clustering approach termed Vision Transformer for Contrastive Clustering (VTCC).
arXiv Detail & Related papers (2022-06-26T17:00:35Z)
- Can Vision Transformers Perform Convolution? [78.42076260340869]
We prove that a single ViT layer with image patches as the input can perform any convolution operation constructively.
We provide a lower bound on the number of heads for Vision Transformers to express CNNs (a toy sketch illustrating this appears after this list).
arXiv Detail & Related papers (2021-11-02T03:30:17Z)
- Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study these properties via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)
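For the "Can Vision Transformers Perform Convolution?" entry above, the toy sketch below illustrates the intuition behind the constructive result: with one attention head per relative offset, a hard (one-hot) attention pattern on that offset plus a per-head projection taken from the kernel reproduces a 3x3 convolution. This is a deliberately simplified, assumption-laden illustration (pixel-level tokens, hand-set attention weights, no softmax or positional-encoding details), not the paper's formal construction.

```python
import numpy as np

rng = np.random.default_rng(1)
H = W = 6                      # spatial grid of pixel-level "tokens" (toy size)
c_in, c_out, k = 4, 5, 3       # channel counts and kernel size
x = rng.normal(size=(H, W, c_in))
kernel = rng.normal(size=(k, k, c_in, c_out))

# Reference: plain 3x3 convolution with zero padding.
padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
conv = np.zeros((H, W, c_out))
for i in range(H):
    for j in range(W):
        window = padded[i:i + k, j:j + k, :]             # (k, k, c_in)
        conv[i, j] = np.einsum('abc,abcd->d', window, kernel)

# "Attention" version: k*k heads; head (di, dj) puts all of its (one-hot)
# attention weight on the token at relative offset (di, dj), and its
# value/output projection is the matching kernel slice. Summing the heads
# recovers the convolution exactly.
attn = np.zeros((H, W, c_out))
for di in range(k):
    for dj in range(k):
        w_head = kernel[di, dj]                          # (c_in, c_out)
        for i in range(H):
            for j in range(W):
                attended = padded[i + di, j + dj]        # the single attended token
                attn[i, j] += attended @ w_head

print(np.allclose(conv, attn))  # True
```

The paper's actual result handles patch inputs, softmax attention, and relative positional encodings, and bounds how many heads are needed; the sketch only conveys the offset-per-head idea.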