Vision Transformers provably learn spatial structure
- URL: http://arxiv.org/abs/2210.09221v1
- Date: Thu, 13 Oct 2022 19:53:56 GMT
- Title: Vision Transformers provably learn spatial structure
- Authors: Samy Jelassi, Michael E. Sander, Yuanzhi Li
- Abstract summary: Vision Transformers (ViTs) have achieved performance comparable or superior to Convolutional Neural Networks (CNNs) in computer vision.
Yet, recent works have shown that while minimizing their training loss, ViTs specifically learn spatially localized patterns.
- Score: 34.61885883486938
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) have achieved performance comparable or
superior to Convolutional Neural Networks (CNNs) in computer vision. This empirical
breakthrough is even more remarkable since, in contrast to CNNs, ViTs do not
embed any visual inductive bias of spatial locality. Yet, recent works have
shown that while minimizing their training loss, ViTs specifically learn
spatially localized patterns. This raises a central question: how do ViTs learn
these patterns by solely minimizing their training loss using gradient-based
methods from random initialization? In this paper, we provide some theoretical
justification of this phenomenon. We propose a spatially structured dataset and
a simplified ViT model. In this model, the attention matrix solely depends on
the positional encodings. We call this mechanism the positional attention
mechanism. On the theoretical side, we consider a binary classification task
and show that while the learning problem admits multiple solutions that
generalize, our model implicitly learns the spatial structure of the dataset
while generalizing: we call this phenomenon patch association. We prove that
patch association helps to sample-efficiently transfer to downstream datasets
that share the same structure as the pre-training one but differ in the
features. Lastly, we empirically verify that a ViT with positional attention
performs similarly to the original one on CIFAR-10/100, SVHN and ImageNet.
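The positional attention mechanism described in the abstract can be pictured as self-attention in which the query/key path is replaced by learnable positional encodings, so the attention matrix is the same for every input. The following is a minimal, hypothetical PyTorch sketch under that reading, not the paper's implementation; names such as PositionalAttention and pos_dim are illustrative assumptions.

```python
# Illustrative sketch (not the authors' code) of a position-only attention layer:
# the attention matrix depends solely on learnable positional encodings.
import torch
import torch.nn as nn

class PositionalAttention(nn.Module):
    """Single-head attention whose scores depend only on patch positions."""
    def __init__(self, num_patches: int, dim: int, pos_dim: int = 64):
        super().__init__()
        # Learnable positional encodings, one per patch location (assumed shape).
        self.pos = nn.Parameter(torch.randn(num_patches, pos_dim) / pos_dim ** 0.5)
        # The value projection still acts on patch features.
        self.value = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim)
        # Attention logits come from positional encodings alone (Q = K = pos),
        # so the attention matrix is shared by all inputs in the batch.
        logits = self.pos @ self.pos.t() / self.pos.shape[-1] ** 0.5  # (P, P)
        attn = logits.softmax(dim=-1)                                 # input-independent
        return attn @ self.value(x)                                   # (batch, P, dim)

# Toy usage: 16 patches of dimension 32.
x = torch.randn(8, 16, 32)
print(PositionalAttention(num_patches=16, dim=32)(x).shape)  # torch.Size([8, 16, 32])
```

Because the attention weights are input-independent, the layer can only learn which patches to associate with which, which is the "patch association" behavior the paper analyzes.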
Related papers
- Structured Initialization for Attention in Vision Transformers [34.374054040300805]
Convolutional neural networks (CNNs) have an architectural inductive bias that enables them to perform well on small-scale problems.
We argue that the architectural bias inherent to CNNs can be reinterpreted as an initialization bias within ViT.
This insight is significant as it empowers ViTs to perform equally well on small-scale problems while maintaining their flexibility for large-scale applications.
arXiv Detail & Related papers (2024-04-01T14:34:47Z)
- What do Vision Transformers Learn? A Visual Exploration [68.50771218442776]
Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin.
arXiv Detail & Related papers (2022-12-13T16:55:12Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global-attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- SERE: Exploring Feature Self-relation for Self-supervised Transformer [79.5769147071757]
Vision transformers (ViT) have strong representation ability with spatial self-attention and channel-level feedforward networks.
Recent works reveal that self-supervised learning helps unleash the great potential of ViT.
We observe that relational modeling on spatial and channel dimensions distinguishes ViT from other networks.
arXiv Detail & Related papers (2022-06-10T15:25:00Z)
- Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer [3.158346511479111]
We propose a simple but effective self-supervised learning (SSL) strategy to train Vision Transformers (ViTs).
We define a set of SSL tasks based on relations of image patches that the model has to solve before or jointly during the downstream training.
Our RelViT model optimizes all the output tokens of the transformer encoder that are related to the image patches, thus exploiting more training signal at each training step.
arXiv Detail & Related papers (2022-06-01T13:25:32Z)
- Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN with ViTs significantly surpasses other ViT-based few-shot learning frameworks and is the first to outperform CNN state-of-the-art methods.
arXiv Detail & Related papers (2022-03-14T12:53:27Z)
- Do Vision Transformers See Like Convolutional Neural Networks? [45.69780772718875]
Recent work has shown that (Vision) Transformer models (ViT) can achieve performance comparable or even superior to CNNs on image classification tasks.
Are they acting like convolutional networks, or learning entirely different visual representations?
We find striking differences between the two architectures, such as ViT having more uniform representations across all layers.
arXiv Detail & Related papers (2021-08-19T17:27:03Z)
- Efficient Training of Visual Transformers with Small-Size Datasets [64.60765211331697]
Visual Transformers (VTs) are emerging as an architectural paradigm alternative to Convolutional networks (CNNs).
We show that, despite having comparable accuracy when trained on ImageNet, their performance on smaller datasets can differ substantially.
We propose a self-supervised task which can extract additional information from images with only a negligible computational overhead.
arXiv Detail & Related papers (2021-06-07T16:14:06Z)
- ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases [16.308432111311195]
Vision Transformers (ViTs) rely on more flexible self-attention layers, and have recently outperformed CNNs for image classification.
We introduce gated positional self-attention (GPSA), a form of positional self-attention which can be equipped with a "soft" convolutional inductive bias.
The resulting convolutional-like ViT architecture, ConViT, outperforms DeiT on ImageNet while offering much improved sample efficiency (a rough GPSA-style sketch follows this list).
arXiv Detail & Related papers (2021-03-19T09:11:20Z)
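For the ConViT entry above, the gated positional self-attention (GPSA) idea can be pictured roughly as follows: a learnable gate blends ordinary content-based attention with a position-only attention term that can encode a convolution-like locality bias. This is a hedged, simplified sketch, not the authors' implementation; the class name, the single scalar gate, and the dense pos_logits table are illustrative assumptions (the published GPSA uses per-head gating and relative positional encodings initialized to mimic convolutional locality).

```python
# Simplified GPSA-style layer: gate between content attention and positional attention.
import torch
import torch.nn as nn

class GatedPositionalSelfAttention(nn.Module):
    """Blend of content attention and position-only attention via a learnable gate."""
    def __init__(self, num_patches: int, dim: int):
        super().__init__()
        self.qk = nn.Linear(dim, 2 * dim, bias=False)    # content queries and keys
        self.value = nn.Linear(dim, dim)
        # Position-only attention logits; could be initialized to favour local
        # neighbourhoods so the layer starts out convolution-like.
        self.pos_logits = nn.Parameter(torch.zeros(num_patches, num_patches))
        # Gating scalar (per-head in practice); sigmoid(gate) weights the positional term.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim)
        q, k = self.qk(x).chunk(2, dim=-1)
        content = ((q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5).softmax(dim=-1)
        positional = self.pos_logits.softmax(dim=-1)     # data-independent attention
        g = torch.sigmoid(self.gate)                     # strength of the "soft" bias
        attn = (1 - g) * content + g * positional
        return attn @ self.value(x)

x = torch.randn(4, 16, 32)
print(GatedPositionalSelfAttention(16, 32)(x).shape)  # torch.Size([4, 16, 32])
```

Setting the gate near 1 recovers a purely positional attention similar to the main paper's mechanism, while a gate near 0 recovers standard content-based self-attention.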
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.