ConViT: Improving Vision Transformers with Soft Convolutional Inductive
Biases
- URL: http://arxiv.org/abs/2103.10697v1
- Date: Fri, 19 Mar 2021 09:11:20 GMT
- Title: ConViT: Improving Vision Transformers with Soft Convolutional Inductive
Biases
- Authors: Stéphane d'Ascoli, Hugo Touvron, Matthew Leavitt, Ari Morcos, Giulio
Biroli, Levent Sagun
- Abstract summary: Vision Transformers (ViTs) rely on more flexible self-attention layers, and have recently outperformed CNNs for image classification.
We introduce gated positional self-attention (GPSA), a form of positional self-attention which can be equipped with a "soft" convolutional inductive bias.
The resulting convolutional-like ViT architecture, ConViT, outperforms the DeiT on ImageNet, while offering a much improved sample efficiency.
- Score: 16.308432111311195
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolutional architectures have proven extremely successful for vision
tasks. Their hard inductive biases enable sample-efficient learning, but come
at the cost of a potentially lower performance ceiling. Vision Transformers
(ViTs) rely on more flexible self-attention layers, and have recently
outperformed CNNs for image classification. However, they require costly
pre-training on large external datasets or distillation from pre-trained
convolutional networks. In this paper, we ask the following question: is it
possible to combine the strengths of these two architectures while avoiding
their respective limitations? To this end, we introduce gated positional
self-attention (GPSA), a form of positional self-attention which can be
equipped with a "soft" convolutional inductive bias. We initialize the GPSA
layers to mimic the locality of convolutional layers, then give each attention
head the freedom to escape locality by adjusting a gating parameter regulating
the attention paid to position versus content information. The resulting
convolutional-like ViT architecture, ConViT, outperforms the DeiT on ImageNet,
while offering a much improved sample efficiency. We further investigate the
role of locality in learning by first quantifying how it is encouraged in
vanilla self-attention layers, then analyzing how it is escaped in GPSA layers.
We conclude by presenting various ablations to better understand the success of
the ConViT. Our code and models are released publicly.
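As a rough illustration of the mechanism described in the abstract, the sketch below shows one way a gated positional self-attention layer can blend content-based and position-based attention through a per-head gate. The tensor shapes, the relative-position encoding, the module names, and the initialization values are simplifying assumptions of this sketch, not the released ConViT implementation.

```python
# Hedged sketch of gated positional self-attention (GPSA). Content attention
# (softmax over query-key scores) and positional attention (softmax over
# relative-position scores) are each normalized, then blended by a learnable
# per-head gate. Shapes and initialization are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GPSA(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qk = nn.Linear(dim, 2 * dim, bias=False)   # content queries and keys
        self.v = nn.Linear(dim, dim, bias=False)        # values
        self.pos_proj = nn.Linear(3, num_heads)         # relative offsets -> per-head positional scores
        # One gating parameter per head. Initializing it to 1 puts sigmoid(gate)
        # above 0.5, so the positional (convolution-like) term dominates at the
        # start; the exact initialization used here is an assumption.
        self.gate = nn.Parameter(torch.ones(num_heads))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, rel_pos):
        # x: (B, N, dim) patch embeddings
        # rel_pos: (N, N, 3) relative offsets between patches, e.g. (dx, dy, dx^2 + dy^2)
        B, N, _ = x.shape
        q, k = self.qk(x).chunk(2, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v(x).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        content = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)   # (B, H, N, N)
        positional = F.softmax(self.pos_proj(rel_pos).permute(2, 0, 1), dim=-1)       # (H, N, N)

        g = torch.sigmoid(self.gate).view(1, -1, 1, 1)   # attention paid to position vs. content
        attn = (1.0 - g) * content + g * positional.unsqueeze(0)

        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.proj(out)
```

Because both the content and positional terms are row-normalized, their convex combination is still a valid attention distribution. In the paper, the positional projection is additionally initialized so that each head focuses on a distinct fixed offset, mimicking a convolutional kernel; that initialization is omitted here for brevity, and a head can escape locality during training simply by driving its gate toward zero.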
Related papers
- A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions.
With less than 14M parameters, our FCViT-S12 outperforms the related ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-12-23T19:13:43Z) - Rethinking Hierarchicies in Pre-trained Plain Vision Transformer [76.35955924137986]
Self-supervised pre-training of vision transformers (ViT) via masked image modeling (MIM) has proven very effective.
However, customized algorithms (e.g., GreenMIM) must be carefully designed for hierarchical ViTs, instead of using the vanilla and simple MAE for the plain ViT.
This paper proposes a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training.
arXiv Detail & Related papers (2022-11-03T13:19:23Z) - Vision Transformers provably learn spatial structure [34.61885883486938]
Vision Transformers (ViTs) have achieved performance comparable or superior to Convolutional Neural Networks (CNNs) in computer vision.
Yet, recent works have shown that while minimizing their training loss, ViTs specifically learn spatially localized patterns.
arXiv Detail & Related papers (2022-10-13T19:53:56Z) - SERE: Exploring Feature Self-relation for Self-supervised Transformer [79.5769147071757]
Vision transformers (ViT) have strong representation ability with spatial self-attention and channel-level feedforward networks.
Recent works reveal that self-supervised learning helps unleash the great potential of ViT.
We observe that relational modeling on spatial and channel dimensions distinguishes ViT from other networks.
arXiv Detail & Related papers (2022-06-10T15:25:00Z) - Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z) - Can Vision Transformers Perform Convolution? [78.42076260340869]
We prove that a single ViT layer with image patches as the input can perform any convolution operation constructively.
We provide a lower bound on the number of heads for Vision Transformers to express CNNs.
arXiv Detail & Related papers (2021-11-02T03:30:17Z) - Do Vision Transformers See Like Convolutional Neural Networks? [45.69780772718875]
Recent work has shown that (Vision) Transformer models (ViT) can achieve performance comparable or even superior to convolutional networks on image classification tasks.
Are they acting like convolutional networks, or learning entirely different visual representations?
We find striking differences between the two architectures, such as ViT having more uniform representations across all layers.
arXiv Detail & Related papers (2021-08-19T17:27:03Z) - DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are made deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity.
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
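As a rough illustration of the Re-attention idea from the DeepViT entry above, the sketch below blends the per-head attention maps with a learnable head-to-head mixing before applying them to the values, which is how the method increases attention-map diversity in deeper layers. The use of a 1x1 convolution for the mixing and batch normalization after it are assumptions of this sketch, not necessarily the authors' exact implementation.

```python
# Hedged sketch of Re-attention (DeepViT): per-head attention maps are mixed
# across heads by a learnable transformation before attending to the values.
# Module choices (1x1 conv for mixing, BatchNorm2d after it) are assumptions.
import torch.nn as nn
import torch.nn.functional as F


class ReAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.mix_heads = nn.Conv2d(num_heads, num_heads, kernel_size=1)  # learnable H x H head mixing
        self.norm = nn.BatchNorm2d(num_heads)                            # assumed normalization after mixing
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, N, dim) token embeddings
        B, N, _ = x.shape
        qkv = self.qkv(x).view(B, N, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]                                 # each: (B, H, N, head_dim)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        attn = self.norm(self.mix_heads(attn))                           # blend attention maps across heads
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.proj(out)
```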