Training data-efficient image transformers & distillation through
attention
- URL: http://arxiv.org/abs/2012.12877v2
- Date: Fri, 15 Jan 2021 15:52:50 GMT
- Title: Training data-efficient image transformers & distillation through
attention
- Authors: Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa,
Alexandre Sablayrolles, Hervé Jégou
- Abstract summary: We produce a competitive convolution-free transformer by training on Imagenet only.
Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1%.
- Score: 93.22667339525832
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, neural networks purely based on attention were shown to address
image understanding tasks such as image classification. However, these visual
transformers are pre-trained with hundreds of millions of images using an
expensive infrastructure, thereby limiting their adoption.
In this work, we produce a competitive convolution-free transformer by
training on Imagenet only. We train them on a single computer in less than 3
days. Our reference vision transformer (86M parameters) achieves top-1 accuracy
of 83.1% (single-crop evaluation) on ImageNet with no external data.
More importantly, we introduce a teacher-student strategy specific to
transformers. It relies on a distillation token ensuring that the student
learns from the teacher through attention. We show the interest of this
token-based distillation, especially when using a convnet as a teacher. This
leads us to report results competitive with convnets for both Imagenet (where
we obtain up to 85.2% accuracy) and when transferring to other tasks. We share
our code and models.
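
As a concrete illustration of the teacher-student strategy described in the abstract, the following is a minimal sketch of the hard-label variant of distillation through a distillation token: the class-token head is trained against the true labels, while the distillation-token head is trained against the teacher's hard decisions. This is a sketch, not the authors' released code; names such as `class_logits` and `dist_logits` are illustrative, and the soft (KL-based) variant is omitted.

```python
# Minimal sketch of DeiT-style hard-label distillation through a distillation token.
# Assumes the student exposes two heads: one on the class token and one on the
# distillation token. Variable names are illustrative, not DeiT's API.
import torch
import torch.nn.functional as F


def hard_distillation_loss(class_logits: torch.Tensor,
                           dist_logits: torch.Tensor,
                           teacher_logits: torch.Tensor,
                           labels: torch.Tensor) -> torch.Tensor:
    """Average of (a) cross-entropy of the class-token head against the true labels
    and (b) cross-entropy of the distillation-token head against the teacher's
    hard predictions (argmax of the teacher logits)."""
    ce_true = F.cross_entropy(class_logits, labels)
    teacher_labels = teacher_logits.argmax(dim=-1)      # hard teacher decisions
    ce_teacher = F.cross_entropy(dist_logits, teacher_labels)
    return 0.5 * ce_true + 0.5 * ce_teacher


# Toy usage with random tensors standing in for model outputs.
batch, num_classes = 8, 1000
class_logits = torch.randn(batch, num_classes, requires_grad=True)
dist_logits = torch.randn(batch, num_classes, requires_grad=True)
with torch.no_grad():
    teacher_logits = torch.randn(batch, num_classes)    # e.g. a frozen convnet teacher
labels = torch.randint(0, num_classes, (batch,))
loss = hard_distillation_loss(class_logits, dist_logits, teacher_logits, labels)
loss.backward()
```

At test time the paper fuses the two heads (adding their softmax outputs); that step is omitted from this sketch.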
Related papers
- Three things everyone should know about Vision Transformers [67.30250766591405]
Transformer architectures have rapidly gained traction in computer vision.
We offer three insights based on simple and easy-to-implement variants of vision transformers.
We evaluate the impact of these design choices using the ImageNet-1k dataset, and confirm our findings on the ImageNet-v2 test set.
arXiv Detail & Related papers (2022-03-18T08:23:03Z)
- CMT: Convolutional Neural Networks Meet Vision Transformers [68.10025999594883]
Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image.
There are still gaps in both performance and computational cost between transformers and existing convolutional neural networks (CNNs).
We propose a new transformer based hybrid network by taking advantage of transformers to capture long-range dependencies, and of CNNs to model local features.
In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet, while being 14x and 2x smaller on FLOPs than the existing DeiT and EfficientNet, respectively.
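
The block below is a generic illustration of the hybrid principle this summary describes (convolutions for local structure, self-attention for long-range dependencies). It is not the actual CMT block, whose design differs in detail; all module names and hyperparameters are illustrative.

```python
# Generic conv + self-attention hybrid block: local modelling via a depthwise
# convolution, global modelling via multi-head self-attention. Illustrative only.
import torch
import torch.nn as nn


class HybridBlock(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        # Depthwise conv aggregates local neighbourhoods on the 2-D feature map.
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.LayerNorm(dim)
        # Self-attention relates every spatial position to every other one.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        x = x + self.local(x)                             # local residual branch
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)             # (B, H*W, C)
        n = self.norm(tokens)
        attn_out, _ = self.attn(n, n, n)                  # global residual branch
        tokens = tokens + attn_out
        return tokens.transpose(1, 2).reshape(b, c, h, w)


feat = torch.randn(2, 64, 14, 14)
out = HybridBlock()(feat)
print(out.shape)  # torch.Size([2, 64, 14, 14])
```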
arXiv Detail & Related papers (2021-07-13T17:47:19Z)
- Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks.
Although network performance is boosted, transformers often require more computational resources.
We propose to excavate useful information from the teacher transformer through the relationship between images and the divided patches.
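
The summary is terse; one common way to transfer "relationships" between patches, rather than raw logits, is to match the pairwise similarity structure of teacher and student patch embeddings. The sketch below shows that generic relational-distillation idea under this reading; it is only an approximation, not the paper's exact fine-grained manifold loss.

```python
# Generic relational distillation between patch embeddings: match the pairwise
# cosine-similarity structure of teacher and student tokens. Illustrative only.
import torch
import torch.nn.functional as F


def relation_distill_loss(student_tokens: torch.Tensor,
                          teacher_tokens: torch.Tensor) -> torch.Tensor:
    """student_tokens, teacher_tokens: (batch, num_patches, dim); dims may differ."""
    s = F.normalize(student_tokens, dim=-1)
    t = F.normalize(teacher_tokens, dim=-1)
    # Per-image patch-to-patch similarity matrices.
    sim_s = s @ s.transpose(1, 2)          # (batch, P, P)
    sim_t = t @ t.transpose(1, 2)
    return F.mse_loss(sim_s, sim_t)


student = torch.randn(4, 196, 384, requires_grad=True)   # e.g. small student patch tokens
teacher = torch.randn(4, 196, 768)                        # e.g. larger teacher patch tokens
relation_distill_loss(student, teacher).backward()
```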
arXiv Detail & Related papers (2021-07-03T08:28:34Z)
- Co-advise: Cross Inductive Bias Distillation [39.61426495884721]
We propose a novel distillation-based method to train vision transformers.
We introduce lightweight teachers with different architectural inductive biases to co-advise the student transformer.
Our vision transformers (termed CivT) outperform all previous transformers of the same architecture on ImageNet.
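
A minimal sketch of the cross-inductive-bias idea as stated above: the student's distillation signal is averaged over several lightweight teachers with different inductive biases. Soft (KL) distillation, equal teacher weights, and the temperature are assumptions for illustration; the actual CivT loss and teacher choices may differ.

```python
# Sketch of distilling a student transformer from several lightweight teachers
# with different inductive biases. Weighting and temperature are assumptions.
from typing import List

import torch
import torch.nn.functional as F


def multi_teacher_distill_loss(student_logits: torch.Tensor,
                               teacher_logits_list: List[torch.Tensor],
                               labels: torch.Tensor,
                               tau: float = 3.0,
                               alpha: float = 0.5) -> torch.Tensor:
    ce = F.cross_entropy(student_logits, labels)
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    # Average KL divergence to each teacher's softened distribution.
    kd = torch.stack([
        F.kl_div(log_p_student, F.softmax(t / tau, dim=-1), reduction="batchmean")
        for t in teacher_logits_list
    ]).mean() * (tau ** 2)
    return alpha * ce + (1.0 - alpha) * kd


student_logits = torch.randn(8, 1000, requires_grad=True)
teachers = [torch.randn(8, 1000), torch.randn(8, 1000)]   # e.g. teachers with different biases
labels = torch.randint(0, 1000, (8,))
multi_teacher_distill_loss(student_logits, teachers, labels).backward()
```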
arXiv Detail & Related papers (2021-06-23T13:19:59Z)
- BEiT: BERT Pre-Training of Image Transformers [43.704968112586876]
We introduce a self-supervised vision representation model, BEiT, which stands for Bidirectional Encoder representation from Image Transformers.
Specifically, each image has two views in our pre-training: image patches and visual tokens.
We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and feed them into the backbone Transformer.
The pre-training objective is to recover the original visual tokens based on the corrupted image patches.
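
A minimal sketch of the objective as described: some patch positions are masked, the backbone encodes the corrupted patch sequence, and a head predicts the discrete visual-token id of each masked position. The encoder, vocabulary size, masking ratio, and mask embedding below are stand-ins, not BEiT's actual modules.

```python
# Sketch of the masked-image-modelling objective described in the BEiT summary:
# predict the visual-token id of each masked patch from the corrupted sequence.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_patches, dim, vocab_size = 196, 384, 8192
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=6, batch_first=True),
    num_layers=2)
head = nn.Linear(dim, vocab_size)                 # predicts visual-token ids
mask_embedding = nn.Parameter(torch.zeros(dim))   # learned embedding for masked positions

patch_embeddings = torch.randn(4, num_patches, dim)             # embedded image patches
visual_tokens = torch.randint(0, vocab_size, (4, num_patches))  # ids from an image tokenizer

mask = torch.rand(4, num_patches) < 0.4           # randomly mask ~40% of the patches
corrupted = torch.where(mask.unsqueeze(-1), mask_embedding, patch_embeddings)

logits = head(encoder(corrupted))                 # (4, num_patches, vocab_size)
# Cross-entropy only on masked positions: recover their original visual tokens.
loss = F.cross_entropy(logits[mask], visual_tokens[mask])
loss.backward()
```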
arXiv Detail & Related papers (2021-06-15T16:02:37Z)
- Self-Supervised Learning with Swin Transformers [24.956637957269926]
We present a self-supervised learning approach called MoBY, with Vision Transformers as its backbone architecture.
The approach introduces no new inventions; it combines MoCo v2 and BYOL.
Its performance is slightly better than that of the recent MoCo v3 and DINO, which adopt DeiT as the backbone, while using much lighter tricks.
arXiv Detail & Related papers (2021-05-10T17:59:45Z)
- Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet [86.95679590801494]
We explore the potential of vision transformers in ImageNet classification by developing a bag of training techniques.
We show that by slightly tuning the structure of vision transformers and introducing token labeling, our models are able to achieve better results than their CNN counterparts.
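
Under one reading of token labeling, every patch token receives its own location-specific target (pre-computed by an annotator model) in addition to the usual image-level label on the class token. The sketch below shows such a combined objective; the loss weighting and the way the dense targets are generated are assumptions, and the paper's exact formulation may differ.

```python
# Sketch of a token-labeling style objective: image-level loss on the class token
# plus a per-patch loss against dense targets (stand-ins for annotator outputs).
import torch
import torch.nn.functional as F

batch, num_patches, num_classes = 4, 196, 1000
cls_logits = torch.randn(batch, num_classes, requires_grad=True)                  # class-token head
patch_logits = torch.randn(batch, num_patches, num_classes, requires_grad=True)   # per-token head

image_labels = torch.randint(0, num_classes, (batch,))
token_labels = torch.randint(0, num_classes, (batch, num_patches))  # dense targets (stand-in)

cls_loss = F.cross_entropy(cls_logits, image_labels)
token_loss = F.cross_entropy(patch_logits.reshape(-1, num_classes),
                             token_labels.reshape(-1))
loss = cls_loss + 0.5 * token_loss      # 0.5 is an illustrative weight
loss.backward()
```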
arXiv Detail & Related papers (2021-04-22T04:43:06Z)
- Going deeper with Image Transformers [102.61950708108022]
We build and optimize deeper transformer networks for image classification.
We make two transformer architecture changes that significantly improve the accuracy of deep transformers.
Our best model establishes a new state of the art on ImageNet with Reassessed Labels and on ImageNet-V2 (matched frequency).
arXiv Detail & Related papers (2021-03-31T17:37:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.