Going deeper with Image Transformers
- URL: http://arxiv.org/abs/2103.17239v1
- Date: Wed, 31 Mar 2021 17:37:32 GMT
- Title: Going deeper with Image Transformers
- Authors: Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve,
Hervé Jégou
- Abstract summary: We build and optimize deeper transformer networks for image classification.
We make two transformer architecture changes that significantly improve the accuracy of deep transformers.
Our best model establishes a new state of the art on ImageNet with Reassessed Labels (ReaL) and on ImageNet-V2 matched frequency.
- Score: 102.61950708108022
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have recently been adapted for large-scale image
classification, achieving high scores that shake up the long-standing supremacy
of convolutional neural networks. However, the optimization of image
transformers has been little studied so far. In this work, we build and
optimize deeper transformer networks for image classification. In particular,
we investigate the interplay between the architecture and the optimization of
such dedicated transformers. We make two changes to the transformer
architecture that significantly improve the accuracy of deep transformers. This
leads us to produce models whose performance does not saturate early with added
depth; for instance, we obtain 86.3% top-1 accuracy on ImageNet when training
with no external data. Our best model establishes a new state of the art on
ImageNet with Reassessed Labels (ReaL) and on ImageNet-V2 matched frequency, in
the setting with no additional training data.
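The two architecture changes behind these numbers are LayerScale, a learnable per-channel scaling of each residual branch initialized near zero, and a class-attention stage (the CaiT design), in which a class token attends to the patch tokens in dedicated final layers. Below is a minimal PyTorch sketch of LayerScale inside a pre-norm block; the names, init value, and block layout are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """Learnable per-channel scale applied to a residual branch output.

    Initialized near zero so each block starts close to the identity,
    which is what lets very deep stacks keep improving with depth.
    (Sketch: the init value here is illustrative; the paper chooses it
    as a function of network depth.)
    """
    def __init__(self, dim: int, init_value: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * x  # broadcasts over (batch, tokens, dim)

class Block(nn.Module):
    """Pre-norm transformer block with LayerScale on both residual branches."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ls1 = LayerScale(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.ls2 = LayerScale(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.ls1(self.attn(h, h, h, need_weights=False)[0])
        return x + self.ls2(self.mlp(self.norm2(x)))
```

With near-zero initialization, each added block starts as a small perturbation of the identity mapping, so adding depth does not destabilize training the way it does in a plain residual stack.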
Related papers
- SpectFormer: Frequency and Attention is what you need in a Vision Transformer [28.01996628113975]
Vision transformers have been applied successfully to image recognition tasks.
We hypothesize that both spectral layers and multi-headed attention play a major role.
We propose the novel SpectFormer architecture, which combines spectral and multi-headed attention layers.
arXiv Detail & Related papers (2023-04-13T12:27:17Z)
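For a rough sense of what the spectral layers above can look like, here is a generic FFT-based token mixer with a learnable frequency-domain filter. This is a sketch in the spirit of the abstract (shapes and initialization are assumptions), not SpectFormer's actual code.

```python
import torch
import torch.nn as nn

class SpectralGating(nn.Module):
    """Illustrative spectral token mixer: FFT over the token axis,
    elementwise multiplication by a learnable complex filter, inverse FFT.
    """
    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        # One complex weight per (frequency, channel); rfft keeps n//2+1 freqs.
        self.filter = nn.Parameter(
            torch.randn(num_tokens // 2 + 1, dim, 2) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        n = x.shape[1]
        freq = torch.fft.rfft(x, dim=1, norm="ortho")
        freq = freq * torch.view_as_complex(self.filter)
        return torch.fft.irfft(freq, n=n, dim=1, norm="ortho")
```

In SpectFormer, spectral layers of this flavor are combined with standard multi-head attention layers in one network, which is the combination the abstract argues for.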
- Boosting vision transformers for image retrieval [11.441395750267052]
Vision transformers have achieved remarkable progress in vision tasks such as image classification and detection.
However, in instance-level image retrieval, transformers have not yet shown good performance compared to convolutional networks.
We propose a number of improvements that make transformers outperform the state of the art for the first time.
arXiv Detail & Related papers (2022-10-21T12:17:12Z)
- On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition [18.557920268145818]
Video vision transformers have been shown to be broadly competitive with convolution-based methods (CNNs) across multiple vision tasks.
Our work empirically explores the low data regime for video classification and discovers that, surprisingly, transformers perform extremely well in the low-labeled video setting.
We even show that, using just the labeled data, transformers significantly outperform complex semi-supervised CNN methods that additionally leverage large-scale unlabeled data.
arXiv Detail & Related papers (2022-09-15T17:12:30Z)
- Pix4Point: Image Pretrained Standard Transformers for 3D Point Cloud Understanding [62.502694656615496]
We present Progressive Point Patch Embedding and a new point cloud Transformer model, PViT.
PViT shares the same backbone as the standard Transformer but is shown to be less data-hungry, enabling the Transformer to achieve performance comparable to the state of the art.
We formulate a simple yet effective pipeline dubbed "Pix4Point" that allows harnessing Transformers pretrained in the image domain to enhance downstream point cloud understanding.
arXiv Detail & Related papers (2022-08-25T17:59:29Z)
- Three things everyone should know about Vision Transformers [67.30250766591405]
Transformer architectures have rapidly gained traction in computer vision.
We offer three insights based on simple and easy-to-implement variants of vision transformers.
We evaluate the impact of these design choices using the ImageNet-1k dataset, and confirm our findings on the ImageNet-v2 test set.
arXiv Detail & Related papers (2022-03-18T08:23:03Z)
- Semi-Supervised Vision Transformers [76.83020291497895]
We study the training of Vision Transformers for semi-supervised image classification.
We find that Vision Transformers perform poorly in a semi-supervised ImageNet setting, while CNNs achieve superior results in the small-labeled-data regime.
arXiv Detail & Related papers (2021-11-22T09:28:13Z)
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers to image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder that drops the full attention implementation with its softmax weighting, keeping only the query-key similarity.
arXiv Detail & Related papers (2021-05-30T05:38:33Z)
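A minimal sketch of what "keeping only the query-key similarity" can look like in practice; the function below is a hypothetical reading of the idea, not the paper's decoder.

```python
import torch

def qk_similarity(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Cross-image matching scores from raw scaled dot products.

    Unlike full attention, there is no softmax weighting and no value
    aggregation: the scaled query-key dot products themselves are the output.
    q: (batch, n_q, dim) tokens from one image; k: (batch, n_k, dim) from another.
    Returns: (batch, n_q, n_k) similarity map.
    """
    return q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5

# Example: match 196 patch tokens of image A against 196 tokens of image B.
q = torch.randn(2, 196, 384)
k = torch.randn(2, 196, 384)
scores = qk_similarity(q, k)  # (2, 196, 196)
```

The appeal of this simplification is that the score map directly expresses image-to-image correspondence, which is the signal that plain ViT-style self-attention lacks for matching tasks.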
- Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet [86.95679590801494]
We explore the potential of vision transformers in ImageNet classification by developing a bag of training techniques.
We show that by slightly tuning the structure of vision transformers and introducing token labeling, our models achieve better results than their CNN counterparts.
arXiv Detail & Related papers (2021-04-22T04:43:06Z)
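Token labeling, as used in the entry above, gives every patch token its own training target, so supervision is dense rather than image-level only. A sketch of what such an objective can look like, assuming hard per-token labels for simplicity (the paper uses machine-annotated dense labels and its own loss weighting):

```python
import torch
import torch.nn.functional as F

def token_labeling_loss(cls_logits, token_logits, image_label, token_labels,
                        token_weight: float = 0.5):
    """Assumed form of a token-labeling objective, not the paper's code.

    cls_logits:   (batch, classes)           prediction from the class token
    token_logits: (batch, tokens, classes)   one prediction per patch token
    image_label:  (batch,)                   image-level class index
    token_labels: (batch, tokens)            per-patch class indices
    """
    # Standard image-level loss on the class token.
    cls_loss = F.cross_entropy(cls_logits, image_label)
    # Dense auxiliary loss: every patch token is classified independently.
    tok_loss = F.cross_entropy(token_logits.flatten(0, 1),
                               token_labels.flatten())
    return cls_loss + token_weight * tok_loss
```

The dense term supplies many more gradient signals per image than the single class-token loss, which is one plausible reason the technique helps at modest model sizes.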
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.