Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with
56M Parameters on ImageNet
- URL: http://arxiv.org/abs/2104.10858v2
- Date: Fri, 23 Apr 2021 08:50:56 GMT
- Title: Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with
56M Parameters on ImageNet
- Authors: Zihang Jiang, Qibin Hou, Li Yuan, Daquan Zhou, Xiaojie Jin, Anran
Wang, Jiashi Feng
- Abstract summary: We explore the potential of vision transformers in ImageNet classification by developing a bag of training techniques.
We show that by slightly tuning the structure of vision transformers and introducing token labeling, our models are able to achieve better results than their CNN counterparts.
- Score: 86.95679590801494
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper provides a strong baseline for vision transformers on the ImageNet
classification task. While recent vision transformers have demonstrated
promising results in ImageNet classification, their performance still lags
behind powerful convolutional neural networks (CNNs) with approximately the
same model size. In this work, instead of describing a novel transformer
architecture, we explore the potential of vision transformers in ImageNet
classification by developing a bag of training techniques. We show that by
slightly tuning the structure of vision transformers and introducing token
labeling, a new training objective, our models are able to achieve better
results than their CNN counterparts and other transformer-based classification
models with a similar amount of training parameters and computation. Taking a
vision transformer with 26M learnable parameters as an example, we can achieve
an 84.4% Top-1 accuracy on ImageNet. When the model size is scaled up to
56M/150M, the result can be further increased to 85.4%/86.2% without extra
data. We hope this study could provide researchers with useful techniques to
train powerful vision transformers. Our code and all the training details will
be made publicly available at https://github.com/zihangJiang/TokenLabeling.
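The core idea behind token labeling is to give every patch token its own dense supervision in addition to the image-level label on the class token. The sketch below illustrates one way such an auxiliary per-token loss can be combined with the standard class-token cross-entropy; the shapes, the soft-label source, and the weighting factor are illustrative assumptions rather than the paper's exact recipe (see the linked repository for that).
```python
# Minimal sketch of a token-labeling style objective (an assumption-laden
# illustration, not the authors' exact implementation): an image-level
# cross-entropy on the class token plus an auxiliary soft cross-entropy on
# every patch token, each patch supervised by its own soft label.
import torch
import torch.nn.functional as F


def token_labeling_loss(cls_logits, token_logits, image_label,
                        token_soft_labels, aux_weight=0.5):
    """cls_logits:        (B, C)    logits predicted from the class token
    token_logits:      (B, N, C) logits predicted from the N patch tokens
    image_label:       (B,)      ground-truth image-level class indices
    token_soft_labels: (B, N, C) per-patch soft labels (hypothetically produced
                                 by a pretrained annotator model)
    aux_weight:        weight of the auxiliary token-level term (assumed value)
    """
    # Standard image-level loss on the class token.
    cls_loss = F.cross_entropy(cls_logits, image_label)

    # Soft cross-entropy for each patch token, averaged over tokens and batch.
    log_probs = F.log_softmax(token_logits, dim=-1)
    token_loss = -(token_soft_labels * log_probs).sum(dim=-1).mean()

    return cls_loss + aux_weight * token_loss


# Quick shape check with random tensors.
if __name__ == "__main__":
    B, N, C = 4, 196, 1000
    loss = token_labeling_loss(
        torch.randn(B, C),
        torch.randn(B, N, C),
        torch.randint(0, C, (B,)),
        torch.softmax(torch.randn(B, N, C), dim=-1),
    )
    print(loss.item())
```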
Related papers
- Semi-Supervised Vision Transformers [76.83020291497895]
We study the training of Vision Transformers for semi-supervised image classification.
We find that Vision Transformers perform poorly in a semi-supervised ImageNet setting.
CNNs achieve superior results in the small labeled-data regime.
arXiv Detail & Related papers (2021-11-22T09:28:13Z)
- CMT: Convolutional Neural Networks Meet Vision Transformers [68.10025999594883]
Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image.
There are still gaps in both performance and computational cost between transformers and existing convolutional neural networks (CNNs).
We propose a new transformer-based hybrid network that takes advantage of transformers to capture long-range dependencies and of CNNs to model local features; a generic sketch of this hybrid pattern follows the related-papers list.
In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet, while being 14x and 2x smaller on FLOPs than the existing DeiT and EfficientNet, respectively.
arXiv Detail & Related papers (2021-07-13T17:47:19Z)
- Scaling Vision Transformers [82.08465256393514]
We study how Vision Transformers scale and characterize the relationships between error rate, data, and compute.
We train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy.
The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
arXiv Detail & Related papers (2021-06-08T17:47:39Z)
- Aggregating Nested Transformers [42.96279765218623]
We explore the idea of nesting basic local transformers on non-overlapping image blocks and aggregating them in a hierarchical manner.
We find that the block aggregation function plays a critical role in enabling cross-block non-local information communication.
Our empirical results show that the proposed method NesT converges faster and requires much less training data to achieve good generalization.
arXiv Detail & Related papers (2021-05-26T17:56:48Z)
- Self-Supervised Learning with Swin Transformers [24.956637957269926]
We present a self-supervised learning approach called MoBY, with Vision Transformers as its backbone architecture.
The approach contains no new inventions; it is a combination of MoCo v2 and BYOL.
Its performance is slightly better than that of recent works such as MoCo v3 and DINO, which adopt DeiT as the backbone, while relying on much lighter tricks.
arXiv Detail & Related papers (2021-05-10T17:59:45Z)
- Going deeper with Image Transformers [102.61950708108022]
We build and optimize deeper transformer networks for image classification.
We make two transformer architecture changes that significantly improve the accuracy of deep transformers.
Our best model establishes a new state of the art on ImageNet with Reassessed Labels and on ImageNet-V2 (matched frequency).
arXiv Detail & Related papers (2021-03-31T17:37:32Z)
- Training data-efficient image transformers & distillation through attention [93.22667339525832]
We produce a competitive convolution-free transformer by training on ImageNet only.
Our reference vision transformer (86M parameters) achieves a top-1 accuracy of 83.1%.
arXiv Detail & Related papers (2020-12-23T18:42:10Z)
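As mentioned in the CMT entry above, hybrid designs pair convolutions for local features with self-attention for long-range dependencies. The block below is only a generic sketch of that pattern under assumed dimensions and ordering; it is not CMT's actual module.
```python
# Generic conv-plus-attention block: a depthwise convolution for local
# features followed by multi-head self-attention for long-range dependencies.
# Dimensions, grid size, and ordering are illustrative assumptions.
import torch
import torch.nn as nn


class HybridBlock(nn.Module):
    def __init__(self, dim=256, num_heads=4, grid=14):
        super().__init__()
        self.grid = grid  # patch tokens are assumed to form a grid x grid map
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, N, dim) with N == grid * grid
        B, N, D = x.shape
        # Local branch: fold tokens back into a feature map, apply depthwise conv.
        fmap = x.transpose(1, 2).reshape(B, D, self.grid, self.grid)
        x = x + self.local(fmap).flatten(2).transpose(1, 2)
        # Global branch: self-attention over all tokens.
        y = self.norm(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        return x


# Example: tokens = torch.randn(2, 14 * 14, 256); out = HybridBlock()(tokens)
```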