Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with
56M Parameters on ImageNet
- URL: http://arxiv.org/abs/2104.10858v2
- Date: Fri, 23 Apr 2021 08:50:56 GMT
- Title: Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with
56M Parameters on ImageNet
- Authors: Zihang Jiang, Qibin Hou, Li Yuan, Daquan Zhou, Xiaojie Jin, Anran
Wang, Jiashi Feng
- Abstract summary: We explore the potential of vision transformers in ImageNet classification by developing a bag of training techniques.
We show that by slightly tuning the structure of vision transformers and introducing token labeling, our models are able to achieve better results than their CNN counterparts.
- Score: 86.95679590801494
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper provides a strong baseline for vision transformers on the ImageNet
classification task. While recent vision transformers have demonstrated
promising results in ImageNet classification, their performance still lags
behind powerful convolutional neural networks (CNNs) with approximately the
same model size. In this work, instead of describing a novel transformer
architecture, we explore the potential of vision transformers in ImageNet
classification by developing a bag of training techniques. We show that by
slightly tuning the structure of vision transformers and introducing token
labeling, a new training objective, our models are able to achieve better
results than their CNN counterparts and other transformer-based classification
models with a similar amount of training parameters and computation. Taking a
vision transformer with 26M learnable parameters as an example, we can achieve
an 84.4% Top-1 accuracy on ImageNet. When the model size is scaled up to
56M/150M, the result can be further increased to 85.4%/86.2% without extra
data. We hope this study could provide researchers with useful techniques to
train powerful vision transformers. Our code and all the training details will
be made publicly available at https://github.com/zihangJiang/TokenLabeling.
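The core idea behind token labeling is to give every patch token its own dense supervision in addition to the image-level label on the class token. The sketch below illustrates one way such an auxiliary per-token loss can be combined with the standard class-token cross-entropy; the shapes, the soft-label source, and the weighting factor are illustrative assumptions rather than the paper's exact recipe (see the linked repository for that).
```python
# Minimal sketch of a token-labeling style objective (an assumption-laden
# illustration, not the authors' exact implementation): an image-level
# cross-entropy on the class token plus an auxiliary soft cross-entropy on
# every patch token, each patch supervised by its own soft label.
import torch
import torch.nn.functional as F


def token_labeling_loss(cls_logits, token_logits, image_label,
                        token_soft_labels, aux_weight=0.5):
    """cls_logits:        (B, C)    logits predicted from the class token
    token_logits:      (B, N, C) logits predicted from the N patch tokens
    image_label:       (B,)      ground-truth image-level class indices
    token_soft_labels: (B, N, C) per-patch soft labels (hypothetically produced
                                 by a pretrained annotator model)
    aux_weight:        weight of the auxiliary token-level term (assumed value)
    """
    # Standard image-level loss on the class token.
    cls_loss = F.cross_entropy(cls_logits, image_label)

    # Soft cross-entropy for each patch token, averaged over tokens and batch.
    log_probs = F.log_softmax(token_logits, dim=-1)
    token_loss = -(token_soft_labels * log_probs).sum(dim=-1).mean()

    return cls_loss + aux_weight * token_loss


# Quick shape check with random tensors.
if __name__ == "__main__":
    B, N, C = 4, 196, 1000
    loss = token_labeling_loss(
        torch.randn(B, C),
        torch.randn(B, N, C),
        torch.randint(0, C, (B,)),
        torch.softmax(torch.randn(B, N, C), dim=-1),
    )
    print(loss.item())
```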
Related papers
- Semi-Supervised Vision Transformers [76.83020291497895]
We study the training of Vision Transformers for semi-supervised image classification.
We find that Vision Transformers perform poorly in a semi-supervised ImageNet setting.
CNNs achieve superior results in the small labeled-data regime.
arXiv Detail & Related papers (2021-11-22T09:28:13Z)
- CMT: Convolutional Neural Networks Meet Vision Transformers [68.10025999594883]
Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image.
There are still gaps in both performance and computational cost between transformers and existing convolutional neural networks (CNNs).
We propose a new transformer-based hybrid network that takes advantage of transformers to capture long-range dependencies and of CNNs to model local features; a generic sketch of this hybrid pattern follows the related-papers list.
In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet, while being 14x and 2x smaller on FLOPs than the existing DeiT and EfficientNet, respectively.
arXiv Detail & Related papers (2021-07-13T17:47:19Z)
- Scaling Vision Transformers [82.08465256393514]
We study how Vision Transformers scale and characterize the relationships between error rate, data, and compute.
We train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy.
The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
arXiv Detail & Related papers (2021-06-08T17:47:39Z)
- Aggregating Nested Transformers [42.96279765218623]
We explore the idea of nesting basic local transformers on non-overlapping image blocks and aggregating them in a hierarchical manner.
We find that the block aggregation function plays a critical role in enabling cross-block non-local information communication.
Our empirical results show that the proposed method NesT converges faster and requires much less training data to achieve good generalization.
arXiv Detail & Related papers (2021-05-26T17:56:48Z)
- Self-Supervised Learning with Swin Transformers [24.956637957269926]
We present a self-supervised learning approach called MoBY, with Vision Transformers as its backbone architecture.
The approach contains no new inventions; it is a combination of MoCo v2 and BYOL.
Its performance is slightly better than that of recent works such as MoCo v3 and DINO, which adopt DeiT as the backbone, while relying on much lighter tricks.
arXiv Detail & Related papers (2021-05-10T17:59:45Z)
- Going deeper with Image Transformers [102.61950708108022]
We build and optimize deeper transformer networks for image classification.
We make two transformer architecture changes that significantly improve the accuracy of deep transformers.
Our best model establishes a new state of the art on ImageNet with Reassessed Labels and on ImageNet-V2 (matched frequency).
arXiv Detail & Related papers (2021-03-31T17:37:32Z)
- Training data-efficient image transformers & distillation through attention [93.22667339525832]
We produce a competitive convolution-free transformer by training on ImageNet only.
Our reference vision transformer (86M parameters) achieves a top-1 accuracy of 83.1%.
arXiv Detail & Related papers (2020-12-23T18:42:10Z)
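As mentioned in the CMT entry above, hybrid designs pair convolutions for local features with self-attention for long-range dependencies. The block below is only a generic sketch of that pattern under assumed dimensions and ordering; it is not CMT's actual module.
```python
# Generic conv-plus-attention block: a depthwise convolution for local
# features followed by multi-head self-attention for long-range dependencies.
# Dimensions, grid size, and ordering are illustrative assumptions.
import torch
import torch.nn as nn


class HybridBlock(nn.Module):
    def __init__(self, dim=256, num_heads=4, grid=14):
        super().__init__()
        self.grid = grid  # patch tokens are assumed to form a grid x grid map
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, N, dim) with N == grid * grid
        B, N, D = x.shape
        # Local branch: fold tokens back into a feature map, apply depthwise conv.
        fmap = x.transpose(1, 2).reshape(B, D, self.grid, self.grid)
        x = x + self.local(fmap).flatten(2).transpose(1, 2)
        # Global branch: self-attention over all tokens.
        y = self.norm(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        return x


# Example: tokens = torch.randn(2, 14 * 14, 256); out = HybridBlock()(tokens)
```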