Tokens-to-Token ViT: Training Vision Transformers from Scratch on
ImageNet
- URL: http://arxiv.org/abs/2101.11986v1
- Date: Thu, 28 Jan 2021 13:25:28 GMT
- Title: Tokens-to-Token ViT: Training Vision Transformers from Scratch on
ImageNet
- Authors: Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Francis EH Tay,
Jiashi Feng, Shuicheng Yan
- Abstract summary: We propose a new Tokens-To-Token Vision Transformer (T2T-ViT) to solve vision tasks.
T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 2.5% improvement when trained from scratch on ImageNet.
For example, a T2T-ViT of size comparable to ResNet50 achieves 80.7% top-1 accuracy on ImageNet.
- Score: 128.96032932640364
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers, which are popular for language modeling, have been explored for
solving vision tasks recently, e.g., the Vision Transformer (ViT) for image
classification. The ViT model splits each image into a sequence of tokens with
fixed length and then applies multiple Transformer layers to model their global
relation for classification. However, ViT achieves inferior performance
compared with CNNs when trained from scratch on a midsize dataset (e.g.,
ImageNet). We find it is because: 1) the simple tokenization of input images
fails to model the important local structure (e.g., edges, lines) among
neighboring pixels, leading to its low training sample efficiency; 2) the
redundant attention backbone design of ViT leads to limited feature richness in
fixed computation budgets and limited training samples.
To overcome such limitations, we propose a new Tokens-To-Token Vision
Transformer (T2T-ViT), which introduces 1) a layer-wise Tokens-to-Token (T2T)
transformation to progressively structurize the image into tokens by recursively
aggregating neighboring tokens into one token (Tokens-to-Token), such that the
local structure represented by surrounding tokens can be modeled and the token
length can be reduced; 2) an efficient backbone with a deep-narrow structure
for vision transformers motivated by CNN architecture design after extensive
study. Notably, T2T-ViT reduces the parameter count and MACs of vanilla ViT by
half, while achieving more than 2.5% improvement when trained from scratch on
ImageNet. It also outperforms ResNets and achieves comparable performance with
MobileNets when trained directly on ImageNet. For example, a T2T-ViT of size
comparable to ResNet50 achieves 80.7% top-1 accuracy on ImageNet. (Code:
https://github.com/yitu-opensource/T2T-ViT)
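A minimal sketch may help make the Tokens-to-Token step concrete. The PyTorch code below is an illustration only: the module names, hyper-parameters, and the use of nn.TransformerEncoderLayer and nn.Unfold are assumptions made here, not the authors' implementation (which lives in the linked repository). It models relations among the current tokens, re-structurizes them into a 2D grid, and then aggregates each overlapping neighborhood into one longer token, shrinking the token length:

```python
# Minimal sketch of one Tokens-to-Token (T2T) step, assuming PyTorch.
# Hyper-parameters and module choices are illustrative, not the official code.
import torch
import torch.nn as nn


class T2TStep(nn.Module):
    def __init__(self, dim, kernel=3, stride=2, padding=1):
        super().__init__()
        # Model relations among the current tokens (stand-in for the paper's
        # token transformer).
        self.attn = nn.TransformerEncoderLayer(
            d_model=dim, nhead=1, dim_feedforward=2 * dim, batch_first=True)
        # "Soft split": gather overlapping k x k neighborhoods of the token grid.
        self.unfold = nn.Unfold(kernel_size=kernel, stride=stride, padding=padding)
        self.kernel, self.stride, self.padding = kernel, stride, padding

    def forward(self, tokens, h, w):
        # tokens: (B, h*w, dim)
        tokens = self.attn(tokens)
        # Re-structurize the token sequence into an image-like grid (B, dim, h, w).
        grid = tokens.transpose(1, 2).reshape(tokens.size(0), -1, h, w)
        # Aggregate each overlapping neighborhood into one token: local structure
        # is folded into the channel dimension and the token length shrinks.
        out = self.unfold(grid).transpose(1, 2)  # (B, new_h*new_w, dim*k*k)
        new_h = (h + 2 * self.padding - self.kernel) // self.stride + 1
        new_w = (w + 2 * self.padding - self.kernel) // self.stride + 1
        return out, new_h, new_w


# Toy usage: a 56x56 grid of 64-d tokens shrinks to a 28x28 grid of wider tokens.
step = T2TStep(dim=64)
x = torch.randn(2, 56 * 56, 64)
y, nh, nw = step(x, 56, 56)
print(y.shape, nh, nw)  # torch.Size([2, 784, 576]) 28 28
```

In the full model this step is applied iteratively before the deep-narrow Transformer backbone described in the abstract; see the official repository for the exact configuration.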
Related papers
- Make A Long Image Short: Adaptive Token Length for Vision Transformers [5.723085628967456]
We propose an innovative approach to accelerate the ViT model by shortening long images.
Specifically, we introduce a method for adaptively assigning token length for each image at test time to accelerate inference speed.
arXiv Detail & Related papers (2023-07-05T08:10:17Z)
- Making Vision Transformers Efficient from A Token Sparsification View [26.42498120556985]
We propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers.
Our method can achieve competitive results compared to the original networks in object detection and instance segmentation, with over 30% FLOPs reduction for the backbone.
In addition, we design an STViT-R(ecover) network to restore detailed spatial information on top of STViT, making it work for downstream tasks.
arXiv Detail & Related papers (2023-03-15T15:12:36Z)
- Make A Long Image Short: Adaptive Token Length for Vision Transformers [17.21663067385715]
A vision transformer splits each image into a fixed-length sequence of tokens and processes them in the same way as words in natural language processing.
We propose a novel approach to assign token length adaptively during inference.
arXiv Detail & Related papers (2021-12-03T02:48:51Z)
- Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT performs 3.8% higher than the vanilla ViT in terms of top-1 accuracy.
arXiv Detail & Related papers (2021-08-03T18:04:31Z)
- Scaling Vision Transformers [82.08465256393514]
We study how Vision Transformers scale and characterize the relationships between error rate, data, and compute.
We train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy.
The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
arXiv Detail & Related papers (2021-06-08T17:47:39Z)
- So-ViT: Mind Visual Tokens for Vision Transformer [27.243241133304785]
We propose a new classification paradigm, in which second-order, cross-covariance pooling of visual tokens is combined with the class token for final classification.
We develop a light-weight, hierarchical module based on off-the-shelf convolutions for visual token embedding.
The results show our models, when trained from scratch, outperform the competing ViT variants, while being on par with or better than state-of-the-art CNN models.
arXiv Detail & Related papers (2021-04-22T09:05:09Z)
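As a rough illustration of the second-order pooling idea summarized above, the sketch below (assuming PyTorch; the dimension reduction, centering, and additive fusion are illustrative choices, not So-ViT's actual design) computes a covariance of the visual-token features and combines the resulting logits with the class-token logits:

```python
# Rough sketch of second-order (covariance) pooling of visual tokens fused with
# the class token, assuming PyTorch. The dimension reduction, normalization, and
# fusion here are illustrative assumptions, not the So-ViT implementation.
import torch
import torch.nn as nn


class SecondOrderHead(nn.Module):
    def __init__(self, dim, reduced=32, num_classes=1000):
        super().__init__()
        self.reduce = nn.Linear(dim, reduced)           # keep the covariance small
        self.cls_head = nn.Linear(dim, num_classes)     # usual class-token head
        self.cov_head = nn.Linear(reduced * reduced, num_classes)

    def forward(self, cls_token, visual_tokens):
        # visual_tokens: (B, N, dim); cls_token: (B, dim)
        z = self.reduce(visual_tokens)                  # (B, N, r)
        z = z - z.mean(dim=1, keepdim=True)             # center over tokens
        cov = z.transpose(1, 2) @ z / z.size(1)         # (B, r, r) covariance
        cov_logits = self.cov_head(cov.flatten(1))      # second-order statistics
        cls_logits = self.cls_head(cls_token)           # first-order class token
        return cls_logits + cov_logits                  # simple additive fusion


head = SecondOrderHead(dim=384)
logits = head(torch.randn(2, 384), torch.randn(2, 196, 384))
print(logits.shape)  # torch.Size([2, 1000])
```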
- Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet [86.95679590801494]
We explore the potential of vision transformers in ImageNet classification by developing a bag of training techniques.
We show that by slightly tuning the structure of vision transformers and introducing token labeling, our models are able to achieve better results than their CNN counterparts.
arXiv Detail & Related papers (2021-04-22T04:43:06Z)
- CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [17.709880544501758]
We propose a dual-branch transformer to combine image patches of different sizes to produce stronger image features.
Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity.
Our proposed cross-attention requires only linear, rather than quadratic, computational and memory complexity.
arXiv Detail & Related papers (2021-03-27T13:03:17Z)
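The linear-complexity claim is easiest to see with a single query: if only the class token of one branch attends to the other branch's patch tokens, the attention map is 1 x N instead of N x N. The sketch below is a generic single-query cross-attention in PyTorch; the shared token width and the residual fusion are illustrative assumptions rather than CrossViT's exact design:

```python
# Generic single-query cross-attention, assuming PyTorch: one branch's class
# token attends to the other branch's patch tokens, so cost is O(N) rather than
# O(N^2). Dimensions and projections are illustrative, not CrossViT's exact ones.
import torch
import torch.nn as nn


class SingleQueryCrossAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, cls_token, other_tokens):
        # cls_token: (B, 1, dim) from one branch; other_tokens: (B, N, dim)
        q = self.q(cls_token)                        # (B, 1, dim)
        k = self.k(other_tokens)                     # (B, N, dim)
        v = self.v(other_tokens)                     # (B, N, dim)
        attn = (q @ k.transpose(1, 2)) * self.scale  # (B, 1, N): linear in N
        attn = attn.softmax(dim=-1)
        return cls_token + attn @ v                  # fused class token, (B, 1, dim)


xattn = SingleQueryCrossAttention(dim=192)
fused = xattn(torch.randn(2, 1, 192), torch.randn(2, 400, 192))
print(fused.shape)  # torch.Size([2, 1, 192])
```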
- DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have recently been applied successfully to image classification tasks.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when scaled deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity.
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
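Re-attention, as summarized above, regenerates the attention maps by mixing them across heads with a learnable transformation. The PyTorch sketch below is schematic; the head count, scaling, and placement of the mixing matrix are illustrative assumptions, not DeepViT's exact configuration:

```python
# Schematic Re-attention, assuming PyTorch: per-head attention maps are mixed by
# a learnable head-to-head matrix to increase their diversity. Settings here are
# illustrative assumptions, not DeepViT's exact configuration.
import torch
import torch.nn as nn


class ReAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.theta = nn.Parameter(torch.eye(heads))   # learnable head-mixing matrix
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, _ = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)          # each (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        attn = attn.softmax(dim=-1)                   # (B, heads, N, N)
        # Re-attention: linearly combine attention maps across heads.
        attn = torch.einsum('hg,bgnm->bhnm', self.theta, attn)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.proj(out)


layer = ReAttention(dim=64, heads=8)
print(layer(torch.randn(2, 50, 64)).shape)  # torch.Size([2, 50, 64])
```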
- Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
It brings a great benefit by scaling dimensions of depth/width/resolution/patch size without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z)
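Progressively pooling visual tokens, as in the HVT summary above, can be sketched by downsampling the token sequence between transformer stages. The stage layout and max-pooling choice below (in PyTorch) are illustrative assumptions, not HVT's exact architecture:

```python
# Minimal sketch of pooling visual tokens between stages to shrink the sequence
# length, assuming PyTorch. Stage layout and pooling choice are illustrative
# assumptions, not HVT's exact architecture.
import torch
import torch.nn as nn


class PooledStages(nn.Module):
    def __init__(self, dim=192, heads=3, blocks_per_stage=(2, 2, 2)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(
                    d_model=dim, nhead=heads, dim_feedforward=4 * dim,
                    batch_first=True),
                num_layers=n)
            for n in blocks_per_stage])
        # Roughly halve the token count after each stage (tokens as a 1D sequence).
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, tokens):
        # tokens: (B, N, dim)
        for i, stage in enumerate(self.stages):
            tokens = stage(tokens)
            if i < len(self.stages) - 1:   # no pooling after the last stage
                tokens = self.pool(tokens.transpose(1, 2)).transpose(1, 2)
        return tokens


model = PooledStages()
out = model(torch.randn(2, 196, 192))
print(out.shape)  # sequence shrinks 196 -> 98 -> 49 across stages: (2, 49, 192)
```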
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.