TinyViT: Fast Pretraining Distillation for Small Vision Transformers
- URL: http://arxiv.org/abs/2207.10666v1
- Date: Thu, 21 Jul 2022 17:59:56 GMT
- Title: TinyViT: Fast Pretraining Distillation for Small Vision Transformers
- Authors: Kan Wu, Jinnian Zhang, Houwen Peng, Mengchen Liu, Bin Xiao, Jianlong
Fu, Lu Yuan
- Abstract summary: We propose TinyViT, a new family of tiny, efficient vision transformers pretrained on large-scale datasets.
The central idea is to transfer knowledge from large pretrained models to small ones, while enabling the small models to reap the dividends of massive pretraining data.
- Score: 88.54212027516755
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The vision transformer (ViT) has recently drawn great attention in computer
vision due to its remarkable model capability. However, most prevailing ViT
models suffer from a huge number of parameters, restricting their applicability
on devices with limited resources. To alleviate this issue, we propose TinyViT,
a new family of tiny, efficient vision transformers pretrained on large-scale
datasets with our proposed fast distillation framework. The central idea is to
transfer knowledge from large pretrained models to small ones, while enabling
the small models to reap the dividends of massive pretraining data. More
specifically, we apply distillation during pretraining for knowledge transfer.
The logits of the large teacher models are sparsified and stored on disk in
advance to save memory cost and computation overhead. The tiny student
transformers are automatically scaled down from a large pretrained model under
computation and parameter constraints. Comprehensive experiments demonstrate
the efficacy of TinyViT. It achieves a top-1 accuracy of 84.8% on ImageNet-1k
with only 21M parameters, comparable to Swin-B pretrained on ImageNet-21k
while using 4.2 times fewer parameters. Moreover, with increased image
resolution, TinyViT reaches 86.5% accuracy, slightly better than Swin-L while
using only 11% of its parameters. Last but not least, we demonstrate the good
transferability of TinyViT on various downstream tasks. Code and models are
available at https://github.com/microsoft/Cream/tree/main/TinyViT.
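The core mechanism described in the abstract (run the large teacher once, keep only a sparse slice of its output distribution on disk, and train the student against it) can be sketched in a few lines. The sketch below is a minimal illustration, not the released TinyViT implementation: the function names (save_sparse_teacher_probs, sparse_distill_loss), the choice to store top-k softmax probabilities rather than raw logits, and the uniform spreading of the leftover probability mass over the remaining classes are all assumptions made for the example; the authoritative code is in the repository linked above.

```python
# Hedged sketch of pretraining distillation with sparsified, pre-stored teacher
# outputs. All names and design choices here are illustrative assumptions.
import torch
import torch.nn.functional as F


@torch.no_grad()
def save_sparse_teacher_probs(teacher, loader, path, k=10, temperature=1.0):
    """Run the teacher once over the pretraining data and keep only the top-k
    softmax probabilities per image, so the teacher is never re-run during
    student training."""
    teacher.eval()
    values, indices = [], []
    for images, _ in loader:
        probs = F.softmax(teacher(images) / temperature, dim=-1)  # (B, num_classes)
        topk = probs.topk(k, dim=-1)
        values.append(topk.values.cpu())
        indices.append(topk.indices.cpu())
    torch.save({"values": torch.cat(values), "indices": torch.cat(indices)}, path)


def sparse_distill_loss(student_logits, topk_values, topk_indices, temperature=1.0):
    """KL distillation loss against a dense teacher distribution rebuilt from the
    stored top-k probabilities; the leftover probability mass is spread uniformly
    over the remaining classes (a simple reconstruction choice assumed here)."""
    topk_values = topk_values.to(student_logits.device)
    topk_indices = topk_indices.to(student_logits.device)
    batch, num_classes = student_logits.shape
    k = topk_values.size(-1)
    # Leftover mass per sample, shared equally by the (num_classes - k) other classes.
    leftover = (1.0 - topk_values.sum(dim=-1, keepdim=True)).clamp(min=0.0)
    teacher_probs = (leftover / (num_classes - k)).expand(batch, num_classes).clone()
    teacher_probs.scatter_(1, topk_indices, topk_values)  # restore the stored top-k entries
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_student, teacher_probs, reduction="batchmean") * temperature ** 2
```

In the full framework the stored outputs also have to stay aligned with the exact augmented views the student sees, and the student architecture itself is scaled down from a larger pretrained model under parameter and computation budgets; both pieces of bookkeeping are omitted from this sketch.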
Related papers
- How to Train Vision Transformer on Small-scale Datasets? [4.56717163175988]
In contrast to convolutional neural networks, Vision Transformers lack inherent inductive biases.
We show that self-supervised inductive biases can be learned directly from small-scale datasets.
This allows these models to be trained without large-scale pre-training, changes to the model architecture, or changes to the loss function.
arXiv Detail & Related papers (2022-10-13T17:59:19Z)
- Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution-guided distillation scheme for fully quantized vision transformers (Q-ViT).
Our method achieves much better performance than prior art.
arXiv Detail & Related papers (2022-10-13T04:00:29Z)
- Super Vision Transformer [131.4777773281238]
Experimental results on ImageNet demonstrate that our SuperViT can considerably reduce the computational costs of ViT models while even increasing performance.
Our SuperViT significantly outperforms existing studies on efficient vision transformers.
arXiv Detail & Related papers (2022-05-23T15:42:12Z)
- MiniViT: Compressing Vision Transformers with Weight Multiplexing [88.54212027516755]
Vision Transformer (ViT) models have recently drawn much attention in computer vision due to their high model capability.
MiniViT is a new compression framework that achieves parameter reduction in vision transformers while retaining the same performance.
arXiv Detail & Related papers (2022-04-14T17:59:05Z)
- Scaling Vision Transformers [82.08465256393514]
We study how Vision Transformers scale and characterize the relationships between error rate, data, and compute.
We train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy.
The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
arXiv Detail & Related papers (2021-06-08T17:47:39Z)
- Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet [86.95679590801494]
We explore the potential of vision transformers in ImageNet classification by developing a bag of training techniques.
We show that by slightly tuning the structure of vision transformers and introducing token labeling, our models are able to achieve better results than their CNN counterparts.
arXiv Detail & Related papers (2021-04-22T04:43:06Z)
- Escaping the Big Data Paradigm with Compact Transformers [7.697698018200631]
We show for the first time that with the right size and tokenization, transformers can perform head-to-head with state-of-the-art CNNs on small datasets.
Our method is flexible in terms of model size: it can have as few as 0.28M parameters and still achieve reasonable results.
arXiv Detail & Related papers (2021-04-12T17:58:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including this summary) and is not responsible for any consequences of its use.