Super Vision Transformer
- URL: http://arxiv.org/abs/2205.11397v5
- Date: Wed, 19 Jul 2023 08:25:37 GMT
- Title: Super Vision Transformer
- Authors: Mingbao Lin, Mengzhao Chen, Yuxin Zhang, Chunhua Shen, Rongrong Ji,
Liujuan Cao
- Abstract summary: Experimental results on ImageNet demonstrate that our SuperViT considerably reduces the computational costs of ViT models while even improving performance.
Our SuperViT significantly outperforms existing studies on efficient vision transformers.
- Score: 131.4777773281238
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We attempt to reduce the computational costs of vision transformers
(ViTs), which grow quadratically with the number of tokens. We present a novel
training paradigm that trains only one ViT model at a time, yet is capable of
providing improved image recognition performance at various computational
costs. Here, the trained ViT model, termed super vision transformer (SuperViT),
is empowered with the versatile ability to process incoming patches of multiple
sizes as well as to preserve informative tokens at multiple keeping rates (the
ratio of tokens retained), so as to achieve good hardware efficiency for
inference, given that the available hardware resources often change over time.
Experimental results on ImageNet demonstrate that our SuperViT can considerably
reduce the computational costs of ViT models while even improving performance.
For example, we reduce the FLOPs of DeiT-S by 2x while increasing Top-1
accuracy by 0.2%, and by 0.7% for a 1.5x reduction. Our SuperViT also
significantly outperforms existing studies on efficient vision transformers.
For example, at the same FLOPs budget, our SuperViT surpasses the recent
state-of-the-art (SOTA) EViT by 1.1% when using DeiT-S as the backbone. The
code of this work is publicly available at
https://github.com/lmbxmu/SuperViT.
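For intuition, below is a minimal sketch of the one-model, many-configurations idea described in the abstract: a single shared encoder is trained under several sampled patch sizes and token keeping rates, so that any configuration can be picked at inference to fit the current hardware budget. Everything in the sketch (the TinyViT module, the norm-based token scoring, the candidate patch sizes and keeping rates) is an illustrative assumption rather than the authors' implementation; see the official repository above for the real code.

```python
# A minimal, self-contained sketch (not the authors' code) of training one shared
# transformer under several patch sizes and token keeping rates, in the spirit of
# the SuperViT paradigm. Positional embeddings and the paper's actual
# token-selection rule are omitted for brevity.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCH_SIZES = [16, 32]        # assumed candidate patch sizes
KEEP_RATES = [1.0, 0.7, 0.5]  # assumed candidate token keeping rates

class TinyViT(nn.Module):
    """Toy ViT-like classifier: one patch embedding per patch size, shared encoder."""
    def __init__(self, dim=192, depth=4, heads=3, num_classes=1000):
        super().__init__()
        self.embed = nn.ModuleDict({str(p): nn.Linear(3 * p * p, dim) for p in PATCH_SIZES})
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images, patch_size, keep_rate=1.0):
        # (B, 3, H, W) -> (B, N, 3*p*p) patches -> (B, N, dim) tokens
        patches = F.unfold(images, kernel_size=patch_size, stride=patch_size).transpose(1, 2)
        tokens = self.embed[str(patch_size)](patches)
        if keep_rate < 1.0:  # keep only the highest-scoring tokens (crude norm-based proxy)
            n_keep = max(1, int(tokens.shape[1] * keep_rate))
            idx = tokens.norm(dim=-1).topk(n_keep, dim=1).indices
            tokens = tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
        x = torch.cat([self.cls.expand(tokens.shape[0], -1, -1), tokens], dim=1)
        return self.head(self.encoder(x)[:, 0])

model = TinyViT()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(3):  # toy loop with random data; real training iterates over ImageNet
    images = torch.randn(2, 3, 224, 224)
    labels = torch.randint(0, 1000, (2,))
    # One shared model is optimized across sampled (patch size, keeping rate) configs,
    # so any of them can be selected at inference to match the hardware budget.
    loss = sum(F.cross_entropy(model(images, p, random.choice(KEEP_RATES)), labels)
               for p in PATCH_SIZES)
    opt.zero_grad()
    loss.backward()
    opt.step()
```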
Related papers
- DeViT: Decomposing Vision Transformers for Collaborative Inference in
Edge Devices [42.89175608336226]
Vision transformer (ViT) has achieved state-of-the-art performance on multiple computer vision benchmarks.
ViT models suffer from vast amounts of parameters and high computation cost, leading to difficult deployment on resource-constrained edge devices.
We propose a collaborative inference framework termed DeViT to facilitate edge deployment by decomposing large ViTs.
arXiv Detail & Related papers (2023-09-10T12:26:17Z) - Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution-guided distillation scheme for fully quantized vision transformers (Q-ViT).
Our method achieves much better performance than prior arts.
arXiv Detail & Related papers (2022-10-13T04:00:29Z) - AdaptFormer: Adapting Vision Transformers for Scalable Visual
Recognition [39.443380221227166]
We propose an effective adaptation approach for Transformer, namely AdaptFormer.
It can adapt the pre-trained ViTs into many different image and video tasks efficiently.
It is able to increase the ViT's transferability without updating its original pre-trained parameters.
arXiv Detail & Related papers (2022-05-26T17:56:15Z) - MiniViT: Compressing Vision Transformers with Weight Multiplexing [88.54212027516755]
Vision Transformer (ViT) models have recently drawn much attention in computer vision due to their high model capability.
MiniViT is a new compression framework, which achieves parameter reduction in vision transformers while retaining the same performance.
arXiv Detail & Related papers (2022-04-14T17:59:05Z) - Improving Vision Transformers by Revisiting High-frequency Components [106.7140968644414]
We show that Vision Transformer (ViT) models are less effective in capturing the high-frequency components of images than CNN models.
To compensate, we propose HAT, which directly augments high-frequency components of images via adversarial training.
We show that HAT can consistently boost the performance of various ViT models.
arXiv Detail & Related papers (2022-04-03T05:16:51Z) - Coarse-to-Fine Vision Transformer [83.45020063642235]
We propose a coarse-to-fine vision transformer (CF-ViT) to relieve computational burden while retaining performance.
Our proposed CF-ViT is motivated by two important observations in modern ViT models.
Our CF-ViT reduces the FLOPs of LV-ViT by 53% and also achieves a 2.01x throughput improvement.
arXiv Detail & Related papers (2022-03-08T02:57:49Z) - Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become the popular structures and outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z) - Vision Xformers: Efficient Attention for Image Classification [0.0]
We modify the ViT architecture to work on longer sequence data by replacing the quadratic attention with efficient transformers.
We show that ViX performs better than ViT in image classification while consuming fewer computing resources.
arXiv Detail & Related papers (2021-07-05T19:24:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.