Unified Visual Transformer Compression
- URL: http://arxiv.org/abs/2203.08243v1
- Date: Tue, 15 Mar 2022 20:38:22 GMT
- Title: Unified Visual Transformer Compression
- Authors: Shixing Yu, Tianlong Chen, Jiayi Shen, Huan Yuan, Jianchao Tan, Sen
Yang, Ji Liu, Zhangyang Wang
- Abstract summary: This paper proposes a unified ViT compression framework that seamlessly assembles three effective techniques: pruning, layer skipping, and knowledge distillation.
We formulate a budget-constrained, end-to-end optimization framework, targeting jointly learning model weights, layer-wise pruning ratios/masks, and skip configurations.
Experiments are conducted with several ViT variants, e.g. DeiT and T2T-ViT backbones on the ImageNet dataset, and our approach consistently outperforms recent competitors.
- Score: 102.26265546836329
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers (ViTs) have gained popularity recently. Even without
customized image operators such as convolutions, ViTs can yield competitive
performance when properly trained on massive data. However, the computational
overhead of ViTs remains prohibitive, due to the stacked multi-head
self-attention modules, among other components. Compared to the vast literature
on, and the prevailing success of, compressing convolutional neural networks,
the study of Vision Transformer compression has only just emerged, and existing
works focus on one or two aspects of compression. This paper proposes a unified
ViT compression framework
that seamlessly assembles three effective techniques: pruning, layer skipping,
and knowledge distillation. We formulate a budget-constrained, end-to-end
optimization framework, targeting jointly learning model weights, layer-wise
pruning ratios/masks, and skip configurations, under a distillation loss. The
optimization problem is then solved using the primal-dual algorithm.
Experiments are conducted with several ViT variants, e.g. DeiT and T2T-ViT
backbones on the ImageNet dataset, and our approach consistently outperforms
recent competitors. For example, DeiT-Tiny can be trimmed down to 50% of the
original FLOPs with almost no loss in accuracy. Code is available online:
https://github.com/VITA-Group/UVC.
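To make the formulation above concrete, here is a minimal sketch of a budget-constrained primal-dual step in PyTorch-style code: model weights and relaxed prune/skip gates are trained jointly under a task loss plus a distillation loss, while a Lagrange multiplier penalizes exceeding a FLOPs budget. The interfaces, the gate parameterization, and the FLOPs proxy are illustrative assumptions, not the released UVC implementation.
```python
# Illustrative primal-dual sketch of the budget-constrained objective described
# in the abstract. Names and interfaces are assumptions, not the UVC code.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Standard soft-label KD loss (one common choice, assumed here).
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

def flops_proxy(gates, per_block_flops):
    # Differentiable FLOPs estimate: each relaxed gate in (0, 1) scales its block's cost.
    return (torch.sigmoid(gates) * per_block_flops).sum()

def primal_dual_step(student, teacher, gates, per_block_flops, budget,
                     images, labels, optimizer, lam, dual_lr=0.01):
    # Primal step: descend on CE + KD + lam * (estimated FLOPs - budget).
    # The student is assumed to consume the relaxed gates to softly skip/prune blocks.
    student_logits = student(images, torch.sigmoid(gates))
    with torch.no_grad():
        teacher_logits = teacher(images)
    violation = flops_proxy(gates, per_block_flops) - budget
    loss = (F.cross_entropy(student_logits, labels)
            + distillation_loss(student_logits, teacher_logits)
            + lam * violation)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Dual step: ascend on the multiplier, projected to stay non-negative.
    lam = max(0.0, lam + dual_lr * violation.item())
    return loss.item(), lam
```
In this sketch the multiplier only pushes back when the estimated FLOPs exceed the budget, mirroring the budget-constrained formulation; gates driven toward zero would indicate blocks that can be skipped or pruned.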
Related papers
- DiffRate: Differentiable Compression Rate for Efficient Vision
Transformers [98.33906104846386]
Token compression aims to speed up large-scale vision transformers (e.g. ViTs) by pruning (dropping) or merging tokens.
DiffRate is a novel token compression method with several appealing properties that prior arts lack.
arXiv Detail & Related papers (2023-05-29T10:15:19Z)
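The DiffRate entry above refers to token compression, i.e., dropping or merging tokens between ViT blocks. As a generic illustration only (not DiffRate's differentiable compression-rate method), a simple top-k token-pruning step keyed on [CLS] attention might look like the following; the importance score and keep ratio are assumptions.
```python
# Generic token pruning between ViT blocks: keep the top-k tokens ranked by a
# simple importance score (attention received from the [CLS] token).
# Illustrative baseline only, not DiffRate's differentiable-rate method.
import torch

def prune_tokens(tokens, cls_attention, keep_ratio=0.7):
    """
    tokens:        (B, N, D) patch tokens, excluding the [CLS] token
    cls_attention: (B, N) attention weight each patch receives from [CLS]
    Returns the kept (B, k, D) tokens, preserving their original order.
    """
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    topk = cls_attention.topk(k, dim=1).indices      # indices of most-attended tokens
    topk, _ = topk.sort(dim=1)                       # keep spatial order
    return tokens.gather(1, topk.unsqueeze(-1).expand(B, k, D))

# Usage sketch: tokens = prune_tokens(tokens, attn[:, :, 0, 1:].mean(dim=1)),
# where attn is the (B, heads, N+1, N+1) attention map of the preceding block.
```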
- RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving [80.14669385741202]
Vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks.
ViTs are notoriously hard to train and require a lot of training data to learn powerful representations.
We show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and Semantic KITTI.
arXiv Detail & Related papers (2023-01-24T18:50:48Z)
- Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice [111.47461527901318]
Vision Transformer (ViT) has recently demonstrated promise in computer vision problems.
ViT saturates quickly as depth increases, due to the observed attention collapse or patch uniformity.
We propose two techniques to mitigate the undesirable low-pass limitation.
arXiv Detail & Related papers (2022-03-09T23:55:24Z)
- Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z)
- A Unified Pruning Framework for Vision Transformers [40.7622551128182]
Vision transformers (ViTs) and their variants have achieved promising performance in various computer vision tasks.
We propose a unified framework for structural pruning of ViTs and their variants, namely UP-ViTs.
Our method focuses on pruning all ViT components while maintaining the consistency of the model structure.
arXiv Detail & Related papers (2021-11-30T05:01:02Z)
- DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when scaled deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity.
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
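The DeepViT entry describes Re-attention, which regenerates attention maps to increase their diversity across heads. A minimal sketch of that idea is below: the per-head attention maps are mixed with a learnable head-to-head matrix before being applied to the values. Dimension handling and the normalization choice here are simplified assumptions rather than the paper's exact module.
```python
# Sketch of a Re-attention-style block: recombine attention maps across heads with
# a learnable mixing matrix to counteract attention collapse in deep ViTs.
import torch
import torch.nn as nn

class ReAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Learnable head-mixing matrix (the core idea of Re-attention).
        self.theta = nn.Parameter(torch.eye(num_heads))
        self.norm = nn.BatchNorm2d(num_heads)  # normalize mixed attention maps (assumed choice)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, H, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)                     # (B, H, N, N)
        # Re-attention: linearly recombine attention maps across heads.
        attn = torch.einsum("hg,bgnm->bhnm", self.theta, attn)
        attn = self.norm(attn)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```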