Chasing Sparsity in Vision Transformers: An End-to-End Exploration
- URL: http://arxiv.org/abs/2106.04533v2
- Date: Wed, 9 Jun 2021 22:28:31 GMT
- Title: Chasing Sparsity in Vision Transformers: An End-to-End Exploration
- Authors: Tianlong Chen, Yu Cheng, Zhe Gan, Lu Yuan, Lei Zhang, Zhangyang Wang
- Abstract summary: Vision transformers (ViTs) have recently received explosive popularity, but their enormous model sizes and training costs remain daunting.
This paper aims to trim down both the training memory overhead and the inference complexity, without sacrificing the achievable accuracy.
Specifically, instead of training full ViTs, we dynamically extract and train sparse subnetworks, while sticking to a fixed small parameter budget.
- Score: 127.10054032751714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers (ViTs) have recently received explosive popularity, but
their enormous model sizes and training costs remain daunting. Conventional
post-training pruning often incurs higher training budgets. In contrast, this
paper aims to trim down both the training memory overhead and the inference
complexity, without sacrificing the achievable accuracy. We launch and report
the first-of-its-kind comprehensive exploration, on taking a unified approach
of integrating sparsity in ViTs "from end to end". Specifically, instead of
training full ViTs, we dynamically extract and train sparse subnetworks, while
sticking to a fixed small parameter budget. Our approach jointly optimizes
model parameters and explores connectivity throughout training, ending up with
one sparse network as the final output. The approach is seamlessly extended
from unstructured to structured sparsity, the latter by guiding the
prune-and-grow of self-attention heads inside ViTs. For additional
efficiency gains, we further co-explore data and architecture sparsity, by
plugging in a novel learnable token selector to adaptively determine the
currently most vital patches. Extensive results on ImageNet with diverse ViT
backbones validate the effectiveness of our proposals which obtain
significantly reduced computational cost and almost unimpaired generalization.
Perhaps most surprisingly, we find that the proposed sparse (co-)training can
even improve the ViT accuracy rather than compromising it, making sparsity a
tantalizing "free lunch". For example, our sparsified DeiT-Small at (5%, 50%)
sparsity for (data, architecture), improves 0.28% top-1 accuracy, and meanwhile
enjoys 49.32% FLOPs and 4.40% running time savings. Our codes are available at
https://github.com/VITA-Group/SViTE.
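To make the mechanism concrete, the sketch below illustrates the two ingredients the abstract describes: a prune-and-grow update that explores connectivity while keeping a fixed parameter budget, and a learnable token selector that keeps only the most vital patch tokens. This is a minimal illustration, not the authors' implementation; the drop-by-magnitude / grow-by-gradient heuristic and the names prune_and_grow and TokenSelector are assumptions made here for clarity, and the released code at https://github.com/VITA-Group/SViTE is the authoritative reference.

```python
# Illustrative sketch only; heuristics and names are assumptions, not SViTE's exact code.
import torch
import torch.nn as nn


def prune_and_grow(weight, grad, mask, update_frac=0.1):
    """One connectivity update under a fixed parameter budget.

    Drops the smallest-magnitude active weights, then regrows the same number
    of inactive connections with the largest gradient magnitude (a common
    dynamic-sparse-training heuristic, assumed here for illustration).
    """
    active = mask.bool()
    k = max(1, int(update_frac * int(active.sum())))

    # Prune: deactivate the k active weights with the smallest magnitude.
    w_mag = weight.abs().masked_fill(~active, float("inf"))
    drop = torch.topk(w_mag.view(-1), k, largest=False).indices
    mask.view(-1)[drop] = 0.0

    # Grow: activate the k inactive positions with the largest gradient magnitude,
    # so the number of active weights (the budget) stays constant.
    g_mag = grad.abs().masked_fill(mask.bool(), float("-inf"))
    grow = torch.topk(g_mag.view(-1), k, largest=True).indices
    mask.view(-1)[grow] = 1.0
    weight.data.view(-1)[grow] = 0.0  # newly grown weights start from zero
    return mask


class TokenSelector(nn.Module):
    """Scores patch tokens and keeps only the top-k most informative ones."""

    def __init__(self, dim, keep_ratio=0.5):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # learnable per-token importance score
        self.keep_ratio = keep_ratio

    def forward(self, tokens):  # tokens: (B, N, dim), patch tokens only
        scores = self.scorer(tokens).squeeze(-1).sigmoid()      # (B, N)
        k = max(1, int(self.keep_ratio * tokens.size(1)))
        keep = torch.topk(scores, k, dim=1).indices             # (B, k)
        kept = torch.gather(
            tokens, 1, keep.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        # Re-weight kept tokens by their scores so the scorer receives gradients;
        # the paper's selector may instead use a differentiable relaxation
        # (e.g., Gumbel-softmax), which is not reproduced here.
        return kept * torch.gather(scores, 1, keep).unsqueeze(-1)
```

In a full training loop, such a prune-and-grow update would be applied to each sparsified weight matrix every few hundred iterations, with the masked model trained as usual in between; the token selector sits after the patch embedding and shrinks the sequence the transformer blocks must process.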
Related papers
- Peeling the Onion: Hierarchical Reduction of Data Redundancy for
Efficient Vision Transformer Training [110.79400526706081]
Vision transformers (ViTs) have recently obtained success in many applications, but their intensive computation and heavy memory usage limit their generalization.
Previous compression algorithms usually start from the pre-trained dense models and only focus on efficient inference.
This paper proposes an end-to-end efficient training framework from three sparse perspectives, dubbed Tri-Level E-ViT.
arXiv Detail & Related papers (2022-11-19T21:15:47Z) - Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully
Exploiting Self-Attention [36.90363317158731]
We propose an adaptive sparse token pruning framework with a minimal cost.
Our method improves the throughput of DeiT-S by 50% and brings only 0.2% drop in top-1 accuracy.
arXiv Detail & Related papers (2022-09-28T03:07:32Z) - Unified Visual Transformer Compression [102.26265546836329]
This paper proposes a unified ViT compression framework that seamlessly assembles three effective techniques: pruning, layer skipping, and knowledge distillation.
We formulate a budget-constrained, end-to-end optimization framework, targeting jointly learning model weights, layer-wise pruning ratios/masks, and skip configurations.
Experiments are conducted with several ViT variants, e.g. DeiT and T2T-ViT backbones on the ImageNet dataset, and our approach consistently outperforms recent competitors.
arXiv Detail & Related papers (2022-03-15T20:38:22Z) - Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z) - Sparsity Winning Twice: Better Robust Generalization from More Efficient
Training [94.92954973680914]
We introduce two alternatives for sparse adversarial training: (i) static sparsity and (ii) dynamic sparsity.
We find that both methods yield a win-win: substantially shrinking the robust generalization gap and alleviating robust overfitting.
Our approaches can be combined with existing regularizers, establishing new state-of-the-art results in adversarial training.
arXiv Detail & Related papers (2022-02-20T15:52:08Z) - Pyramid Adversarial Training Improves ViT Performance [43.322865996422664]
Pyramid Adversarial Training is a simple and effective technique to improve ViT's overall performance.
It leads to 1.82% absolute improvement on ImageNet clean accuracy for the ViT-B model when trained only on ImageNet-1K data.
arXiv Detail & Related papers (2021-11-30T04:38:14Z) - Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z)