Visual Transformer Pruning
- URL: http://arxiv.org/abs/2104.08500v2
- Date: Tue, 20 Apr 2021 04:50:49 GMT
- Title: Visual Transformer Pruning
- Authors: Mingjian Zhu, Kai Han, Yehui Tang, Yunhe Wang
- Abstract summary: We present a visual transformer pruning approach, which identifies the impact of channels in each layer and then executes pruning accordingly.
The pipeline for visual transformer pruning is as follows: 1) training with sparsity regularization; 2) pruning channels; 3) finetuning.
The parameter and FLOPs reduction ratios of the proposed algorithm are evaluated and analyzed on the ImageNet dataset to demonstrate its effectiveness.
- Score: 44.43429237788078
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual transformers have achieved competitive performance on a variety of
computer vision applications. However, their storage, run-time memory, and
computational demands hinder deployment on mobile devices. Here we
present a visual transformer pruning approach, which identifies the impact of
channels in each layer and then executes pruning accordingly. By encouraging
channel-wise sparsity in the Transformer, important channels automatically
emerge. A large number of channels with small coefficients can be discarded to
achieve a high pruning ratio without significantly compromising accuracy. The
pipeline for visual transformer pruning is as follows: 1) training with
sparsity regularization; 2) pruning channels; 3) finetuning. The parameter and
FLOPs reduction ratios of the proposed algorithm are evaluated and analyzed on
the ImageNet dataset to demonstrate its effectiveness.
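The three-step pipeline can be illustrated with a short PyTorch sketch. This is a minimal sketch rather than the authors' implementation: the learnable per-channel gate (`GatedLinear`), the L1 penalty weight, and the `keep_ratio` threshold are assumed stand-ins for the paper's channel-importance scores and pruning criterion.

```python
import torch
import torch.nn as nn

class GatedLinear(nn.Module):
    """Linear layer whose output channels are scaled by learnable gate coefficients."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.gate = nn.Parameter(torch.ones(out_dim))  # per-channel importance

    def forward(self, x):
        return self.linear(x) * self.gate

def sparsity_penalty(model, weight=1e-4):
    """Step 1: L1 regularization that pushes unimportant gate coefficients toward zero."""
    return weight * sum(m.gate.abs().sum()
                        for m in model.modules() if isinstance(m, GatedLinear))

@torch.no_grad()
def prune_channels(layer, keep_ratio=0.5):
    """Step 2: keep only the output channels with the largest gate magnitudes."""
    k = max(1, int(layer.gate.numel() * keep_ratio))
    keep = layer.gate.abs().topk(k).indices.sort().values
    pruned = GatedLinear(layer.linear.in_features, k)
    pruned.linear.weight.copy_(layer.linear.weight[keep])
    pruned.linear.bias.copy_(layer.linear.bias[keep])
    pruned.gate.copy_(layer.gate[keep])
    return pruned

# Step 1 usage: loss = task_loss + sparsity_penalty(model)
# Step 3: finetune the pruned (smaller) model on the original task.
```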
Related papers
- Automatic Channel Pruning for Multi-Head Attention [0.11049608786515838]
We propose an automatic channel pruning method that takes the multi-head attention mechanism into account.
On ImageNet-1K, applying our pruning method to the FLattenTransformer improves accuracy at several MACs budgets.
arXiv Detail & Related papers (2024-05-31T14:47:20Z)
- Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference [14.030836300221756]
Sparse-Tuning is a novel PEFT method that accounts for the information redundancy in images and videos.
Sparse-Tuning minimizes the quantity of tokens processed at each layer, leading to a quadratic reduction in computational and memory overhead.
Our results show that Sparse-Tuning reduces GFLOPs to 62%-70% of the original ViT-B while achieving state-of-the-art performance.
arXiv Detail & Related papers (2024-05-23T15:34:53Z)
- SPION: Layer-Wise Sparse Training of Transformer via Convolutional Flood Filling [1.0128808054306186]
We propose a novel sparsification scheme for the Transformer that integrates convolution filters and the flood filling method.
Our sparsification approach reduces the computational complexity and memory footprint of the Transformer during training.
SPION achieves up to a 3.08X speedup over existing state-of-the-art sparse Transformer models.
arXiv Detail & Related papers (2023-09-22T02:14:46Z)
- Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks [126.33843752332139]
We introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed LW-Transformer.
We apply LW-Transformer to a set of Transformer-based networks, and quantitatively measure them on three vision-and-language tasks and six benchmark datasets.
Experimental results show that while saving a large number of parameters and computations, LW-Transformer achieves very competitive performance against the original Transformer networks for vision-and-language tasks.
arXiv Detail & Related papers (2022-04-16T11:30:26Z)
- Three things everyone should know about Vision Transformers [67.30250766591405]
Transformer architectures have rapidly gained traction in computer vision.
We offer three insights based on simple and easy-to-implement variants of vision transformers.
We evaluate the impact of these design choices using the ImageNet-1k dataset, and confirm our findings on the ImageNet-v2 test set.
arXiv Detail & Related papers (2022-03-18T08:23:03Z)
- AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
- Adaptive Channel Encoding Transformer for Point Cloud Analysis [6.90125287791398]
A channel convolution called Transformer-Conv is designed to encode the channels.
It can encode feature channels by capturing the potential relationship between coordinates and features.
Our method is superior to state-of-the-art point cloud classification and segmentation methods on three benchmark datasets.
arXiv Detail & Related papers (2021-12-05T08:18:00Z)
- Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer [63.99222215387881]
We propose Evo-ViT, a self-motivated slow-fast token evolution method for vision transformers.
Our method can significantly reduce the computational costs of vision transformers while maintaining comparable performance on image classification.
arXiv Detail & Related papers (2021-08-03T09:56:07Z)
- Augmented Shortcuts for Vision Transformers [49.70151144700589]
We study the relationship between shortcuts and feature diversity in vision transformer models.
We present an augmented shortcut scheme, which inserts additional paths with learnable parameters in parallel with the original shortcuts (see the sketch after this list).
Experiments conducted on benchmark datasets demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2021-06-30T09:48:30Z)
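For the augmented-shortcut scheme summarized in the last entry, a minimal sketch follows, assuming the extra parallel paths are simple learnable linear projections added to the identity shortcut; the paper's exact parameterization and number of paths may differ.

```python
import torch.nn as nn

class AugmentedShortcutBlock(nn.Module):
    """Attention block whose identity shortcut is augmented with parallel learnable paths."""
    def __init__(self, dim, num_heads=4, num_aug_paths=1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learnable paths running in parallel with the identity shortcut (assumption: plain Linear).
        self.aug_paths = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_aug_paths))

    def forward(self, x):  # x: (batch, tokens, dim)
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h)
        shortcut = x + sum(path(x) for path in self.aug_paths)  # identity + augmented paths
        return shortcut + attn_out
```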
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.