CP-ViT: Cascade Vision Transformer Pruning via Progressive Sparsity Prediction
- URL: http://arxiv.org/abs/2203.04570v1
- Date: Wed, 9 Mar 2022 08:15:14 GMT
- Title: CP-ViT: Cascade Vision Transformer Pruning via Progressive Sparsity Prediction
- Authors: Zhuoran Song, Yihong Xu, Zhezhi He, Li Jiang, Naifeng Jing, and Xiaoyao Liang
- Abstract summary: Vision transformer (ViT) has achieved competitive accuracy on a variety of computer vision applications, but its computational cost impedes deployment on resource-limited mobile devices.
We propose a cascade pruning framework named CP-ViT that predicts sparsity in ViT models progressively and dynamically, reducing computational redundancy while minimizing accuracy loss.
- Score: 16.578899848650675
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision transformer (ViT) has achieved competitive accuracy on a
variety of computer vision applications, but its computational cost impedes
deployment on resource-limited mobile devices.
We explore sparsity in ViT models and observe that the informative patches and
heads alone are sufficient for accurate image recognition.
In this paper, we propose a cascade pruning framework named CP-ViT that
predicts sparsity in ViT models progressively and dynamically to reduce
computational redundancy while minimizing accuracy loss. Specifically, we
define a cumulative score that retains the informative patches and heads
across the ViT model for better accuracy. We also propose a dynamic
pruning-ratio adjustment technique based on the layer-aware attention range.
CP-ViT has broad applicability for practical deployment: it can be applied to
a wide range of ViT models and achieves superior accuracy with or without
fine-tuning.
Extensive experiments on ImageNet, CIFAR-10, and CIFAR-100 with various
pre-trained models demonstrate the effectiveness and efficiency of CP-ViT. By
progressively pruning 50% of the patches, CP-ViT reduces FLOPs by over 40%
while keeping the accuracy loss within 1%.
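The abstract names two mechanisms, a cumulative score that identifies the informative patches and heads and a pruning ratio adjusted per layer from the attention range, but it does not spell out their formulas. The sketch below is a minimal illustration of the patch-pruning side only, under the assumptions that the score accumulates CLS-to-patch attention across layers and that the per-layer ratio comes from an entropy-style concentration measure; the function names (cascade_prune, layer_pruning_ratio) and both heuristics are illustrative stand-ins, not the paper's definitions, and head pruning is omitted.

```python
# Minimal sketch of cascade patch pruning driven by a cumulative attention score.
# The scoring rule and ratio heuristic below are assumptions for illustration only.
import numpy as np

def layer_pruning_ratio(attn: np.ndarray, target: float = 0.5) -> float:
    """Assumed layer-aware heuristic: prune more aggressively when this layer's
    attention is concentrated on a few patches (a narrow attention range)."""
    cls_attn = attn[:, 0, 1:].mean(axis=0)                  # mean CLS-to-patch attention
    cls_attn = cls_attn / cls_attn.sum()
    entropy = -(cls_attn * np.log(cls_attn + 1e-9)).sum()
    concentration = 1.0 - entropy / np.log(len(cls_attn))   # 0 = uniform, 1 = peaked
    return target * concentration                           # prune at most `target` per layer

def cascade_prune(attn_per_layer, keep_idx):
    """Progressively drop low-score patches layer by layer.

    attn_per_layer: list of (heads, tokens, tokens) attention maps, token 0 = CLS.
    keep_idx:       indices (into the original patch set) of patches still alive.
    """
    keep_idx = np.asarray(keep_idx)
    cumulative = np.zeros(len(keep_idx))
    for attn in attn_per_layer:
        cls_attn = attn[:, 0, 1:].mean(axis=0)        # CLS attention to every original patch
        cumulative += cls_attn[keep_idx]              # accumulate score of surviving patches
        ratio = layer_pruning_ratio(attn)
        n_keep = max(1, int(round(len(keep_idx) * (1.0 - ratio))))
        order = np.argsort(cumulative)[::-1][:n_keep] # keep the highest cumulative scores
        keep_idx, cumulative = keep_idx[order], cumulative[order]
    return keep_idx

# Toy usage: 12 layers, 3 heads, 1 CLS token + 196 patches with random attention.
rng = np.random.default_rng(0)
attn_maps = [rng.dirichlet(np.ones(197), size=(3, 197)) for _ in range(12)]
print(len(cascade_prune(attn_maps, keep_idx=np.arange(196))), "of 196 patches kept")
```

In a full implementation, the pruned tokens would be removed from the sequence before the next layer, so subsequent attention and MLP blocks operate on shorter sequences; that progressive shrinking is where the reported FLOPs savings come from.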
Related papers
- TReX- Reusing Vision Transformer's Attention for Efficient Xbar-based Computing [12.583079680322156]
We propose TReX, an attention-reuse-driven ViT optimization framework.
We find that TReX achieves 2.3x (2.19x) EDAP reduction and 1.86x (1.79x) TOPS/mm2 improvement with a 1% accuracy drop for the DeiT-S (LV-ViT-S) ViT models.
On NLP tasks such as CoLA, TReX leads to 2% higher non-ideal accuracy compared to baseline at 1.6x lower EDAP.
arXiv Detail & Related papers (2024-08-22T21:51:38Z)
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications.
ViTs are ill-suited for private inference using secure multi-party protocols, due to the large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
arXiv Detail & Related papers (2023-10-06T21:45:05Z)
- DeViT: Decomposing Vision Transformers for Collaborative Inference in Edge Devices [42.89175608336226]
Vision transformer (ViT) has achieved state-of-the-art performance on multiple computer vision benchmarks.
ViT models suffer from vast numbers of parameters and high computation cost, making deployment on resource-constrained edge devices difficult.
We propose a collaborative inference framework termed DeViT to facilitate edge deployment by decomposing large ViTs.
arXiv Detail & Related papers (2023-09-10T12:26:17Z)
- Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design [84.34416126115732]
Scaling laws have been recently employed to derive compute-optimal model size (number of parameters) for a given compute duration.
We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully implement this in vision transformers.
Our shape-optimized vision transformer, SoViT, achieves results competitive with models that exceed twice its size, despite being pre-trained with an equivalent amount of compute.
arXiv Detail & Related papers (2023-05-22T13:39:28Z)
- GOHSP: A Unified Framework of Graph and Optimization-based Heterogeneous Structured Pruning for Vision Transformer [76.2625311630021]
Vision transformers (ViTs) have shown very impressive empirical performance in various computer vision tasks.
To mitigate their heavy computational cost, structured pruning is a promising solution that compresses model size and enables practical efficiency.
We propose GOHSP, a unified framework of Graph and Optimization-based Structured Pruning for ViT models.
arXiv Detail & Related papers (2023-01-13T00:40:24Z)
- CAP: Correlation-Aware Pruning for Highly-Accurate Sparse Vision Models [22.055655390093722]
Correlation Aware Pruner (CAP) significantly pushes the compressibility limits for state-of-the-art architectures.
The new, theoretically justified pruner handles complex weight correlations accurately and efficiently during the pruning process itself.
We show for the first time that extremely accurate large vision models, trained via self-supervised techniques, can also be pruned to moderate sparsities with negligible accuracy loss.
arXiv Detail & Related papers (2022-10-14T12:19:09Z)
- Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution-guided distillation scheme for fully quantized vision transformers (Q-ViT).
Our method achieves much better performance than the prior arts.
arXiv Detail & Related papers (2022-10-13T04:00:29Z)
- Coarse-to-Fine Vision Transformer [83.45020063642235]
We propose a coarse-to-fine vision transformer (CF-ViT) to relieve computational burden while retaining performance.
Our proposed CF-ViT is motivated by two important observations in modern ViT models.
Our CF-ViT reduces the FLOPs of LV-ViT by 53% and achieves a 2.01x throughput improvement.
arXiv Detail & Related papers (2022-03-08T02:57:49Z)
- AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
- A Unified Pruning Framework for Vision Transformers [40.7622551128182]
Vision transformer (ViT) and its variants have achieved promising performance in various computer vision tasks.
We propose a unified framework for structural pruning of ViTs and their variants, namely UP-ViTs.
Our method focuses on pruning all ViT components while maintaining the consistency of the model structure.
arXiv Detail & Related papers (2021-11-30T05:01:02Z)