A Unified Pruning Framework for Vision Transformers
- URL: http://arxiv.org/abs/2111.15127v1
- Date: Tue, 30 Nov 2021 05:01:02 GMT
- Title: A Unified Pruning Framework for Vision Transformers
- Authors: Hao Yu, Jianxin Wu
- Abstract summary: Vision transformer (ViT) and its variants have achieved promising performance in various computer vision tasks.
We propose a unified framework for structural pruning of ViTs and their variants, namely UP-ViTs.
Our method focuses on pruning all ViT components while maintaining the consistency of the model structure.
- Score: 40.7622551128182
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, vision transformer (ViT) and its variants have achieved promising
performance in various computer vision tasks. Yet the high computational costs
and training data requirements of ViTs limit their application in
resource-constrained settings. Model compression is an effective way to speed
up deep learning models, but compressing ViTs has been less explored. Many
previous works concentrate on reducing the number of tokens. However, this line
of attack breaks the spatial structure of ViTs and is hard to generalize to
downstream tasks. In this paper, we design a unified framework for structural
pruning of ViTs and their variants, namely UP-ViTs. Our method focuses on
pruning all ViT components while maintaining the consistency of the model
structure. Extensive experimental results show that our method achieves high
accuracy on compressed ViTs and variants, e.g., UP-DeiT-T achieves 75.79%
accuracy on ImageNet, outperforming the vanilla DeiT-T by 3.59% at the same
computational cost. UP-PVTv2-B0 improves the accuracy of PVTv2-B0 by 4.83% on
ImageNet classification. Meanwhile, UP-ViTs maintains the consistency of the
token representation and gains consistent improvements on object detection
tasks.
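
The abstract describes structural pruning of every ViT component while keeping the model structure consistent, but no code is given here. The following is only a minimal PyTorch sketch of what structure-consistent pruning can look like for the simplest component, an MLP block: the helper names, the weight-norm importance score, and the DeiT-T-like dimensions in the demo are illustrative assumptions, not UP-ViT's actual saliency criterion or implementation.

```python
# Minimal sketch (not the UP-ViT implementation): structurally prune the hidden
# dimension of one ViT MLP block while keeping the embedding dimension intact,
# so the pruned block still plugs into the residual stream unchanged.
import torch
import torch.nn as nn


class ViTMLP(nn.Module):
    """A standard ViT feed-forward block: Linear(d, h) -> GELU -> Linear(h, d)."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)))


def prune_mlp_hidden(mlp: ViTMLP, keep_ratio: float) -> ViTMLP:
    """Keep the top-k hidden units of fc1/fc2 (toy criterion: fc1 row L2 norm).

    The embedding dimension is untouched, so the LayerNorms, attention blocks,
    and residual additions around this MLP do not need to change.
    """
    hidden = mlp.fc1.out_features
    k = max(1, int(hidden * keep_ratio))
    # Toy importance score; UP-ViT uses its own saliency criterion.
    scores = mlp.fc1.weight.norm(dim=1)                    # (hidden,)
    keep = torch.topk(scores, k).indices.sort().values

    pruned = ViTMLP(mlp.fc1.in_features, k)
    with torch.no_grad():
        pruned.fc1.weight.copy_(mlp.fc1.weight[keep])      # drop hidden rows
        pruned.fc1.bias.copy_(mlp.fc1.bias[keep])
        pruned.fc2.weight.copy_(mlp.fc2.weight[:, keep])   # drop matching columns
        pruned.fc2.bias.copy_(mlp.fc2.bias)
    return pruned


if __name__ == "__main__":
    x = torch.randn(2, 197, 192)                           # (batch, tokens, dim), DeiT-T-like
    mlp = ViTMLP(dim=192, hidden=768)
    small = prune_mlp_hidden(mlp, keep_ratio=0.5)
    print(small(x).shape)                                  # torch.Size([2, 197, 192])
```

The sketch covers only the MLP; per the abstract, UP-ViTs prunes all components (attention, embeddings, and variant-specific modules) so that the pruned model keeps a consistent structure and token representation.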
Related papers
- Denoising Vision Transformers [43.03068202384091]
We propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT).
In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis.
In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision.
arXiv Detail & Related papers (2024-01-05T18:59:52Z)
- DeViT: Decomposing Vision Transformers for Collaborative Inference in Edge Devices [42.89175608336226]
Vision transformer (ViT) has achieved state-of-the-art performance on multiple computer vision benchmarks.
However, ViT models have vast numbers of parameters and high computation costs, which makes deployment on resource-constrained edge devices difficult.
We propose a collaborative inference framework termed DeViT to facilitate edge deployment by decomposing large ViTs.
arXiv Detail & Related papers (2023-09-10T12:26:17Z)
- Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution-guided distillation scheme for fully quantized vision transformers (Q-ViT).
Our method achieves much better performance than prior arts.
arXiv Detail & Related papers (2022-10-13T04:00:29Z)
- Improving Vision Transformers by Revisiting High-frequency Components [106.7140968644414]
We show that Vision Transformer (ViT) models are less effective in capturing the high-frequency components of images than CNN models.
To compensate, we propose HAT, which directly augments high-frequency components of images via adversarial training.
We show that HAT can consistently boost the performance of various ViT models.
arXiv Detail & Related papers (2022-04-03T05:16:51Z)
- Unified Visual Transformer Compression [102.26265546836329]
This paper proposes a unified ViT compression framework that seamlessly assembles three effective techniques: pruning, layer skipping, and knowledge distillation.
We formulate a budget-constrained, end-to-end optimization framework, targeting jointly learning model weights, layer-wise pruning ratios/masks, and skip configurations.
Experiments are conducted with several ViT variants, e.g., DeiT and T2T-ViT backbones on the ImageNet dataset, and our approach consistently outperforms recent competitors.
arXiv Detail & Related papers (2022-03-15T20:38:22Z)
- Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z)
- SPViT: Enabling Faster Vision Transformers via Soft Token Pruning [38.10083471492964]
Pruning, a traditional model compression paradigm for hardware efficiency, has been widely applied in various DNN structures.
We propose a computation-aware soft pruning framework, which can be set up on vanilla Transformers of both flat and CNN-type structures.
Our framework significantly reduces the computation cost of ViTs while maintaining comparable performance on image classification (a generic token-pruning sketch appears after this list).
arXiv Detail & Related papers (2021-12-27T20:15:25Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
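
For contrast with the structural pruning sketch above, the token-reduction line of work that the UP-ViTs abstract argues against (represented in the list above by SPViT) shrinks the number of tokens rather than the channel dimensions. Below is a generic, hedged illustration of attention-score-based token pruning; it is not SPViT's soft-pruning algorithm, and the function name, the [CLS]-attention importance proxy, and the demo shapes are assumptions made only for this sketch.

```python
# Generic illustration of token reduction (NOT SPViT's soft-pruning algorithm):
# keep only the patch tokens that the [CLS] token attends to most. Dropping
# tokens removes spatial positions, which is why the UP-ViTs abstract argues
# this approach is hard to transfer to dense downstream tasks such as detection.
import torch


def prune_tokens(tokens: torch.Tensor, attn: torch.Tensor, keep_ratio: float):
    """tokens: (B, 1 + N, D) with [CLS] first; attn: (B, heads, 1 + N, 1 + N)."""
    num_patches = tokens.shape[1] - 1
    k = max(1, int(num_patches * keep_ratio))

    # Mean over heads of the [CLS] -> patch attention, a common importance proxy.
    cls_attn = attn[:, :, 0, 1:].mean(dim=1)                    # (B, N)
    keep = cls_attn.topk(k, dim=1).indices.sort(dim=1).values   # (B, k)

    cls_tok = tokens[:, :1]                                     # keep [CLS] untouched
    idx = keep.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
    kept_patches = torch.gather(tokens[:, 1:], dim=1, index=idx)
    return torch.cat([cls_tok, kept_patches], dim=1)            # (B, 1 + k, D)


if __name__ == "__main__":
    B, N, D, H = 2, 196, 192, 3
    tokens = torch.randn(B, 1 + N, D)
    attn = torch.softmax(torch.randn(B, H, 1 + N, 1 + N), dim=-1)
    print(prune_tokens(tokens, attn, keep_ratio=0.5).shape)     # (2, 99, 192)
```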