CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and
Favorable Transferability For ViTs
- URL: http://arxiv.org/abs/2309.15755v1
- Date: Wed, 27 Sep 2023 16:12:07 GMT
- Title: CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and
Favorable Transferability For ViTs
- Authors: Ao Wang, Hui Chen, Zijia Lin, Sicheng Zhao, Jungong Han, Guiguang Ding
- Abstract summary: Vision Transformers (ViTs) have emerged as state-of-the-art models for various vision tasks.
We propose a joint compression method for ViTs that offers both high accuracy and fast inference speed.
Our proposed method can achieve state-of-the-art performance across various ViTs.
- Score: 79.54107547233625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) have recently emerged as state-of-the-art models for various vision tasks, but their heavy computation costs remain daunting for resource-limited devices. Consequently, researchers have devoted considerable effort to compressing redundant information in ViTs for acceleration. However, existing methods generally either sparsely drop redundant image tokens via token pruning or coarsely remove channels via channel pruning, leading to a sub-optimal balance between model performance and inference speed. They also transfer poorly to downstream vision tasks that require the spatial structure of images, such as semantic segmentation. To tackle these issues, we propose CAIT, a joint compression method for ViTs that offers both high accuracy and fast inference speed while maintaining favorable transferability to downstream tasks. Specifically, we introduce an asymmetric token merging (ATME) strategy that effectively integrates neighboring tokens, compressing redundant token information while preserving the spatial structure of images. We further employ a consistent dynamic channel pruning (CDCP) strategy to dynamically prune unimportant channels in ViTs. Thanks to CDCP, insignificant channels in the multi-head self-attention modules of ViTs can be pruned uniformly, greatly enhancing model compression. Extensive experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance across various ViTs. For example, our pruned DeiT-Tiny and DeiT-Small achieve speedups of 1.7$\times$ and 1.9$\times$, respectively, without accuracy drops on ImageNet. On the ADE20k segmentation dataset, our method achieves up to a 1.31$\times$ speedup with comparable mIoU. Our code will be publicly available.
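As a rough illustration of the token-merging half of this pipeline, the sketch below fuses pairs of neighboring tokens along a single spatial axis so the token grid stays rectangular. The module name `AxisTokenMerge`, the linear fusion layer, and the pairwise merging scheme are illustrative assumptions, not the paper's actual ATME design, and the CDCP channel-pruning step is omitted entirely.

```python
# Minimal sketch (PyTorch), not the authors' code: merge adjacent tokens along
# one spatial axis so the remaining tokens still form a rectangular grid that
# dense downstream tasks (e.g. segmentation) can reshape into a feature map.
import torch
import torch.nn as nn

class AxisTokenMerge(nn.Module):
    """Fuse pairs of neighboring tokens along H or W with a small linear layer."""
    def __init__(self, dim: int, axis: str = "w"):
        super().__init__()
        assert axis in ("h", "w")
        self.axis = axis
        self.proj = nn.Linear(2 * dim, dim)   # two neighbors in, one merged token out

    def forward(self, x: torch.Tensor, h: int, w: int):
        # x: (B, H*W, C) patch tokens in row-major order; a CLS token, if any,
        # would be split off before calling this and re-attached afterwards.
        b, n, c = x.shape
        assert n == h * w
        x = x.view(b, h, w, c)
        if self.axis == "w":                              # merge left/right neighbors
            x = x.view(b, h, w // 2, 2 * c)
            h_out, w_out = h, w // 2
        else:                                             # merge top/bottom neighbors
            x = x.permute(0, 2, 1, 3).reshape(b, w, h // 2, 2 * c).permute(0, 2, 1, 3)
            h_out, w_out = h // 2, w
        x = self.proj(x).reshape(b, h_out * w_out, c)
        return x, h_out, w_out

# e.g. a 14x14 token grid becomes 14x7 (196 -> 98 tokens) and stays spatial
tokens = torch.randn(2, 196, 192)
merged, h, w = AxisTokenMerge(192, axis="w")(tokens, 14, 14)
print(merged.shape, h, w)   # torch.Size([2, 98, 192]) 14 7
```

Alternating the horizontal and vertical variants across stages would halve the token count each time while keeping a well-defined 2D layout, which is what makes transfer to segmentation-style tasks straightforward.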
Related papers
- Accelerating Vision Diffusion Transformers with Skip Branches [46.19946204953147]
Diffusion Transformers (DiT) are an emerging image and video generation model architecture.
DiT's practical deployment is constrained by computational complexity and redundancy in the sequential denoising process.
We introduce Skip-DiT, which augments a standard DiT with skip branches to enhance feature smoothness.
We also introduce Skip-Cache, which uses the skip branches to cache DiT features across timesteps at inference time.
arXiv Detail & Related papers (2024-11-26T17:28:10Z)
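A toy sketch of the feature-caching idea summarized in the Skip-DiT entry above: deep blocks are recomputed only every few denoising steps, and their cached output is fused with fresh shallow features through a skip branch. The class `ToyDiT`, the block split, the fusion layer, and the refresh schedule are hypothetical stand-ins, not the Skip-DiT/Skip-Cache implementation.

```python
# Hypothetical sketch: recompute the deep blocks only every `refresh` steps
# and otherwise reuse their cached output via a skip connection.
import torch
import torch.nn as nn

class ToyDiT(nn.Module):
    def __init__(self, dim=256, shallow=2, deep=10):
        super().__init__()
        mk = lambda n: nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, 4, batch_first=True) for _ in range(n)]
        )
        self.shallow, self.deep = mk(shallow), mk(deep)
        self.skip_fuse = nn.Linear(2 * dim, dim)    # skip branch: shallow + cached deep

    def forward(self, x, cache=None, refresh=False):
        for blk in self.shallow:
            x = blk(x)
        if cache is None or refresh:
            h = x
            for blk in self.deep:
                h = blk(h)
            cache = h                               # cache deep features for later steps
        # fuse current shallow features with (possibly stale) cached deep features
        out = self.skip_fuse(torch.cat([x, cache], dim=-1))
        return out, cache

# toy sampling loop: the deep blocks run only on every 4th timestep
model, cache = ToyDiT(), None
x = torch.randn(1, 64, 256)
for t in reversed(range(20)):
    eps, cache = model(x, cache, refresh=(t % 4 == 0))
    x = x - 0.05 * eps                              # stand-in for a real sampler update
```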
- Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference [14.030836300221756]
Sparse-Tuning is a novel PEFT method that accounts for the information redundancy in images and videos.
Sparse-Tuning minimizes the quantity of tokens processed at each layer, leading to a quadratic reduction in computational and memory overhead.
Our results show that Sparse-Tuning reduces GFLOPs to 62%-70% of the original ViT-B while achieving state-of-the-art performance.
arXiv Detail & Related papers (2024-05-23T15:34:53Z)
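Sparse-Tuning's actual token-preservation strategy is not reproduced here; the snippet below only sketches the generic token-reduction step such methods build on, keeping the patch tokens that receive the most CLS attention so later layers see a shorter sequence. The function name and the CLS-attention scoring are assumptions for illustration.

```python
# Illustrative token-reduction step: keep the patch tokens the CLS token attends
# to most, so subsequent layers process a shorter sequence and attention cost
# shrinks roughly quadratically with the kept ratio.
import torch

def keep_topk_tokens(tokens: torch.Tensor, cls_attn: torch.Tensor, keep_ratio: float):
    """tokens: (B, 1+N, C) with CLS first; cls_attn: (B, N) CLS->patch attention."""
    b, n_plus_1, c = tokens.shape
    k = max(1, int((n_plus_1 - 1) * keep_ratio))
    idx = cls_attn.topk(k, dim=1).indices                       # (B, k) kept patch ids
    idx = idx.unsqueeze(-1).expand(-1, -1, c)                   # (B, k, C)
    kept = torch.gather(tokens[:, 1:], dim=1, index=idx)        # gather kept patches
    return torch.cat([tokens[:, :1], kept], dim=1)              # re-attach CLS token

# e.g. 1 CLS + 196 patches -> 1 CLS + 137 patches with keep_ratio=0.7
x = torch.randn(2, 197, 768)
attn = torch.rand(2, 196)
print(keep_topk_tokens(x, attn, 0.7).shape)   # torch.Size([2, 138, 768])
```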
- Plug n' Play: Channel Shuffle Module for Enhancing Tiny Vision Transformers [15.108494142240993]
Vision Transformers (ViTs) have demonstrated remarkable performance in various computer vision tasks.
High computational complexity hinders ViTs' applicability on devices with limited memory and computing resources.
We propose a novel channel shuffle module to improve tiny-size ViTs.
arXiv Detail & Related papers (2023-10-09T11:56:35Z)
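For reference, the standard channel-shuffle permutation (in the ShuffleNet spirit) looks like the sketch below when applied to ViT token features; the paper's module for tiny ViTs may wrap this differently, so treat it as an illustration of the core operation rather than their design.

```python
# Standard channel shuffle: split channels into groups, then interleave them so
# information mixes across groups in the following grouped/projected layers.
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """x: (B, N, C) token features; interleave channels across `groups` groups."""
    b, n, c = x.shape
    assert c % groups == 0
    x = x.view(b, n, groups, c // groups)   # split channels into groups
    x = x.transpose(2, 3).reshape(b, n, c)  # interleave: (group i, slot j) -> j*groups + i
    return x

x = torch.arange(8.0).view(1, 1, 8)         # channels 0..7, groups=2
print(channel_shuffle(x, 2))                # tensor([[[0., 4., 1., 5., 2., 6., 3., 7.]]])
```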
- DiffRate: Differentiable Compression Rate for Efficient Vision Transformers [98.33906104846386]
Token compression aims to speed up large-scale vision transformers (e.g. ViTs) by pruning (dropping) or merging tokens.
DiffRate is a novel token compression method with several appealing properties that prior arts lack.
arXiv Detail & Related papers (2023-05-29T10:15:19Z)
- Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets [91.25055890980084]
An extreme performance gap still remains between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets.
We propose the Dynamic Hybrid Vision Transformer (DHVT) as a solution that enhances the two inductive biases.
DHVT achieves state-of-the-art performance with lightweight models: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters.
arXiv Detail & Related papers (2022-10-12T06:54:39Z)
- Unified Visual Transformer Compression [102.26265546836329]
This paper proposes a unified ViT compression framework that seamlessly assembles three effective techniques: pruning, layer skipping, and knowledge distillation.
We formulate a budget-constrained, end-to-end optimization framework, targeting jointly learning model weights, layer-wise pruning ratios/masks, and skip configurations.
Experiments are conducted with several ViT variants, e.g. DeiT and T2T-ViT backbones on the ImageNet dataset, and our approach consistently outperforms recent competitors.
arXiv Detail & Related papers (2022-03-15T20:38:22Z)
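A rough sketch of what a budget-constrained joint objective in this spirit can look like: learnable gates act as soft pruning masks, a FLOPs proxy built from them is pushed under a target budget, and a distillation term keeps the compressed student close to the dense teacher. The function, its arguments, and the penalty form are assumptions for illustration, not the paper's formulation.

```python
# Sketch of a budget-constrained training objective with soft pruning gates.
import torch
import torch.nn.functional as F

def compression_loss(student_logits, teacher_logits, labels,
                     gates, layer_flops, budget, alpha=0.5, beta=1.0):
    # gates: (L,) unconstrained parameters, one per prunable unit
    # layer_flops: (L,) FLOPs contributed by each unit when fully kept
    keep_prob = torch.sigmoid(gates)                        # soft keep mask in (0, 1)
    expected_flops = (keep_prob * layer_flops).sum()

    ce = F.cross_entropy(student_logits, labels)            # task loss
    kd = F.kl_div(F.log_softmax(student_logits, dim=-1),    # distillation from dense teacher
                  F.softmax(teacher_logits, dim=-1),
                  reduction="batchmean")
    budget_penalty = F.relu(expected_flops / budget - 1.0)  # zero once under budget
    return ce + alpha * kd + beta * budget_penalty
```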
- A Unified Pruning Framework for Vision Transformers [40.7622551128182]
Vision transformers (ViTs) and their variants have achieved promising performance in various computer vision tasks.
We propose a unified framework for structural pruning of ViTs and their variants, namely UP-ViTs.
Our method focuses on pruning all ViT components while maintaining the consistency of the model structure.
arXiv Detail & Related papers (2021-11-30T05:01:02Z)
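The consistency concern is that if channels are scored and removed independently, different attention heads end up with different widths and the fused QKV and output-projection matrices no longer line up. The sketch below (illustrative only, not UP-ViTs' or CDCP's actual procedure) scores the fused QKV weights and keeps the same number of channels in every head.

```python
# Illustrative structure-consistent channel scoring for MHSA pruning: keep an
# equal number of channels per head so the pruned attention stays well-formed.
import torch

def per_head_keep_indices(qkv_weight: torch.Tensor, num_heads: int, keep_ratio: float):
    """qkv_weight: (3*dim, dim) fused QKV projection, rows ordered Q | K | V."""
    dim = qkv_weight.shape[1]
    head_dim = dim // num_heads
    keep = max(1, int(head_dim * keep_ratio))
    # importance of each output channel = L1 norm of its weight row, summed over Q/K/V
    score = qkv_weight.abs().sum(dim=1).view(3, num_heads, head_dim).sum(dim=0)  # (H, head_dim)
    idx = score.topk(keep, dim=1).indices.sort(dim=1).values                     # (H, keep)
    # convert to global channel ids per head, usable to slice Q/K/V rows and proj columns
    return idx + torch.arange(num_heads).unsqueeze(1) * head_dim

w = torch.randn(3 * 384, 384)                   # e.g. a DeiT-Small block: dim=384, 6 heads
print(per_head_keep_indices(w, 6, 0.5).shape)   # torch.Size([6, 32])
```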
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and outperform convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z)
- Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
This brings a great benefit: depth, width, resolution, and patch size can be scaled without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z)
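A minimal sketch of the hierarchical token pooling idea from the HVT entry above (illustrative, not HVT's code): between transformer stages, downsample along the token dimension so later stages attend over shorter sequences; the class name and the choice of max pooling are assumptions.

```python
# Pool over the token dimension between stages to shrink the sequence length.
import torch
import torch.nn as nn

class TokenPool(nn.Module):
    def __init__(self, kernel: int = 2):
        super().__init__()
        self.pool = nn.MaxPool1d(kernel_size=kernel, stride=kernel)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) -> pool over N -> (B, N // kernel, C)
        return self.pool(x.transpose(1, 2)).transpose(1, 2)

x = torch.randn(2, 196, 384)
print(TokenPool()(x).shape)   # torch.Size([2, 98, 384])
```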