USDC: Unified Static and Dynamic Compression for Visual Transformer
- URL: http://arxiv.org/abs/2310.11117v1
- Date: Tue, 17 Oct 2023 10:04:47 GMT
- Title: USDC: Unified Static and Dynamic Compression for Visual Transformer
- Authors: Huan Yuan, Chao Liao, Jianchao Tan, Peng Yao, Jiyuan Jia, Bin Chen,
Chengru Song, Di Zhang
- Abstract summary: Visual Transformers have achieved great success in almost all vision tasks, such as classification, detection, and so on.
However, the model complexity and the inference speed of the visual transformers hinder their deployments in industrial products.
Various model compression techniques focus on directly compressing visual transformers into a smaller model while maintaining performance; however, performance drops dramatically when the compression ratio is large.
Several dynamic network techniques have also been applied to dynamically compress the visual transformers to obtain input-adaptive efficient sub-structures during the inference stage, which can achieve a better trade-off between the compression ratio and the model performance.
- Score: 17.10536016262485
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual Transformers have achieved great success in almost all vision
tasks, such as classification and detection. However, their model complexity and
inference speed hinder deployment in industrial products. Various model
compression techniques focus on directly compressing a visual transformer into a
smaller one while maintaining performance; however, performance drops
dramatically when the compression ratio is large. Furthermore, several dynamic
network techniques have been applied to dynamically compress visual transformers
into input-adaptive efficient sub-structures during the inference stage, which
can achieve a better trade-off between the compression ratio and the model
performance. However, the peak memory of dynamic models is not reduced in
practical deployment, since the whole original visual transformer and the
additional control gating modules must be loaded onto the device together for
inference. To alleviate the disadvantages of these two categories of methods, we
propose to unify static and dynamic compression techniques to obtain an
input-adaptive compressed model, which can better balance the total compression
ratio and the model performance. Moreover, in practical deployment, the batch
sizes of the training and inference stages usually differ, which causes the
inference performance to be worse than the training performance, an issue not
addressed by previous dynamic network papers. We propose a sub-group gates
augmentation technique to solve this performance drop. Extensive experiments
demonstrate the superiority of our method on various baseline visual
transformers such as DeiT and T2T-ViT.
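Code example: the abstract describes combining a statically pruned backbone with
input-adaptive control gating modules. The following is a minimal, hypothetical
PyTorch sketch of that control flow, not the authors' implementation; the names
HeadGate, GatedAttention, and static_keep are illustrative assumptions. A
lightweight gate predicts per-sample keep scores for each attention head, and a
fixed static mask removes heads that were pruned offline.
```python
# Hypothetical sketch (not the authors' code): an input-adaptive head gate
# applied on top of a statically pruned multi-head attention block.
import torch
import torch.nn as nn


class HeadGate(nn.Module):
    """Predicts a soft keep/drop score per attention head from the input."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.proj = nn.Linear(dim, num_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) -> per-sample head scores in [0, 1]
        scores = self.proj(x.mean(dim=1))            # (batch, num_heads)
        return torch.sigmoid(scores)


class GatedAttention(nn.Module):
    """Self-attention whose heads are (a) statically pruned by a fixed mask
    and (b) dynamically gated per input sample."""

    def __init__(self, dim: int, num_heads: int, static_keep: torch.Tensor):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = HeadGate(dim, num_heads)
        # Static 0/1 mask assumed to be learned offline (e.g. magnitude pruning).
        self.register_buffer("static_keep", static_keep)    # (num_heads,)
        self.num_heads = num_heads

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x)                   # (batch, tokens, dim)
        gates = self.gate(x) * self.static_keep       # (batch, num_heads)
        # Simplification: gate the per-head channel slices of the output.
        # A real implementation would gate before the output projection.
        b, t, d = out.shape
        out = out.view(b, t, self.num_heads, d // self.num_heads)
        out = out * gates[:, None, :, None]
        return out.reshape(b, t, d)


if __name__ == "__main__":
    x = torch.randn(2, 197, 192)                      # DeiT-Tiny-like token shape
    static_keep = torch.ones(3)                       # keep all 3 heads statically
    block = GatedAttention(dim=192, num_heads=3, static_keep=static_keep)
    print(block(x).shape)                             # torch.Size([2, 197, 192])
```
In practice the gates and the static mask would be trained jointly with
sparsity or budget losses; this sketch only illustrates how a static mask and
an input-adaptive gate can be combined in one block.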
Related papers
- Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning [63.43972993473501]
Token compression expedites the training and inference of Vision Transformers (ViTs).
However, when applied to downstream tasks, compression degrees are mismatched between training and inference stages.
We propose a model arithmetic framework to decouple the compression degrees between the two stages.
arXiv Detail & Related papers (2024-08-13T10:36:43Z)
- A Survey on Transformer Compression [84.18094368700379]
Transformers play a vital role in the realms of natural language processing (NLP) and computer vision (CV).
Model compression methods reduce the memory and computational cost of Transformers.
This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models.
arXiv Detail & Related papers (2024-02-05T12:16:28Z)
- Progressive Learning with Visual Prompt Tuning for Variable-Rate Image Compression [60.689646881479064]
We propose a progressive learning paradigm for transformer-based variable-rate image compression.
Inspired by visual prompt tuning, we use LPM to extract prompts for input images and hidden features at the encoder side and decoder side, respectively.
Our model outperforms all current variable-rate image compression methods in terms of rate-distortion performance and approaches the state-of-the-art fixed-rate image compression methods trained from scratch.
arXiv Detail & Related papers (2023-11-23T08:29:32Z)
- Modular Transformers: Compressing Transformers into Modularized Layers for Flexible Efficient Inference [83.01121484432801]
We introduce Modular Transformers, a modularized encoder-decoder framework for flexible sequence-to-sequence model compression.
After a single training phase, Modular Transformers can achieve flexible compression ratios from 1.1x to 6x with little to moderate relative performance drop.
arXiv Detail & Related papers (2023-06-04T15:26:28Z)
- Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding [12.030179065286928]
The paper proposes a quantization-aware tensor-compressed training approach to reduce the model size, arithmetic operations, and runtime latency of transformer-based models.
A layer-by-layer distillation is applied to distill a quantized and tensor-compressed student model from a pre-trained transformer.
The performance is demonstrated on two natural language understanding tasks, showing up to a $63\times$ compression ratio, little accuracy loss, and remarkable inference and training speedup.
arXiv Detail & Related papers (2023-06-01T18:32:08Z)
- COMCAT: Towards Efficient Compression and Customization of Attention-Based Vision Models [21.07857091998763]
This paper explores an efficient method for compressing vision transformers to enrich the toolset for obtaining compact attention-based vision models.
For compressing DeiT-small and DeiT-base models on ImageNet, our proposed approach can achieve 0.45% and 0.76% higher top-1 accuracy even with fewer parameters.
arXiv Detail & Related papers (2023-05-26T19:50:00Z)
- Consolidator: Mergeable Adapter with Grouped Connections for Visual Adaptation [53.835365470800916]
We show how to efficiently and effectively transfer knowledge in a vision transformer.
We propose consolidator to modify the pre-trained model with the addition of a small set of tunable parameters.
Our consolidator can reach up to 7.56 better accuracy than full fine-tuning with merely 0.35% of the parameters.
arXiv Detail & Related papers (2023-04-30T23:59:02Z)
- Knowledge Distillation in Vision Transformers: A Critical Review [6.508088032296086]
Vision Transformers (ViTs) have demonstrated impressive performance improvements over Convolutional Neural Networks (CNNs).
Model compression has recently attracted considerable research attention as a potential remedy.
This paper discusses various approaches based upon KD for effective compression of ViT models.
arXiv Detail & Related papers (2023-02-04T06:30:57Z)
- Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks.
Although the network performance is boosted, transformers often require more computational resources.
We propose to excavate useful information from the teacher transformer through the relationship between images and the divided patches.
arXiv Detail & Related papers (2021-07-03T08:28:34Z)