Memory-Efficient Vision Transformers: An Activation-Aware Mixed-Rank
Compression Strategy
- URL: http://arxiv.org/abs/2402.06004v1
- Date: Thu, 8 Feb 2024 19:01:14 GMT
- Title: Memory-Efficient Vision Transformers: An Activation-Aware Mixed-Rank
Compression Strategy
- Authors: Seyedarmin Azizi, Mahdi Nazemi, Massoud Pedram
- Abstract summary: This paper introduces an activation-aware model compression methodology that uses selective low-rank weight tensor approximations of different layers to reduce the parameter count of ViTs.
The presented method significantly reduces the parameter count of DeiT-B by 60% with less than 1% accuracy drop on the ImageNet dataset.
In addition, the presented compression technique can compress large DeiT/ViT models to about the same model size as smaller DeiT/ViT variants while yielding up to a 1.8% accuracy gain.
- Score: 5.699098817569033
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As Vision Transformers (ViTs) increasingly set new benchmarks in computer
vision, their practical deployment on inference engines is often hindered by
their significant memory bandwidth and (on-chip) memory footprint requirements.
This paper addresses this memory limitation by introducing an activation-aware
model compression methodology that uses selective low-rank weight tensor
approximations of different layers to reduce the parameter count of ViTs. The
key idea is to decompose the weight tensors into a sum of two
parameter-efficient tensors while minimizing the error between the product of
the input activations with the original weight tensor and the product of the
input activations with the approximate tensor sum. This approximation is
further refined by adopting an efficient layer-wise error compensation
technique that uses the gradient of the layer's output loss. The combination of
these techniques achieves excellent results while avoiding being trapped in a
shallow local minimum early in the optimization process and striking a good
balance between model compression and output accuracy. Notably, the
presented method significantly reduces the parameter count of DeiT-B by 60%
with less than 1% accuracy drop on the ImageNet dataset, overcoming the usual
accuracy degradation seen in low-rank approximations. In addition, the
presented compression technique can compress large DeiT/ViT models to
about the same model size as smaller DeiT/ViT variants while yielding up to a
1.8% accuracy gain. These results highlight the efficacy of our approach,
presenting a viable solution for embedding ViTs in memory-constrained
environments without compromising their performance.
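To make the key idea concrete, the sketch below shows a generic activation-aware low-rank factorization: given sampled input activations X and a weight W, it finds rank-r factors A and B minimizing ||XW - XAB||_F via a whitened truncated SVD. This is only an illustration of the activation-aware objective, not the paper's actual method (which uses a sum of two parameter-efficient tensors plus gradient-based error compensation); the function name, shapes, and regularization constant are assumptions.

    import numpy as np

    def activation_aware_lowrank(W, X, rank, eps=1e-6):
        """Minimize ||X W - X A B||_F over rank-`rank` factors A, B.

        W: (d_in, d_out) layer weight; X: (n, d_in) sampled input activations.
        Illustrative sketch only -- not the paper's two-tensor decomposition.
        """
        # Whitening: with G = X^T X = R^T R, the objective equals ||R W - R A B||_F,
        # so the optimum is the rank-r truncated SVD of R W mapped back through R^{-1}.
        G = X.T @ X + eps * np.eye(X.shape[1])           # regularized Gram matrix
        R = np.linalg.cholesky(G).T                      # upper-triangular, G = R^T R
        U, s, Vt = np.linalg.svd(R @ W, full_matrices=False)
        A = np.linalg.solve(R, U[:, :rank] * s[:rank])   # (d_in, rank)
        B = Vt[:rank, :]                                 # (rank, d_out)
        return A, B

    # Toy usage: compress a 768x768 projection to rank 128 using 2,048 sampled tokens.
    X = np.random.randn(2048, 768)
    W = np.random.randn(768, 768) / np.sqrt(768)
    A, B = activation_aware_lowrank(W, X, rank=128)
    print(np.linalg.norm(X @ W - X @ A @ B) / np.linalg.norm(X @ W))

Compared with a plain SVD of W, the whitening step weights the approximation error by how strongly each input direction is actually excited by the data, which is the sense in which the compression is "activation-aware."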
Related papers
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
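A minimal sketch of this recipe follows, assuming standard PyTorch-style attention projection modules (the helper factorize_linear, the attribute names k_proj/v_proj, and the rank are illustrative assumptions, not LoRC's actual progressive strategy): a key/value projection Linear is replaced by two smaller Linears initialized from a truncated SVD of its weight, so the swap needs no retraining.

    import torch
    import torch.nn as nn

    def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
        # Plug-in replacement: one (out x in) Linear becomes two Linears with
        # rank*(in + out) parameters, initialized from a truncated SVD so the
        # composed map approximates the original weight without retraining.
        U, S, Vh = torch.linalg.svd(layer.weight.data, full_matrices=False)
        down = nn.Linear(layer.in_features, rank, bias=False)
        up = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
        down.weight.data = Vh[:rank, :].clone()                # (rank, in)
        up.weight.data = (U[:, :rank] * S[:rank]).clone()      # (out, rank)
        if layer.bias is not None:
            up.bias.data = layer.bias.data.clone()
        return nn.Sequential(down, up)

    # Hypothetical usage on one attention block's key/value projections:
    # block.attn.k_proj = factorize_linear(block.attn.k_proj, rank=256)
    # block.attn.v_proj = factorize_linear(block.attn.v_proj, rank=256)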
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - Language Models as Zero-shot Lossless Gradient Compressors: Towards
General Neural Parameter Prior Models [66.1595537904019]
Large language models (LLMs) can act as gradient priors in a zero-shot setting.
We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding.
arXiv Detail & Related papers (2024-09-26T13:38:33Z) - LORTSAR: Low-Rank Transformer for Skeleton-based Action Recognition [4.375744277719009]
LORTSAR is applied to two leading Transformer-based models, "Hyperformer" and "STEP-CATFormer".
Our method can reduce the number of model parameters substantially with negligible degradation or even performance increase in recognition accuracy.
This confirms that SVD combined with post-compression fine-tuning can boost model efficiency, paving the way for more sustainable, lightweight, and high-performance technologies in human action recognition.
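The post-compression fine-tuning step is a short, standard training loop at a small learning rate whose only job is to recover the accuracy lost to SVD truncation. The sketch below uses a stand-in model and random data purely for shape; in practice the model would be the SVD-compressed Hyperformer/STEP-CATFormer and the loader a skeleton-action dataset.

    import torch
    import torch.nn as nn

    # Stand-ins: a compressed model would already contain factorized linears,
    # and `loader` would yield (skeleton_features, action_label) batches.
    model = nn.Sequential(nn.Linear(64, 16, bias=False), nn.Linear(16, 10))
    loader = [(torch.randn(32, 64), torch.randint(0, 10, (32,))) for _ in range(8)]

    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)  # small LR: only repair truncation error
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for epoch in range(3):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()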
arXiv Detail & Related papers (2024-07-19T20:19:41Z) - Effective Interplay between Sparsity and Quantization: From Theory to Practice [33.697590845745815]
Sparsity and quantization are two prominent compression methods that have individually demonstrated significant reduction in computational and memory footprints while preserving model accuracy.
We investigate the interaction between these two methods and assess whether their combination impacts final model accuracy.
Our findings extend to the efficient deployment of large models on resource-limited compute platforms and reduce serving costs.
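The interaction is easy to probe numerically: apply magnitude pruning and uniform quantization to the same weight matrix in both orders and compare the reconstruction error. The sketch below is a toy experiment with assumed sparsity and bit-width settings, not the paper's analysis, but it illustrates why the order of composition matters.

    import numpy as np

    def magnitude_prune(W, sparsity):
        # Zero out the `sparsity` fraction of smallest-magnitude weights.
        k = int(sparsity * W.size)
        thresh = np.sort(np.abs(W), axis=None)[k]
        return np.where(np.abs(W) >= thresh, W, 0.0)

    def uniform_quantize(W, bits=8):
        # Symmetric per-tensor uniform quantization followed by dequantization.
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(W).max() / qmax
        return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

    rng = np.random.default_rng(0)
    W = rng.normal(size=(512, 512))
    prune_then_quant = uniform_quantize(magnitude_prune(W, 0.5), bits=4)
    quant_then_prune = magnitude_prune(uniform_quantize(W, bits=4), 0.5)
    print("prune->quantize error:", np.linalg.norm(W - prune_then_quant))
    print("quantize->prune error:", np.linalg.norm(W - quant_then_prune))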
arXiv Detail & Related papers (2024-05-31T15:34:13Z) - Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value (KV) caching is an important technique to accelerate the inference of large language models.
Existing methods often compromise precision or require extra data for calibration.
We introduce DecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv Detail & Related papers (2024-05-21T08:35:10Z) - Error Feedback Can Accurately Compress Preconditioners [43.60787513716217]
Leveraging second-order information about the loss at the scale of deep networks is one of the main lines of approach for improving the performance of current optimizers for deep learning.
Yet, existing approaches for accurate full-matrix preconditioning, such as Full-Matrix Adagrad (GGT) or Matrix-Free Approximate Curvature (M-FAC), suffer from massive storage costs when applied even to small-scale models.
In this paper, we address this issue via a novel and efficient error-feedback technique that can be applied to compress preconditioners by up to two orders of magnitude in practice, without loss of convergence.
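The error-feedback mechanism itself is simple: compress, remember what was thrown away, and add it back before the next compression so the discarded information is not lost permanently. Below is a generic sketch using top-k as a stand-in compressor on an arbitrary vector stream; the paper applies the idea to the gradient history stored by full-matrix preconditioners such as GGT or M-FAC, not to this toy setting.

    import numpy as np

    def topk_compress(x, k):
        # Keep only the k largest-magnitude entries (stand-in lossy compressor).
        out = np.zeros_like(x)
        idx = np.argpartition(np.abs(x), -k)[-k:]
        out[idx] = x[idx]
        return out

    class ErrorFeedback:
        """Feed the compression residual back into the next input, so the
        accumulated error stays bounded instead of compounding over steps."""
        def __init__(self, dim, k):
            self.residual = np.zeros(dim)
            self.k = k

        def compress(self, g):
            corrected = g + self.residual          # re-inject previously dropped mass
            out = topk_compress(corrected, self.k)
            self.residual = corrected - out        # remember what was dropped this time
            return out

    # Toy usage: compress a stream of 1,000-dim vectors down to 50 entries each.
    ef = ErrorFeedback(dim=1000, k=50)
    for _ in range(100):
        g = np.random.randn(1000)
        g_hat = ef.compress(g)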
arXiv Detail & Related papers (2023-06-09T17:58:47Z) - HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer
Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z) - Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution-guided distillation scheme for fully quantized vision transformers (Q-ViT).
Our method achieves a much better performance than the prior arts.
arXiv Detail & Related papers (2022-10-13T04:00:29Z) - Multi-Dimensional Model Compression of Vision Transformer [21.8311401851523]
Vision transformers (ViT) have recently attracted considerable attention, but their huge computational cost remains an issue for practical deployment.
Previous ViT pruning methods tend to prune the model along one dimension solely.
We advocate a multi-dimensional ViT compression paradigm, and propose to harness the redundancy reduction from attention head, neuron and sequence dimensions jointly.
arXiv Detail & Related papers (2021-12-31T19:54:18Z) - Compression-aware Projection with Greedy Dimension Reduction for
Convolutional Neural Network Activations [3.6188659868203388]
We propose a compression-aware projection system to improve the trade-off between classification accuracy and compression ratio.
Our test results show that the proposed methods effectively reduce memory access by 2.91x to 5.97x with negligible accuracy drop on MobileNetV2/ResNet18/VGG16.
arXiv Detail & Related papers (2021-10-17T14:02:02Z) - Highly Efficient Salient Object Detection with 100K Parameters [137.74898755102387]
We propose a flexible convolutional module, namely generalized OctConv (gOctConv), to efficiently utilize both in-stage and cross-stages multi-scale features.
We build an extremely lightweight model, namely CSNet, which achieves comparable performance with about 0.2% of the parameters (100k) of large models on popular salient object detection benchmarks.
arXiv Detail & Related papers (2020-03-12T07:00:46Z)