Related papers: VTrans: Accelerating Transformer Compression with Variational Information Bottleneck based Pruning

VTrans: Accelerating Transformer Compression with Variational Information Bottleneck based Pruning

URL: http://arxiv.org/abs/2406.05276v2
Date: Tue, 11 Jun 2024 23:11:43 GMT
Title: VTrans: Accelerating Transformer Compression with Variational Information Bottleneck based Pruning
Authors: Oshin Dutta, Ritvik Gupta, Sumeet Agarwal,
Abstract summary: We propose VTrans, an iterative pruning framework guided by the Variational Information Bottleneck (VIB) principle. Our method compresses all structural components, including embeddings, attention heads, and layers using VIB-trained masks. Notably, our method achieves upto 70% more compression than prior state-of-the-art approaches.
Score: 3.256420760342604
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In recent years, there has been a growing emphasis on compressing large pre-trained transformer models for resource-constrained devices. However, traditional pruning methods often leave the embedding layer untouched, leading to model over-parameterization. Additionally, they require extensive compression time with large datasets to maintain performance in pruned models. To address these challenges, we propose VTrans, an iterative pruning framework guided by the Variational Information Bottleneck (VIB) principle. Our method compresses all structural components, including embeddings, attention heads, and layers using VIB-trained masks. This approach retains only essential weights in each layer, ensuring compliance with specified model size or computational constraints. Notably, our method achieves upto 70% more compression than prior state-of-the-art approaches, both task-agnostic and task-specific. We further propose faster variants of our method: Fast-VTrans utilizing only 3% of the data and Faster-VTrans, a time efficient alternative that involves exclusive finetuning of VIB masks, accelerating compression by upto 25 times with minimal performance loss compared to previous methods. Extensive experiments on BERT, ROBERTa, and GPT-2 models substantiate the efficacy of our method. Moreover, our method demonstrates scalability in compressing large models such as LLaMA-2-7B, achieving superior performance compared to previous pruning methods. Additionally, we use attention-based probing to qualitatively assess model redundancy and interpret the efficiency of our approach. Notably, our method considers heads with high attention to special and current tokens in un-pruned model as foremost candidates for pruning while retained heads are observed to attend more to task-critical keywords.

Related papers

Towards Efficient VLMs: Information-Theoretic Driven Compression via Adaptive Structural Pruning [38.7577454874686]
InfoPrune is an information-theoretic framework for adaptive structural compression of vision-language models.<n>Experiments on VQAv2, TextVQA, and GQA demonstrate that InfoPrune achieves up to 3.2x FLOP reduction and 1.8x acceleration with negligible performance degradation.
arXiv Detail & Related papers (2025-11-24T03:37:14Z)
FastFit: Accelerating Multi-Reference Virtual Try-On via Cacheable Diffusion Models [59.8871829077739]
FastFit is a high-speed multi-reference virtual try-on framework based on a novel cacheable diffusion architecture.<n>Our model fully decouples reference feature encoding from the denoising process with negligible parameter overhead.<n>This allows reference features to be computed only once and losslessly reused across all steps.
arXiv Detail & Related papers (2025-08-28T09:25:52Z)
Choose Your Model Size: Any Compression by a Single Gradient Descent [9.074689052563878]
We present Any Compression via Iterative Pruning (ACIP) ACIP is an algorithmic approach to determine a compression-performance trade-off from a single gradient descent run. We show that ACIP seamlessly complements common quantization-based compression techniques.
arXiv Detail & Related papers (2025-02-03T18:40:58Z)
You Only Prune Once: Designing Calibration-Free Model Compression With Policy Learning [20.62274005080048]
PruneNet is a novel model compression method that reformulates model pruning as a policy learning process. It can compress the LLaMA-2-7B model in just 15 minutes, achieving over 80% retention of its zero-shot performance. On complex multitask language understanding tasks, PruneNet demonstrates its robustness by preserving up to 80% performance of the original model.
arXiv Detail & Related papers (2025-01-25T18:26:39Z)
Singular Value Scaling: Efficient Generative Model Compression via Pruned Weights Refinement [9.454314879815337]
generative models often exhibit dominant singular vectors, hindering fine-tuning efficiency and leading to suboptimal performance. We introduce Singular Value Scaling (SVS), a versatile technique for refining pruned weights, applicable to both model types. SVS improves compression performance across model types without additional training costs.
arXiv Detail & Related papers (2024-12-23T08:40:08Z)
PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [65.36715026409873]
Key-value (KV) cache, necessitated by the lengthy input and output sequences, notably contributes to the high inference cost. We present PrefixKV, which reframes the challenge of determining KV cache sizes for all layers into the task of searching for the optimal global prefix configuration. Our method achieves the state-of-the-art performance compared with others.
arXiv Detail & Related papers (2024-12-04T15:48:59Z)
Visual Fourier Prompt Tuning [63.66866445034855]
We propose the Visual Fourier Prompt Tuning (VFPT) method as a general and effective solution for adapting large-scale transformer-based models. Our approach incorporates the Fast Fourier Transform into prompt embeddings and harmoniously considers both spatial and frequency domain information. Our results demonstrate that our approach outperforms current state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2024-11-02T18:18:35Z)
MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection [14.073722038551125]
KV cache has become a de facto technique for the inference of large language models. This paper uses low-rank projection matrices to transform the cache features into spaces with reduced dimensions. We find that our method can sustain over 90% performance with an average KV cache compression rate of 60%.
arXiv Detail & Related papers (2024-10-16T08:34:51Z)
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
Key-Value ( KV) cache is crucial component in serving transformer-based autoregressive large language models (LLMs) Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; (2) KV cache compression at test time; and (3) KV cache compression at test time. We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
Memory-Efficient Vision Transformers: An Activation-Aware Mixed-Rank Compression Strategy [5.699098817569033]
This paper introduces an activation-aware model compression methodology that uses selective low-rank weight tensor approximations of different layers to reduce the parameter count of ViTs. The presented method significantly reduces the parameter count of DeiT-B by 60% with less than 1% accuracy drop on the ImageNet dataset. In addition to this, the presented compression technique can compress large DeiT/ViT models to have about the same model size as smaller DeiT/ViT variants while yielding up to 1.8% accuracy gain.
arXiv Detail & Related papers (2024-02-08T19:01:14Z)
E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning [55.50908600818483]
Fine-tuning large-scale pretrained vision models for new tasks has become increasingly parameter-intensive. We propose an Effective and Efficient Visual Prompt Tuning (E2VPT) approach for large-scale transformer-based model adaptation. Our approach outperforms several state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2023-07-25T19:03:21Z)
Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks. We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
oBERTa: Improving Sparse Transfer Learning via improved initialization, distillation, and pruning regimes [82.99830498937729]
oBERTa is an easy-to-use set of language models for Natural Language Processing. It allows NLP practitioners to obtain between 3.8 and 24.3 times faster models without expertise in model compression. We explore the use of oBERTa on seven representative NLP tasks.
arXiv Detail & Related papers (2023-03-30T01:37:19Z)
Gradient-based Intra-attention Pruning on Pre-trained Language Models [21.444503777215637]
We propose a structured pruning method GRAIN (Gradient-based Intra-attention pruning) GRAIN inspects and prunes intra-attention structures, which greatly expands the structure search space and enables more flexible models. Experiments on GLUE, SQuAD, and CoNLL 2003 show that GRAIN notably outperforms other methods, especially in the high sparsity regime.
arXiv Detail & Related papers (2022-12-15T06:52:31Z)
Extreme Compression for Pre-trained Transformers Made Simple and Efficient [31.719905773863566]
Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has been proposed to fit large NLP models on resource-constraint devices. We propose a simple yet effective compression pipeline for extreme compression, named XTC.
arXiv Detail & Related papers (2022-06-04T00:19:45Z)
DQ-BART: Efficient Sequence-to-Sequence Model via Joint Distillation and Quantization [75.72231742114951]
Large-scale pre-trained sequence-to-sequence models like BART and T5 achieve state-of-the-art performance on many generative NLP tasks. These models pose a great challenge in resource-constrained scenarios owing to their large memory requirements and high latency. We propose to jointly distill and quantize the model, where knowledge is transferred from the full-precision teacher model to the quantized and distilled low-precision student model.
arXiv Detail & Related papers (2022-03-21T18:04:25Z)
Multi-Dimensional Model Compression of Vision Transformer [21.8311401851523]
Vision transformers (ViT) have recently attracted considerable attentions, but the huge computational cost remains an issue for practical deployment. Previous ViT pruning methods tend to prune the model along one dimension solely. We advocate a multi-dimensional ViT compression paradigm, and propose to harness the redundancy reduction from attention head, neuron and sequence dimensions jointly.
arXiv Detail & Related papers (2021-12-31T19:54:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.