ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models
- URL: http://arxiv.org/abs/2312.05821v4
- Date: Tue, 29 Oct 2024 12:28:58 GMT
- Title: ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models
- Authors: Zhihang Yuan, Yuzhang Shang, Yue Song, Qiang Wu, Yan Yan, Guangyu Sun
- Abstract summary: We introduce a new post-training compression paradigm for Large Language Models (LLMs).
We find that the challenges of this task stem from the distribution variance in the LLM activations and the sensitivity difference among various kinds of layers.
We propose a training-free approach called Activation-aware Singular Value Decomposition (ASVD).
- Score: 28.231997641388343
- Abstract: In this paper, we introduce a new post-training compression paradigm for Large Language Models (LLMs) to facilitate their wider adoption. We delve into LLM weight low-rank decomposition, and find that the challenges of this task stem from the distribution variance in the LLM activations and the sensitivity difference among various kinds of layers. To address these issues, we propose a training-free approach called Activation-aware Singular Value Decomposition (ASVD). Specifically, ASVD manages activation outliers by transforming the weight matrix based on the activation distribution. This transformation allows the outliers in the activation matrix to be absorbed into the transformed weight matrix, thereby enhancing decomposition accuracy. Additionally, we propose an efficient iterative calibration process to optimize layer-specific decomposition by addressing the varying sensitivity of different LLM layers. In this way, ASVD can compress a network by 10%-30%. Based on the success of the low-rank decomposition of projection matrices in the self-attention module, we further introduce ASVD to compress the KV cache. By reducing the channel dimension of KV activations, memory requirements for KV cache can be largely reduced. ASVD can further achieve 50% KV cache reductions without performance drop in a training-free manner. Code is anonymously available in supplementary materials.
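The abstract describes the core ASVD step only in words: scale the weight matrix with a per-channel statistic of the input activations, run SVD on the scaled matrix so activation outliers are absorbed, truncate, and fold the scaling back into the factors. The NumPy sketch below illustrates that idea under stated assumptions; the absolute-mean channel statistic, the fixed rank, and all names are illustrative choices, not the paper's exact formulation or code.

```python
import numpy as np

def asvd_low_rank(W, X_calib, rank):
    """Illustrative activation-aware low-rank factorization (not the official ASVD code).

    W       : (out_features, in_features) weight, used as y = W @ x
    X_calib : (in_features, n_samples) calibration activations
    rank    : number of singular values to keep
    """
    # Per-input-channel scaling from activation magnitudes (assumption: absolute mean;
    # the paper defines its own activation-aware criterion).
    s = np.abs(X_calib).mean(axis=1) + 1e-6           # (in_features,)
    S_inv = np.diag(1.0 / s)

    # Decompose the scaled weight so large-activation channels dominate the factorization.
    U, sigma, Vt = np.linalg.svd(W * s, full_matrices=False)

    # Truncate and fold the scaling back: W ~= (U_r * sigma_r) @ (Vt_r @ S^{-1}).
    A = U[:, :rank] * sigma[:rank]                    # (out_features, rank)
    B = Vt[:rank, :] @ S_inv                          # (rank, in_features)
    return A, B

# Toy usage: replace y = W @ x with y = A @ (B @ x); the rank controls the compression ratio.
rng = np.random.default_rng(0)
W, X = rng.normal(size=(64, 64)), rng.normal(size=(64, 256))
A, B = asvd_low_rank(W, X, rank=16)
print(np.linalg.norm(W - A @ B) / np.linalg.norm(W))  # relative reconstruction error
```

Applying the same factorization to the key and value projection weights is how the abstract motivates reducing the channel dimension of the cached KV activations.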
Related papers
- Dobi-SVD: Differentiable SVD for LLM Compression and Some New Perspectives [59.46211685419206]
We argue that the optimal use of SVD lies in truncating activations, rather than merely using activations as an optimization distance.
We propose Dobi-SVD, which establishes a new, principled approach to SVD-based LLM compression.
arXiv Detail & Related papers (2025-02-04T21:17:51Z)
- AdaSVD: Adaptive Singular Value Decomposition for Large Language Models [84.60646883395454]
Singular Value Decomposition (SVD) has emerged as a promising compression technique for large language models (LLMs).
Existing SVD-based methods often struggle to effectively mitigate the errors introduced by SVD truncation.
We propose AdaSVD, an adaptive SVD-based LLM compression approach.
arXiv Detail & Related papers (2025-02-03T14:34:37Z)
- Efficient Adaptation of Pre-trained Vision Transformer via Householder Transformation [53.88562288388169]
A common strategy for Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViTs) involves adapting the model to downstream tasks.
We propose a novel PEFT approach inspired by Singular Value Decomposition (SVD) for representing the adaptation matrix.
SVD decomposes a matrix into the product of a left unitary matrix, a diagonal matrix of scaling values, and a right unitary matrix.
arXiv Detail & Related papers (2024-10-30T12:08:30Z)
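As a concrete instance of the factorization stated in the entry above, the minimal NumPy sketch below (an illustration, not code from any of these papers) splits a matrix into left singular vectors, singular values, and right singular vectors and checks the reconstruction.

```python
import numpy as np

# M = U @ diag(S) @ Vt: the columns of U and the rows of Vt are orthonormal,
# and S holds the non-negative scaling (singular) values in descending order.
M = np.random.default_rng(1).normal(size=(8, 5))
U, S, Vt = np.linalg.svd(M, full_matrices=False)

assert np.allclose(M, U @ np.diag(S) @ Vt)           # exact reconstruction
assert np.allclose(U.T @ U, np.eye(U.shape[1]))      # orthonormal left factors
assert np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0]))   # orthonormal right factors
```

Keeping only the largest entries of S yields the low-rank approximations used throughout the papers listed here.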
- AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations [36.63586957377984]
Due to their massive parameter count, large language models often require substantial storage space.
One research direction proposes to compress the models using integer replacements for floating-point numbers.
arXiv Detail & Related papers (2024-10-17T04:35:57Z)
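The AsymKV entry above describes quantization only as replacing floating-point values with integers. The sketch below shows the standard asymmetric (scale plus zero-point) quantize/dequantize round trip that such schemes build on; the per-tensor granularity and chosen bit-width are generic assumptions here, not AsymKV's layer-wise configuration.

```python
import numpy as np

def asym_quantize(x, n_bits):
    """Generic asymmetric uniform quantization (illustrative, not AsymKV itself)."""
    qmin, qmax = 0, 2 ** n_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin) + 1e-12
    zero_point = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int32)
    return q, scale, zero_point

def asym_dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# Toy KV tensor: lower bit-widths shrink the cache at the cost of reconstruction error.
kv = np.random.default_rng(2).normal(size=(4, 128)).astype(np.float32)
q, s, z = asym_quantize(kv, n_bits=2)
print(np.abs(kv - asym_dequantize(q, s, z)).mean())
```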
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate its memory footprint include (1) efficient attention variants integrated in upcycling stages and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
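To make the LoRC entry's mechanism concrete, the sketch below factorizes a key-projection weight with truncated SVD so that only a rank-r latent per token needs to be cached and the full key is rebuilt on demand; the shapes and the plain truncated SVD are assumptions for illustration, not LoRC's progressive compression strategy.

```python
import numpy as np

def low_rank_factor(W, rank):
    """Truncated-SVD factorization W ~= A @ B (illustrative)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank, :]       # (d_out, r), (r, d_in)

d_model, d_head, r = 256, 64, 16
rng = np.random.default_rng(3)
W_k = rng.normal(size=(d_head, d_model)) / np.sqrt(d_model)   # toy key projection
A_k, B_k = low_rank_factor(W_k, r)

x = rng.normal(size=(d_model,))        # one token's hidden state
latent = B_k @ x                       # cache this r-dimensional vector ...
k_approx = A_k @ latent                # ... and reconstruct the key when attention needs it
print(latent.shape, np.linalg.norm(W_k @ x - k_approx))
```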
- Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value (KV) caching is an important technique to accelerate the inference of large language models.
Existing methods often compromise precision or require extra data for calibration.
We introduce DecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv Detail & Related papers (2024-05-21T08:35:10Z)
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization [67.74400574357472]
LLMs are increasingly used for applications that require large context windows, where KV cache activations become the dominant contributor to memory consumption during inference.
Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in sub-4-bit precision.
Our work, KVQuant, facilitates low precision KV cache quantization by incorporating several novel methods.
arXiv Detail & Related papers (2024-01-31T18:58:14Z)