Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression
- URL: http://arxiv.org/abs/2410.03765v1
- Date: Wed, 2 Oct 2024 14:30:02 GMT
- Title: Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression
- Authors: Jingcun Wang, Yu-Guang Chen, Ing-Chao Lin, Bing Li, Grace Li Zhang
- Abstract summary: Large Language Models (LLMs) require a significant amount of memory for inference.
In this paper, we take a step further to explore parameter sharing across different layers with singular value decomposition.
Comprehensive experiments demonstrate that Basis Sharing outperforms state-of-the-art SVD-based compression approaches.
- Score: 5.206085750261924
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) have achieved remarkable breakthroughs. However, the huge number of parameters in LLMs requires a significant amount of memory during inference, which prevents their practical deployment in many applications. To reduce the memory storage of LLMs, singular value decomposition (SVD) provides a promising solution to approximate weight matrices for compressing LLMs. In this paper, we take a step further and explore parameter sharing across different layers with SVD to achieve more effective compression for LLMs. Specifically, weight matrices in different layers are decomposed and represented as a linear combination of a set of shared basis vectors and unique coefficients. The types of weight matrices and the layer selection for basis sharing are examined when compressing LLMs to maintain performance. Comprehensive experiments demonstrate that Basis Sharing outperforms state-of-the-art SVD-based compression approaches and parameter sharing techniques, especially under large compression ratios. Code is available at: https://github.com/TUDa-HWAI/Basis_Sharing
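The decomposition described in the abstract, a shared basis combined with layer-specific coefficients, can be sketched in a few lines of NumPy. This is only an illustration of the idea, not the released implementation: the layer grouping, the rank, and the use of a plain truncated SVD over the concatenated weights are assumptions made for the example.

```python
# Minimal sketch of cross-layer basis sharing with SVD (illustrative only).
# Assumptions: all matrices in a group have shape (d_out, d_in), the group
# size and rank are chosen freely, and a plain truncated SVD of the
# concatenated weights yields the shared basis.
import numpy as np

def share_basis(weights, rank):
    """Approximate each W_l as B @ C_l, with a basis B shared across layers
    and layer-specific coefficients C_l."""
    stacked = np.concatenate(weights, axis=1)           # (d_out, L * d_in)
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    B = U[:, :rank]                                     # shared basis vectors
    coeffs = [B.T @ W for W in weights]                 # unique coefficients per layer
    return B, coeffs

# Toy group: four layers' projection matrices of size 64 x 64.
rng = np.random.default_rng(0)
layer_weights = [rng.standard_normal((64, 64)) for _ in range(4)]
B, coeffs = share_basis(layer_weights, rank=16)

# Each layer is reconstructed from the shared basis and its own coefficients.
recon = B @ coeffs[0]
print(recon.shape, np.linalg.norm(layer_weights[0] - recon) / np.linalg.norm(layer_weights[0]))
```

With this layout, storage for a group of L matrices drops from L * d_out * d_in parameters to d_out * rank + L * rank * d_in, at the cost of the truncation error; the paper additionally examines which weight-matrix types and which layers can share a basis without hurting performance.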
Related papers
- LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment [36.958867918858296]
Large language models (LLMs) have demonstrated strong capabilities, but their high demand for computation and storage hinders practical application.
We present the Large Language Model Compression Benchmark (LLMCBench), a rigorously designed benchmark with an in-depth analysis of LLM compression algorithms.
arXiv Detail & Related papers (2024-10-28T14:45:01Z) - SpaLLM: Unified Compressive Adaptation of Large Language Models with Sketching [32.4599581528901]
"Two-tower" architecture is used for compressing pre-trained LLM parameters into compact representations and fine-tuning the additive full-precision adapter.
We propose SpaLLM (Sketched Adapting of LLMs), a novel compressive adaptation approach for LLMs.
We show that SpaLLM sketches pre-trained LLM weights into lookup tables and directly fine-tunes the values in these tables.
arXiv Detail & Related papers (2024-10-08T20:58:24Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs), but its memory footprint grows with sequence length.
Existing approaches to mitigate this issue include (1) efficient attention variants integrated in upcycling stages and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging [14.123313596780726]
We propose Manifold-Based Knowledge Alignment and Layer Merging Compression (MKA).
MKA uses manifold learning and the Normalized Pairwise Information Bottleneck measure to merge similar layers, reducing model size while preserving essential performance.
Our findings show that MKA not only preserves model performance but also achieves substantial compression ratios, outperforming traditional pruning methods.
arXiv Detail & Related papers (2024-06-24T05:57:55Z) - Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models [79.46938238953916]
Fine-tuning large language models (LLMs) to diverse applications is crucial to meet complex demands.
Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs.
In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs.
arXiv Detail & Related papers (2024-06-13T07:57:27Z) - SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression [14.818355326032538]
We propose SVD-LLM, a new SVD-based compression method for Large Language Models (LLMs).
SVD-LLM incorporates a truncation-aware data whitening strategy to ensure a direct mapping between singular values and compression loss (see the sketch after this list).
Our results demonstrate the superiority of SVD-LLM over state-of-the-arts, especially at high model compression ratios.
arXiv Detail & Related papers (2024-03-12T07:31:18Z) - Head-wise Shareable Attention for Large Language Models [56.92068213969036]
Large Language Models (LLMs) suffer from a huge number of parameters, which restricts their deployment on edge devices.
Weight sharing is one promising solution that encourages weight reuse, effectively reducing memory usage with less performance drop.
We present a perspective on head-wise shareable attention for large language models.
arXiv Detail & Related papers (2024-02-19T04:19:36Z) - BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
For the first time, it achieves high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z) - Compressing LLMs: The Truth is Rarely Pure and Never Simple [90.05366363633568]
The Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK) aims to redefine the evaluation protocol for compressed Large Language Models.
LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods.
LLM-KICK is designed to holistically assess compressed LLMs' ability for language understanding, reasoning, generation, in-context retrieval, in-context summarization, etc.
arXiv Detail & Related papers (2023-10-02T17:42:37Z) - Condensing Multilingual Knowledge with Lightweight Language-Specific Modules [52.973832863842546]
We introduce the Language-Specific Matrix Synthesis (LMS) method.
This approach constructs LS modules by generating low-rank matrices from two significantly smaller matrices.
We condense multilingual knowledge from multiple LS modules into a single shared module with the Fuse Distillation (FD) technique.
arXiv Detail & Related papers (2023-05-23T12:21:38Z)
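As a companion to the SVD-LLM entry above, the sketch below illustrates the general idea of truncation-aware whitening: the weight matrix is multiplied by a Cholesky factor of the calibration activation covariance before the truncated SVD, so that the discarded singular values correspond directly to the activation-space reconstruction error. The shapes, rank, and synthetic calibration data are assumptions made for illustration; this is not the authors' code.

```python
# Hedged sketch of truncation-aware whitening before SVD compression.
# Illustrative only: real methods use calibration activations from actual
# data, per-matrix rank selection, and careful numerical treatment.
import numpy as np

def whitened_truncated_svd(W, X, rank):
    """W: (d_out, d_in) weight; X: (d_in, n) calibration activations."""
    cov = X @ X.T + 1e-6 * np.eye(X.shape[0])      # jittered activation covariance
    S = np.linalg.cholesky(cov)                    # S @ S.T == cov
    U, sig, Vt = np.linalg.svd(W @ S, full_matrices=False)
    W_hat = (U[:, :rank] * sig[:rank]) @ Vt[:rank] @ np.linalg.inv(S)
    return W_hat, sig

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 64))
X = rng.standard_normal((64, 512))                 # hypothetical calibration batch
W_hat, sig = whitened_truncated_svd(W, X, rank=32)

# The activation-space error ||(W - W_hat) @ X||_F matches the energy of the
# truncated singular values (up to the small jitter term above).
print(np.linalg.norm((W - W_hat) @ X), np.sqrt(np.sum(sig[32:] ** 2)))
```

Without the whitening step, a plain truncated SVD minimizes the error on the weights themselves rather than on the activations the layer actually produces, which is the gap the whitening is meant to close.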