eDKM: An Efficient and Accurate Train-time Weight Clustering for Large
Language Models
- URL: http://arxiv.org/abs/2309.00964v2
- Date: Wed, 13 Sep 2023 19:17:37 GMT
- Title: eDKM: An Efficient and Accurate Train-time Weight Clustering for Large
Language Models
- Authors: Minsik Cho, Keivan A. Vahid, Qichen Fu, Saurabh Adya, Carlo C Del
Mundo, Mohammad Rastegari, Devang Naik, Peter Zatloukal
- Abstract summary: Differentiable KMeans Clustering, or DKM, has shown the state-of-the-art trade-off between compression ratio and accuracy regression.
We propose a memory-efficient DKM implementation, eDKM, powered by novel techniques that reduce the memory footprint of DKM by orders of magnitude.
- Score: 19.502740996431452
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Since large language models (LLMs) have demonstrated high-quality
performance on many complex language tasks, there is great interest in bringing
them to mobile devices for faster responses and better privacy protection.
However, the size of LLMs (i.e., billions of parameters) requires highly
effective compression to fit into storage-limited devices. Among many
compression techniques, weight clustering, a form of non-linear quantization, is
one of the leading candidates for LLM compression and is supported by modern
smartphones. Yet, its training overhead is prohibitive for LLM fine-tuning. In
particular, Differentiable KMeans Clustering (DKM) has shown the
state-of-the-art trade-off between compression ratio and accuracy regression,
but its large memory complexity makes it nearly impossible to apply to
train-time LLM compression. In this paper, we propose a memory-efficient DKM
implementation, eDKM, powered by novel techniques that reduce the memory
footprint of DKM by orders of magnitude. For a given tensor to be saved on the
CPU for DKM's backward pass, we compress it by applying uniquification and
sharding, after first checking whether an identical tensor has already been
copied to the CPU. Our experimental results demonstrate that eDKM can fine-tune
and compress a pretrained LLaMA 7B model from 12.6 GB to 2.5 GB (3 bits per
weight) on the Alpaca dataset, reducing the train-time memory footprint of a
decoder layer by 130$\times$ while delivering good accuracy on broader LLM
benchmarks (e.g., 77.7% on PIQA and 66.1% on Winogrande).
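The tensor-marshaling step described above can be pictured with a short, hypothetical PyTorch sketch. This is a simplification under stated assumptions, not the authors' implementation: the function names are invented, duplicates are detected by a cheap (data pointer, shape, dtype) key, and sharding of the index tensor is omitted.

```python
# Hypothetical sketch (not the eDKM source): offload a tensor needed for DKM's
# backward pass to the CPU, skipping tensors that were already offloaded and
# storing only unique values plus integer indices ("uniquification").
import torch

_cpu_cache = {}  # fingerprint of an already-offloaded tensor -> its CPU copy

def offload_for_backward(t: torch.Tensor):
    """Return a compact CPU-side representation of `t`."""
    key = (t.data_ptr(), tuple(t.shape), t.dtype)  # simplistic duplicate check
    if key in _cpu_cache:                          # duplicate: reuse the copy
        return _cpu_cache[key]
    # Clustered weights hold few distinct values, so keep one table of unique
    # values and per-element indices into that table instead of the raw tensor.
    values, indices = torch.unique(t.detach().flatten(), return_inverse=True)
    packed = (values.cpu(), indices.to(torch.int32).cpu(), tuple(t.shape))
    _cpu_cache[key] = packed
    return packed

def restore(packed, device="cuda"):
    """Rebuild the full tensor on the GPU when the backward pass needs it."""
    values, indices, shape = packed
    return values.to(device)[indices.to(device).long()].reshape(shape)
```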
Related papers
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The key-value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate its memory overhead include (1) efficient attention variants integrated in upcycling stages and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining (a generic low-rank factorization is sketched after this list).
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression [5.206085750261924]
Large Language Models (LLMs) require a significant amount of memory storage during inference.
In this paper, we take a step further to explore parameter sharing across different layers with singular value decomposition.
Comprehensive experiments demonstrate that Basis Sharing outperforms state-of-the-art SVD-based compression approaches.
arXiv Detail & Related papers (2024-10-02T14:30:02Z) - Low-Rank Quantization-Aware Training for LLMs [8.535254310145005]
Large language models (LLMs) are omnipresent; however, their practical deployment is challenging due to their ever-increasing computational and memory demands.
We propose LR-QAT -- a lightweight and memory-efficient QAT algorithm for LLMs.
Our method outperforms common post-training quantization (PTQ) approaches and reaches the same model performance as full-model QAT at a fraction of its memory usage.
arXiv Detail & Related papers (2024-06-10T15:44:22Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference [1.9639467358416092]
Transformers have emerged as the backbone of large language models (LLMs).
We propose Dynamic Memory Compression (DMC), a method for online key-value cache compression at inference time.
We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to 7x throughput increase during auto-regressive inference on an NVIDIA H100 GPU.
arXiv Detail & Related papers (2024-03-14T17:59:26Z) - BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z) - Compressing LLMs: The Truth is Rarely Pure and Never Simple [90.05366363633568]
The Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK) aims to redefine the evaluation protocol for compressed Large Language Models.
LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods.
LLM-KICK is designed to holistically assess compressed LLMs' ability for language understanding, reasoning, generation, in-context retrieval, in-context summarization, etc.
arXiv Detail & Related papers (2023-10-02T17:42:37Z) - SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight
Compression [76.73007709690306]
We introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique.
SpQR achieves relative accuracy losses of less than 1% in perplexity for highly-accurate LLaMA and Falcon LLMs.
This makes it possible to run a 33B-parameter LLM on a single 24 GB consumer GPU without performance degradation, with a 15% speedup.
arXiv Detail & Related papers (2023-06-05T17:53:28Z) - Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM
Inference with Transferable Prompt [96.24800696597707]
We introduce a new perspective to optimize this trade-off by prompting compressed models.
We propose a soft prompt learning method where we expose the compressed model to the prompt learning process.
Our experimental analysis suggests our soft prompt strategy greatly improves the performance of the 8x compressed LLaMA-7B model.
arXiv Detail & Related papers (2023-05-17T20:45:13Z)
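Several of the entries above (LoRC, Basis Sharing, LR-QAT) build on the same basic idea of replacing a weight matrix with a low-rank factorization. The following is a generic, illustrative PyTorch sketch of that idea only; the choice of rank, which matrices are factorized, and how the factors are shared across layers or quantized are specific to each paper and not reproduced here.

```python
# Generic illustration of truncated-SVD low-rank factorization; not taken from
# any of the papers above. W is approximated by two thin factors A @ B.
import torch

def low_rank_factorize(W: torch.Tensor, rank: int):
    """Return (A, B) with W ~= A @ B, where A is (out, rank) and B is (rank, in)."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # absorb the top singular values into A
    B = Vh[:rank, :]
    return A, B

# Example: factor a 4096x4096 projection at rank 256 (~8x fewer parameters).
W = torch.randn(4096, 4096)
A, B = low_rank_factorize(W, rank=256)
rel_err = torch.linalg.norm(W - A @ B) / torch.linalg.norm(W)
print(f"relative Frobenius error: {rel_err:.3f}")
```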