eDKM: An Efficient and Accurate Train-time Weight Clustering for Large
Language Models
- URL: http://arxiv.org/abs/2309.00964v2
- Date: Wed, 13 Sep 2023 19:17:37 GMT
- Title: eDKM: An Efficient and Accurate Train-time Weight Clustering for Large
Language Models
- Authors: Minsik Cho, Keivan A. Vahid, Qichen Fu, Saurabh Adya, Carlo C Del
Mundo, Mohammad Rastegari, Devang Naik, Peter Zatloukal
- Abstract summary: Differentiable KMeans Clustering, or DKM, has shown the state-of-the-art trade-off between compression ratio and accuracy regression.
We propose a memory-efficient DKM implementation, eDKM, powered by novel techniques that reduce the memory footprint of DKM by orders of magnitude.
- Score: 19.502740996431452
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Since large language models (LLMs) have demonstrated high-quality
performance on many complex language tasks, there is great interest in bringing
them to mobile devices for faster responses and better privacy protection.
However, the size of LLMs (i.e., billions of parameters) requires highly
effective compression to fit into storage-limited devices. Among many
compression techniques, weight clustering, a form of non-linear quantization, is
one of the leading candidates for LLM compression and is supported by modern
smartphones. Yet, its training overhead is prohibitive for LLM fine-tuning. In
particular, Differentiable KMeans Clustering (DKM) has shown the
state-of-the-art trade-off between compression ratio and accuracy regression,
but its large memory complexity makes it nearly impossible to apply to
train-time LLM compression. In this paper, we propose a memory-efficient DKM
implementation, eDKM, powered by novel techniques that reduce the memory
footprint of DKM by orders of magnitude. For a given tensor to be saved on the
CPU for DKM's backward pass, we compress it by applying uniquification and
sharding, after first checking whether an identical tensor has already been
copied to the CPU. Our experimental results demonstrate that eDKM can fine-tune
and compress a pretrained LLaMA 7B model from 12.6 GB to 2.5 GB (3 bits per
weight) on the Alpaca dataset, reducing the train-time memory footprint of a
decoder layer by 130$\times$ while delivering good accuracy on broader LLM
benchmarks (e.g., 77.7% on PIQA and 66.1% on Winogrande).
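The tensor-marshaling step described above can be pictured with a short, hypothetical PyTorch sketch. This is a simplification under stated assumptions, not the authors' implementation: the function names are invented, duplicates are detected by a cheap (data pointer, shape, dtype) key, and sharding of the index tensor is omitted.

```python
# Hypothetical sketch (not the eDKM source): offload a tensor needed for DKM's
# backward pass to the CPU, skipping tensors that were already offloaded and
# storing only unique values plus integer indices ("uniquification").
import torch

_cpu_cache = {}  # fingerprint of an already-offloaded tensor -> its CPU copy

def offload_for_backward(t: torch.Tensor):
    """Return a compact CPU-side representation of `t`."""
    key = (t.data_ptr(), tuple(t.shape), t.dtype)  # simplistic duplicate check
    if key in _cpu_cache:                          # duplicate: reuse the copy
        return _cpu_cache[key]
    # Clustered weights hold few distinct values, so keep one table of unique
    # values and per-element indices into that table instead of the raw tensor.
    values, indices = torch.unique(t.detach().flatten(), return_inverse=True)
    packed = (values.cpu(), indices.to(torch.int32).cpu(), tuple(t.shape))
    _cpu_cache[key] = packed
    return packed

def restore(packed, device="cuda"):
    """Rebuild the full tensor on the GPU when the backward pass needs it."""
    values, indices, shape = packed
    return values.to(device)[indices.to(device).long()].reshape(shape)
```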
Related papers
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The key-value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate its memory overhead include (1) efficient attention variants integrated in upcycling stages and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining (a generic low-rank factorization is sketched after this list).
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression [5.206085750261924]
Large Language Models (LLMs) require a significant amount of memory storage during inference.
In this paper, we take a step further to explore parameter sharing across different layers with singular value decomposition.
Comprehensive experiments demonstrate that Basis Sharing outperforms state-of-the-art SVD-based compression approaches.
arXiv Detail & Related papers (2024-10-02T14:30:02Z) - Low-Rank Quantization-Aware Training for LLMs [8.535254310145005]
Large language models (LLMs) are omnipresent; however, their practical deployment is challenging due to their ever-increasing computational and memory demands.
We propose LR-QAT -- a lightweight and memory-efficient QAT algorithm for LLMs.
Our method outperforms common post-training quantization (PTQ) approaches and reaches the same model performance as full-model QAT at a fraction of its memory usage.
arXiv Detail & Related papers (2024-06-10T15:44:22Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference [1.9639467358416092]
Transformers have emerged as the backbone of large language models (LLMs).
We propose Dynamic Memory Compression (DMC), a method for online key-value cache compression at inference time.
We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to 7x throughput increase during auto-regressive inference on an NVIDIA H100 GPU.
arXiv Detail & Related papers (2024-03-14T17:59:26Z) - BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z) - Compressing LLMs: The Truth is Rarely Pure and Never Simple [90.05366363633568]
The Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK) aims to redefine the evaluation protocol for compressed Large Language Models.
LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods.
LLM-KICK is designed to holistically assess compressed LLMs' ability for language understanding, reasoning, generation, in-context retrieval, in-context summarization, etc.
arXiv Detail & Related papers (2023-10-02T17:42:37Z) - SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight
Compression [76.73007709690306]
We introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique.
SpQR achieves relative accuracy losses of less than 1% in perplexity for highly-accurate LLaMA and Falcon LLMs.
This makes it possible to run a 33B-parameter LLM on a single 24 GB consumer GPU without performance degradation, with a 15% speedup.
arXiv Detail & Related papers (2023-06-05T17:53:28Z) - Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM
Inference with Transferable Prompt [96.24800696597707]
We introduce a new perspective to optimize this trade-off by prompting compressed models.
We propose a soft prompt learning method where we expose the compressed model to the prompt learning process.
Our experimental analysis suggests our soft prompt strategy greatly improves the performance of the 8x compressed LLaMA-7B model.
arXiv Detail & Related papers (2023-05-17T20:45:13Z)
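Several of the entries above (LoRC, Basis Sharing, LR-QAT) build on the same basic idea of replacing a weight matrix with a low-rank factorization. The following is a generic, illustrative PyTorch sketch of that idea only; the choice of rank, which matrices are factorized, and how the factors are shared across layers or quantized are specific to each paper and not reproduced here.

```python
# Generic illustration of truncated-SVD low-rank factorization; not taken from
# any of the papers above. W is approximated by two thin factors A @ B.
import torch

def low_rank_factorize(W: torch.Tensor, rank: int):
    """Return (A, B) with W ~= A @ B, where A is (out, rank) and B is (rank, in)."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # absorb the top singular values into A
    B = Vh[:rank, :]
    return A, B

# Example: factor a 4096x4096 projection at rank 256 (~8x fewer parameters).
W = torch.randn(4096, 4096)
A, B = low_rank_factorize(W, rank=256)
rel_err = torch.linalg.norm(W - A @ B) / torch.linalg.norm(W)
print(f"relative Frobenius error: {rel_err:.3f}")
```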