ReALLM: A general framework for LLM compression and fine-tuning
- URL: http://arxiv.org/abs/2405.13155v1
- Date: Tue, 21 May 2024 18:50:51 GMT
- Title: ReALLM: A general framework for LLM compression and fine-tuning
- Authors: Louis Leconte, Lisa Bedin, Van Minh Nguyen, Eric Moulines
- Abstract summary: ReALLM is a novel approach for compression and memory-efficient adaptation of pre-trained language models.
The weight-only quantization algorithm yields the best results on language generation tasks (C4 and WikiText-2) for a budget of $3$ bits without any training.
- Score: 11.738510106847414
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We introduce ReALLM, a novel approach for compression and memory-efficient adaptation of pre-trained language models that encompasses most of the post-training quantization and fine-tuning methods for a budget of $<4$ bits. Pre-trained matrices are decomposed into a high-precision low-rank component and a vector-quantized latent representation (using an autoencoder). During the fine-tuning step, only the low-rank components are updated. Our results show that pre-trained matrices exhibit different patterns. ReALLM adapts the shape of the encoder (small/large embedding, high/low bit VQ, etc.) to each matrix. ReALLM proposes to represent each matrix with a small embedding on $b$ bits and a neural decoder model $\mathcal{D}_\phi$ with its weights on $b_\phi$ bits. The decompression of a matrix requires only one embedding and a single forward pass with the decoder. Our weight-only quantization algorithm yields the best results on language generation tasks (C4 and WikiText-2) for a budget of $3$ bits without any training. With a budget of $2$ bits, ReALLM achieves state-of-the-art performance after fine-tuning on a small calibration dataset.
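For illustration, here is a minimal numpy sketch of the low-rank plus vector-quantized split described in the abstract. A plain k-means codebook and a table lookup stand in for ReALLM's learned autoencoder and neural decoder $\mathcal{D}_\phi$; the shapes, rank, and bit widths are placeholder assumptions, not the paper's settings.

```python
# Hedged sketch: low-rank + vector-quantized decomposition in the spirit of ReALLM.
import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)  # stand-in pretrained weight

# 1) High-precision low-rank component via truncated SVD
#    (the part ReALLM keeps trainable during fine-tuning).
r = 16
U, S, Vt = np.linalg.svd(W, full_matrices=False)
L, R = U[:, :r] * S[:r], Vt[:r, :]
residual = W - L @ R

# 2) Vector-quantize the residual: split into d-dim blocks and
#    learn a 2^b-entry codebook (the "embedding on b bits").
d, b = 8, 8
blocks = residual.reshape(-1, d)
codebook, _ = kmeans(blocks, 2 ** b)
codes, _ = vq(blocks, codebook)

# 3) "Decompression" here is one table lookup; ReALLM uses a neural
#    decoder D_phi (a single forward pass) instead.
W_hat = codebook[codes].reshape(W.shape) + L @ R
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```

In the full method, only the low-rank factors `L` and `R` would receive gradients during fine-tuning; the quantized latent stays frozen.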
Related papers
- OneBit: Towards Extremely Low-bit Large Language Models [66.29839811207617]
This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs.
Experiments indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes.
arXiv Detail & Related papers (2024-02-17T14:26:57Z)
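A minimal sketch in the spirit of OneBit's 1-bit decomposition: keep sign(W) at 1 bit per weight plus two FP16 scale vectors. Deriving the scales from a rank-1 factorization of |W| is our assumption here, not necessarily the paper's exact recipe.

```python
# Hedged sketch: sign matrix (1 bit) plus two FP16 value vectors.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 128)).astype(np.float32)

S = np.sign(W)                                        # 1 bit per weight
# Rank-1 approximation of |W| supplies per-row/per-column scales (our assumption).
U, sv, Vt = np.linalg.svd(np.abs(W), full_matrices=False)
a = (U[:, 0] * np.sqrt(sv[0])).astype(np.float16)     # row scales
b = (Vt[0, :] * np.sqrt(sv[0])).astype(np.float16)    # column scales

W_hat = S * np.outer(a.astype(np.float32), b.astype(np.float32))
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```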
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
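A minimal sketch of additive (multi-codebook) quantization: each weight group is reconstructed as a sum of one codeword per codebook. Stage-wise k-means with greedy assignment stands in for AQLM's jointly learned codebooks and beam-search encoding.

```python
# Hedged sketch: additive quantization via residual (stage-wise) k-means.
import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(0)
groups = rng.standard_normal((4096, 8)).astype(np.float32)  # 8-dim weight groups

M, K = 2, 256                      # two additive codebooks, 8 bits each
resid = groups.copy()
codebooks, codes = [], []
for _ in range(M):
    cb, _ = kmeans(resid, K)       # learn this stage's codebook on the residual
    idx, _ = vq(resid, cb)         # greedy nearest-codeword assignment
    codebooks.append(cb)
    codes.append(idx)
    resid -= cb[idx]               # what later codebooks must still explain

# each group is reconstructed as a SUM of one codeword per codebook
approx = sum(cb[i] for cb, i in zip(codebooks, codes))
print("relative error:", np.linalg.norm(groups - approx) / np.linalg.norm(groups))
```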
- LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning [66.85589263870702]
Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component.
Experiments on finetuning RoBERTa and LLaMA-2 demonstrate that our low-rank plus quantized matrix decomposition approach (LQ-LoRA) outperforms strong QLoRA and GPTQ-LoRA baselines.
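A minimal sketch of the iterative split $W \approx Q + L_1 L_2$ the summary describes: alternate between quantizing what the low-rank part misses and refitting the low-rank part to the quantization error. The crude uniform quantizer below is a stand-in for the paper's quantization scheme.

```python
# Hedged sketch: alternating low-rank + quantized decomposition (LQ-LoRA style).
import numpy as np

def quantize_uniform(x, bits=2):
    # toy symmetric uniform quantizer (a stand-in, not the paper's quantizer)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / max(qmax, 1) + 1e-12
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)

r = 32
L1 = np.zeros((256, r), dtype=np.float32)
L2 = np.zeros((r, 256), dtype=np.float32)
for _ in range(10):                       # alternate until the split stabilizes
    Q = quantize_uniform(W - L1 @ L2)     # quantize what low-rank misses
    U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
    L1, L2 = U[:, :r] * S[:r], Vt[:r, :]  # refit low-rank to the quantization error

err = np.linalg.norm(W - (Q + L1 @ L2)) / np.linalg.norm(W)
print("relative error:", err)
```

In the LoRA-style fine-tuning step, only `L1` and `L2` would be updated while `Q` stays fixed.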
arXiv Detail & Related papers (2023-11-20T18:57:41Z)
- Matrix Compression via Randomized Low Rank and Low Precision Factorization [47.902465710511485]
Modern matrices can involve billions of elements, making their storage and processing quite demanding in terms of computational resources and memory usage.
We propose an algorithm that exploits the low-rank structure of such matrices to obtain a decomposition of any matrix $\mathbf{A}$ as $\mathbf{A} \approx \mathbf{L}\mathbf{R}$, where $\mathbf{L}$ and $\mathbf{R}$ are the low-rank factors.
We empirically demonstrate the efficacy of our algorithm in image compression, nearest neighbor classification of image and text embeddings, and compressing the layers of LLaMA-7b.
arXiv Detail & Related papers (2023-10-17T06:56:57Z)
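A minimal sketch of randomized low-rank factorization with low-precision factors: a Gaussian sketch finds the range of $\mathbf{A}$, and the resulting factors are stored in float16 as a stand-in for the paper's low-precision formats.

```python
# Hedged sketch: randomized range finder + low-precision factors L, R.
import numpy as np

rng = np.random.default_rng(0)
# a matrix with (approximately) low-rank structure, as the method assumes
A = (rng.standard_normal((512, 24)) @ rng.standard_normal((24, 256))).astype(np.float32)

r = 32
Omega = rng.standard_normal((256, r)).astype(np.float32)
Q, _ = np.linalg.qr(A @ Omega)           # randomized range finder
L = Q.astype(np.float16)                 # (512, r) low-precision left factor
R = (Q.T @ A).astype(np.float16)         # (r, 256) low-precision right factor

A_hat = L.astype(np.float32) @ R.astype(np.float32)
print("relative error:", np.linalg.norm(A - A_hat) / np.linalg.norm(A))
```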
- Learning Low-Rank Representations for Model Compression [6.721845345130468]
We propose a Low-Rank Representation Vector Quantization ($\text{LR}^2\text{VQ}$) method that outperforms previous VQ algorithms in various tasks and architectures.
In our method, the compression ratio can be directly controlled by $m$, and the final accuracy is solely determined by $\tilde{d}$.
With a proper $\tilde{d}$, we evaluate $\text{LR}^2\text{VQ}$ with ResNet-18/ResNet-50 on the ImageNet classification dataset, achieving 2.8%/1.0% top-1 accuracy improvements.
arXiv Detail & Related papers (2022-11-21T12:15:28Z)
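A minimal sketch of combining a low-rank representation with vector quantization, assuming (as the summary suggests) that sub-vector length $m$ governs the compression ratio and subspace dimension $\tilde{d}$ governs fidelity; the SVD projection and k-means codebook below are our stand-ins for the paper's learned components.

```python
# Hedged sketch: low-rank representation + VQ on length-m sub-vectors.
import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(0)
filters = rng.standard_normal((512, 64)).astype(np.float32)  # flattened conv filters

d_tilde, m, k = 32, 4, 256
_, _, Vt = np.linalg.svd(filters, full_matrices=False)
Z = filters @ Vt[:d_tilde].T          # low-rank representation, shape (512, d_tilde)

sub = Z.reshape(-1, m)                # sub-vector length m sets the compression ratio
codebook, _ = kmeans(sub, k)
codes, _ = vq(sub, codebook)

Z_hat = codebook[codes].reshape(Z.shape)
filters_hat = Z_hat @ Vt[:d_tilde]    # map back from the d_tilde-dim subspace
rel = np.linalg.norm(filters - filters_hat) / np.linalg.norm(filters)
print("relative error:", rel)
```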
- Monarch: Expressive Structured Matrices for Efficient and Accurate Training [64.6871423399431]
Large neural networks excel in many domains, but they are expensive to train and fine-tune.
A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones.
We propose a class of matrices (Monarch) that is hardware-efficient.
arXiv Detail & Related papers (2022-04-01T17:37:29Z)
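A minimal sketch of a Monarch-style matrix-vector product, assuming the common parametrization as two block-diagonal factors interleaved with a transpose permutation; multiplying then costs $O(n\sqrt{n})$ instead of $O(n^2)$ for a dense matrix.

```python
# Hedged sketch: block-diagonal x permute x block-diagonal structured matmul.
import numpy as np

rng = np.random.default_rng(0)
m = 16; n = m * m                          # n = 256, sqrt(n)-sized blocks
B1 = rng.standard_normal((m, m, m)).astype(np.float32)  # m blocks, each m x m
B2 = rng.standard_normal((m, m, m)).astype(np.float32)

def monarch_matvec(x):
    # view x as an (m, m) grid, apply block-diagonal B1, transpose-permute,
    # apply block-diagonal B2, flatten back to length n
    X = x.reshape(m, m)
    Y = np.einsum("bij,bj->bi", B1, X)     # first block-diagonal factor
    Y = Y.T                                # the permutation between factors
    Z = np.einsum("bij,bj->bi", B2, Y)     # second block-diagonal factor
    return Z.reshape(n)

x = rng.standard_normal(n).astype(np.float32)
print(monarch_matvec(x)[:4])
```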
- Efficient Decoding of Surface Code Syndromes for Error Correction in Quantum Computing [0.09236074230806578]
We propose a two-level (low and high) ML-based decoding scheme, where the first level corrects errors on physical qubits and the second one corrects any existing logical errors.
Our results show that our proposed decoding method achieves $\sim 10\times$ and $\sim 2\times$ higher values of pseudo-threshold and threshold, respectively.
We show that using more sophisticated ML models with higher training/testing times does not significantly improve decoder performance.
arXiv Detail & Related papers (2021-10-21T04:54:44Z)
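A purely schematic sketch of the two-level decoder idea: one model proposes physical-qubit corrections from the syndrome, and a second model corrects any remaining logical error. The random bits below are placeholders, not a stabilizer simulation, and the model sizes are our assumptions.

```python
# Schematic two-level ML decoder on placeholder data (not real syndromes).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_syndrome, n_qubits = 24, 25
X = rng.integers(0, 2, (1000, n_syndrome))       # placeholder syndrome bits
y_phys = rng.integers(0, 2, (1000, n_qubits))    # placeholder physical errors
y_logical = rng.integers(0, 2, 1000)             # placeholder logical error flag

# level 1: predict per-qubit corrections from the syndrome (multi-label)
low = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300).fit(X, y_phys)
# level 2: predict any remaining logical error from syndrome + level-1 output
high = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300).fit(
    np.hstack([X, y_phys]), y_logical)

s = X[:1]
phys_fix = low.predict(s)                        # correct physical qubits first
logical_fix = high.predict(np.hstack([s, phys_fix]))
print(phys_fix.shape, logical_fix)
```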
- Compressing 1D Time-Channel Separable Convolutions using Sparse Random Ternary Matrices [65.4388266814055]
We replace the 1x1-convolutions in 1D time-channel separable convolutions with constant, sparse random ternary matrices with weights in $\{-1, 0, +1\}$.
For command recognition on Google Speech Commands v1, we improve the state-of-the-art accuracy from $97.21\%$ to $97.41\%$ at the same network size.
For speech recognition on Librispeech, we halve the number of weights to be trained while sacrificing only about $1\%$ of the floating-point baseline's word error rate.
arXiv Detail & Related papers (2021-03-31T15:09:20Z)
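A minimal PyTorch sketch of that idea: a pointwise (1x1) Conv1d whose weight is a fixed, sparse, random ternary matrix in $\{-1, 0, +1\}$, frozen so it is never trained. The density value and layer sizes are placeholder assumptions.

```python
# Hedged sketch: constant sparse random ternary 1x1 convolution.
import torch
import torch.nn as nn

def sparse_ternary_pointwise(c_in, c_out, density=0.2, seed=0):
    g = torch.Generator().manual_seed(seed)
    signs = torch.randint(0, 2, (c_out, c_in, 1), generator=g) * 2 - 1  # {-1, +1}
    mask = torch.rand(c_out, c_in, 1, generator=g) < density           # sparsity
    conv = nn.Conv1d(c_in, c_out, kernel_size=1, bias=False)
    with torch.no_grad():
        conv.weight.copy_((signs * mask).float())    # ternary {-1, 0, +1} weights
    conv.weight.requires_grad_(False)                # constant: never updated
    return conv

layer = sparse_ternary_pointwise(64, 128)
x = torch.randn(2, 64, 100)                          # (batch, channels, time)
print(layer(x).shape)                                # torch.Size([2, 128, 100])
```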