SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight
Compression
- URL: http://arxiv.org/abs/2306.03078v1
- Date: Mon, 5 Jun 2023 17:53:28 GMT
- Title: SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight
Compression
- Authors: Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev,
Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan
Alistarh
- Abstract summary: We introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique.
SpQR achieves relative accuracy losses of less than 1% in perplexity for highly-accurate LLaMA and Falcon LLMs.
This makes it possible to run a 33B-parameter LLM on a single 24 GB consumer GPU without performance degradation, at a 15% speedup.
- Score: 76.73007709690306
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in large language model (LLM) pretraining have led to
high-quality LLMs with impressive abilities. By compressing such LLMs via
quantization to 3-4 bits per parameter, they can fit into memory-limited
devices such as laptops and mobile phones, enabling personalized use. However,
quantization down to 3-4 bits per parameter usually leads to moderate-to-high
accuracy losses, especially for smaller models in the 1-10B parameter range,
which are well-suited for edge deployments. To address this accuracy issue, we
introduce the Sparse-Quantized Representation (SpQR), a new compressed format
and quantization technique which enables for the first time near-lossless
compression of LLMs across model scales, while reaching similar compression
levels to previous methods. SpQR works by identifying and isolating outlier
weights, which cause particularly-large quantization errors, and storing them
in higher precision, while compressing all other weights to 3-4 bits. With
this scheme, SpQR achieves relative accuracy losses of less than 1% in
perplexity for highly-accurate LLaMA and Falcon LLMs, making it possible to
run a 33B-parameter LLM on a single 24 GB consumer GPU without any
performance degradation, at a 15% speedup, thus making powerful LLMs
available to consumers without downsides. SpQR comes with efficient
algorithms both for encoding weights into its format and for decoding them
efficiently at runtime.
Specifically, we provide an efficient GPU inference algorithm for SpQR which
yields faster inference than 16-bit baselines at similar accuracy, while
enabling memory compression gains of more than 4x.
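The core mechanism described above, splitting a weight matrix into a low-bit dense part plus a small set of full-precision outliers, can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's encoder: the group size, the 1% outlier budget, and the error-based outlier rule are assumptions chosen for the example.

```python
import numpy as np

def quantize_dequantize(group, bits=3):
    """Uniform min-max quantization of a 1-D weight group, then
    dequantization, so per-weight reconstruction error can be measured."""
    levels = 2 ** bits - 1
    lo, hi = group.min(), group.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    return np.round((group - lo) / scale) * scale + lo

def dense_and_sparse_split(W, bits=3, group_size=16, outlier_frac=0.01):
    """Quantize W in small groups, then keep the worst-quantized weights
    ("outliers") at full precision in a sparse side table."""
    flat = W.reshape(-1).astype(np.float32)
    recon = np.empty_like(flat)
    for s in range(0, flat.size, group_size):
        recon[s:s + group_size] = quantize_dequantize(flat[s:s + group_size], bits)
    error = np.abs(flat - recon)
    k = max(1, int(outlier_frac * flat.size))
    outlier_idx = np.argpartition(error, -k)[-k:]    # largest quantization errors
    dense = recon.copy()
    dense[outlier_idx] = 0.0                         # removed from the dense part
    sparse = {int(i): float(flat[i]) for i in outlier_idx}  # full precision
    return dense.reshape(W.shape), sparse

def reconstruct(dense, sparse):
    """Decode: low-bit dense part plus sparse full-precision outliers."""
    out = dense.reshape(-1).copy()
    for i, v in sparse.items():
        out[i] = v
    return out.reshape(dense.shape)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)
W[3, 7] = 25.0                                       # plant one outlier weight
dense, sparse = dense_and_sparse_split(W)
err = np.abs(W - reconstruct(dense, sparse)).mean()
print(f"{len(sparse)} outliers kept, mean abs error {err:.4f}")
```

The planted outlier stretches the min-max range of its group and inflates the quantization error of its neighbors, which is exactly the failure mode a sparse full-precision side table absorbs.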
Related papers
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [67.67135738642547]
Post-training quantization (PTQ) is a powerful compression technique investigated in large language models (LLMs).
Existing PTQ methods are not ideal in terms of accuracy and efficiency, especially at bit-widths below 4.
This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
- FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization [6.931020818874328]
We introduce a method called FlattenQuant, which significantly reduces the maximum value of a tensor by flattening its large channels, achieving low-bit per-tensor quantization with minimal accuracy loss (a sketch of the flattening idea follows this entry).
Our work achieves up to 2× speedup and 2.3× memory reduction for LLMs with negligible loss in accuracy.
arXiv Detail & Related papers (2024-02-28T02:00:34Z)
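A hedged reading of "flattening the large channels": split any input channel whose peak magnitude exceeds a threshold into several scaled-down copies and duplicate the matching weight rows, so the matrix product is unchanged while the tensor's maximum value shrinks. The threshold and splitting rule below are assumptions for illustration, not FlattenQuant's exact procedure.

```python
import numpy as np

def flatten_large_channels(X, W, threshold):
    """Split each input channel of X whose peak |value| exceeds `threshold`
    into n scaled-down copies, duplicating the matching rows of W so that
    X @ W is preserved while max(|X|) shrinks for per-tensor quantization."""
    new_cols, new_rows = [], []
    for j in range(X.shape[1]):
        peak = np.abs(X[:, j]).max()
        n = max(1, int(np.ceil(peak / threshold)))  # copies for this channel
        for _ in range(n):
            new_cols.append(X[:, j] / n)   # each copy carries 1/n of the value
            new_rows.append(W[j, :])       # weight row repeated to compensate
    return np.stack(new_cols, axis=1), np.stack(new_rows, axis=0)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8)).astype(np.float32)
X[:, 2] *= 50.0                            # one channel dominates the max
W = rng.standard_normal((8, 16)).astype(np.float32)
Xf, Wf = flatten_large_channels(X, W, threshold=5.0)
print(np.abs(X).max(), "->", np.abs(Xf).max())     # much smaller peak value
print(np.allclose(X @ W, Xf @ Wf, atol=1e-4))      # matmul result unchanged
```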
- FinGPT-HPC: Efficient Pretraining and Finetuning Large Language Models for Financial Applications with High-Performance Computing [10.47214968497857]
We present high-performance methods that exploit low-rank structures to pretrain and finetune large language models.
Our methods achieve a speedup of 1.3X and a model compression ratio of 2.64X for pretraining without an accuracy drop.
For finetuning, our methods achieve average accuracy increases of 6.3% on general tasks and 24.0% on financial tasks (a low-rank sketch follows this entry).
arXiv Detail & Related papers (2024-02-21T05:03:17Z)
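As a minimal illustration of exploiting low-rank structure, the sketch below replaces a dense weight matrix with two skinny SVD factors; the sizes, the rank, and the SVD-based factorization are assumptions for the example, not FinGPT-HPC's actual pipeline.

```python
import numpy as np

# Assumed setup for illustration (not FinGPT-HPC's method): replace a dense
# d_in x d_out weight matrix with two low-rank factors A (d_in x r) and
# B (r x d_out). Parameters drop from d_in*d_out to r*(d_in + d_out).
d_in, d_out, r = 1024, 1024, 64

rng = np.random.default_rng(0)
W = rng.standard_normal((d_in, d_out)).astype(np.float32)
U, s, Vt = np.linalg.svd(W, full_matrices=False)   # best rank-r approximation
A = U[:, :r] * s[:r]                               # d_in x r
B = Vt[:r, :]                                      # r x d_out

x = rng.standard_normal((1, d_in)).astype(np.float32)
y_dense = x @ W                                    # one big matmul
y_lowrank = (x @ A) @ B                            # two skinny matmuls

ratio = (d_in * d_out) / (r * (d_in + d_out))
print(f"compression ratio: {ratio:.2f}x")          # 8.00x at these sizes
```

A random W has no low-rank structure, so the rank-r approximation error here is large; real weight matrices with genuine low-rank structure are approximated far more faithfully.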
- OneBit: Towards Extremely Low-bit Large Language Models [66.29839811207617]
This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs.
Experiments indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes.
arXiv Detail & Related papers (2024-02-17T14:26:57Z)
- BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
AQLM is the first scheme that is optimal in terms of accuracy-vs-model-size when compressing to less than 3 bits per parameter.
We provide fast GPU and CPU implementations of AQLM for token generation (a sketch of additive quantization follows this entry).
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
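Additive quantization represents each small group of weights as the sum of codewords, one drawn from each of several codebooks. The sketch below encodes greedily against the residual, which is a simplification; AQLM itself optimizes codes and codebooks jointly, and the sizes here are illustrative.

```python
import numpy as np

def encode_group(g, codebooks):
    """Greedily pick one codeword per codebook whose running sum best
    matches the group (a residual-quantization simplification; AQLM
    learns codes and codebooks jointly instead)."""
    residual, codes = g.copy(), []
    for C in codebooks:                              # C: (K, group_dim)
        idx = int(np.argmin(((residual - C) ** 2).sum(axis=1)))
        codes.append(idx)
        residual = residual - C[idx]
    return codes

def decode_group(codes, codebooks):
    """A group decodes to the SUM of its selected codewords."""
    return sum(C[i] for C, i in zip(codebooks, codes))

rng = np.random.default_rng(0)
group_dim, K, M = 8, 256, 2                          # illustrative sizes only
codebooks = [rng.standard_normal((K, group_dim)).astype(np.float32)
             for _ in range(M)]
g = rng.standard_normal(group_dim).astype(np.float32)
codes = encode_group(g, codebooks)                   # 2 x 8 bits / 8 weights = 2 bits per weight
print(codes, np.abs(g - decode_group(codes, codebooks)).mean())
```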
- SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM [13.035063417593534]
Large language models (LLMs) have shown remarkable capabilities in various tasks.
Currently, 4-bit post-training quantization (PTQ) has achieved some success in LLMs.
We propose SmoothQuant+, an accurate and efficient 4-bit weight-only PTQ method.
arXiv Detail & Related papers (2023-12-06T11:10:55Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format (a non-uniform quantization sketch follows this entry).
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
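The non-uniform quantization of idea (i) can be sketched with plain 1-D k-means over the weight values, so the 2^bits codebook entries adapt to the weight distribution instead of being evenly spaced; SqueezeLLM additionally weights the clustering by second-order sensitivity, which this sketch omits.

```python
import numpy as np

def nonuniform_quantize(w, bits=3, iters=20):
    """Non-uniform quantization via 1-D k-means: the 2**bits codebook
    values adapt to the weight distribution rather than being evenly
    spaced. (Sensitivity weighting from second-order information, as in
    SqueezeLLM, is omitted from this sketch.)"""
    k = 2 ** bits
    # initialize centroids at evenly spaced quantiles of the weights
    centroids = np.quantile(w, np.linspace(0, 1, k))
    for _ in range(iters):
        assign = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            if np.any(assign == c):
                centroids[c] = w[assign == c].mean()
    assign = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
    return centroids[assign], assign, centroids  # dequantized w, codes, codebook

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)   # roughly Gaussian weights
wq, codes, codebook = nonuniform_quantize(w, bits=3)
print(f"codebook: {np.round(codebook, 2)}")
print(f"mean abs error: {np.abs(w - wq).mean():.4f}")
```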