SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight
Compression
- URL: http://arxiv.org/abs/2306.03078v1
- Date: Mon, 5 Jun 2023 17:53:28 GMT
- Title: SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight
Compression
- Authors: Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev,
Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan
Alistarh
- Abstract summary: We introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique.
SpQR achieves relative accuracy losses of less than 1% in perplexity for highly-accurate LLaMA and Falcon LLMs.
This makes it possible to run a 33B parameter LLM on a single 24 GB consumer GPU with no performance degradation and a 15% speedup.
- Score: 76.73007709690306
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in large language model (LLM) pretraining have led to
high-quality LLMs with impressive abilities. By compressing such LLMs via
quantization to 3-4 bits per parameter, they can fit into memory-limited
devices such as laptops and mobile phones, enabling personalized use. However,
quantization down to 3-4 bits per parameter usually leads to moderate-to-high
accuracy losses, especially for smaller models in the 1-10B parameter range,
which are well-suited for edge deployments. To address this accuracy issue, we
introduce the Sparse-Quantized Representation (SpQR), a new compressed format
and quantization technique which enables for the first time near-lossless
compression of LLMs across model scales, while reaching similar compression
levels to previous methods. SpQR works by identifying and isolating outlier
weights, which cause particularly-large quantization errors, and storing them
in higher precision, while compressing all other weights to 3-4 bits, and
achieves relative accuracy losses of less than 1% in perplexity for
highly-accurate LLaMA and Falcon LLMs. This makes it possible to run a 33B
parameter LLM on a single 24 GB consumer GPU with no performance
degradation and a 15% speedup, thus making powerful LLMs available to
consumers without any downsides. SpQR comes with efficient algorithms both
for encoding weights into its format and for decoding them at runtime.
Specifically, we provide an efficient GPU inference algorithm for SpQR which
yields faster inference than 16-bit baselines at similar accuracy, while
enabling memory compression gains of more than 4x.
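A minimal sketch, assuming a simple magnitude rule, of the dense-plus-sparse layout the abstract describes: a small fraction of outlier weights is kept exactly in a sparse (index, value) structure while the remaining weights are group-quantized to 3 bits. The function names, the 1% outlier fraction, and the group size of 16 below are illustrative assumptions; SpQR itself selects outliers by their quantization error and also compresses the per-group scales and zero points, which this sketch omits. As a rough check on the memory claim, at about 4 effective bits per weight a 33B-parameter model occupies roughly 33e9 * 4 / 8 ≈ 16.5 GB, which fits within a 24 GB GPU.

import numpy as np

def quantize_group(w, bits=3):
    """Asymmetric round-to-nearest quantization of one 1-D group of weights."""
    qmax = 2 ** bits - 1
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = np.clip(np.round((w - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale, lo

def encode_dense_plus_sparse(W, bits=3, group_size=16, outlier_frac=0.01):
    """Split W into exact sparse outliers plus a group-quantized low-bit base."""
    W = np.asarray(W, dtype=np.float32)
    # Assumed selection rule: the largest-magnitude ~1% of weights become outliers.
    thresh = np.quantile(np.abs(W), 1.0 - outlier_frac)
    mask = np.abs(W) >= thresh
    outliers = (np.argwhere(mask), W[mask])           # CSR-like (indices, values)
    base = np.where(mask, 0.0, W)                     # outliers removed from base
    groups = [quantize_group(g, bits)                 # per-group (codes, scale, zero)
              for g in base.reshape(-1, group_size)]  # assumes divisibility by group_size
    return outliers, groups, W.shape

def decode_dense_plus_sparse(outliers, groups, shape, group_size=16):
    """Reconstruct an approximate dense matrix from the compressed layout."""
    flat = np.concatenate([q.astype(np.float32) * s + z for q, s, z in groups])
    W_hat = flat.reshape(shape)
    idx, val = outliers
    W_hat[idx[:, 0], idx[:, 1]] = val                 # restore outliers exactly
    return W_hat

W = np.random.randn(64, 64).astype(np.float32)
packed = encode_dense_plus_sparse(W)
print("mean abs error:", float(np.abs(W - decode_dense_plus_sparse(*packed)).mean()))

Storing outliers separately keeps each group's value range tight, which is what lets low-bit round-to-nearest stay accurate for the remaining weights.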
Related papers
- COMET: Towards Partical W4A4KV4 LLMs Serving [37.30529940231099]
Quantization is a compression technology to reduce the overhead of serving large language models (LLMs) on terminal devices and in cloud data centers.
We propose a novel mixed-precision quantization algorithm (FMPQ) that compresses most activations to 4 bits with negligible accuracy loss.
We integrate the optimized W4Ax kernel into our inference framework, COMET, and provide efficient management to support popular LLMs.
arXiv Detail & Related papers (2024-10-16T02:16:53Z)
- SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators [25.229269944770678]
Large Language Models (LLMs) have transformed natural language processing, but face challenges in widespread deployment due to their high runtime cost.
We introduce SeedLM, a novel post-training compression method that uses seeds of pseudo-random generators to encode and compress model weights.
arXiv Detail & Related papers (2024-10-14T16:57:23Z)
- SLiM: One-shot Quantized Sparse Plus Low-rank Approximation of LLMs [2.7624021966289605]
Large Language Models (LLMs) have revolutionized natural language understanding and generation tasks.
LLMs suffer from high memory consumption and slow inference times due to their large parameter sizes.
This paper introduces SLiM, a novel approach for compressing LLMs using a one-shot Quantized Sparse Plus Low-rank Approximation.
arXiv Detail & Related papers (2024-10-12T18:36:07Z)
- MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision AutoRegressive LINear kernels.
It shows that batch sizes of up to 16-32 can be supported with close to the maximum ($4\times$) quantization speedup.
MARLIN accomplishes this via a combination of techniques such as asynchronous memory access, complex task scheduling, and pipelining.
arXiv Detail & Related papers (2024-08-21T16:10:41Z)
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [67.67135738642547]
Post-training quantization (PTQ) is a powerful compression technique that has been investigated in large language models (LLMs).
Existing PTQ methods are not ideal in terms of accuracy and efficiency, especially at bit-widths below 4.
This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
- BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves, for the first time, high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
- SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM [13.035063417593534]
Large language models (LLMs) have shown remarkable capabilities in various tasks.
Currently, 4-bit post-training quantization (PTQ) has achieved some success in LLMs.
We propose SmoothQuant+, an accurate and efficient 4-bit weight-only PTQ.
arXiv Detail & Related papers (2023-12-06T11:10:55Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
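As an illustration of the sensitivity-based non-uniform quantization idea in the SqueezeLLM entry above, the sketch below clusters weight values with an importance-weighted 1-D k-means. The random "sensitivities" stand in for diagonal second-order information, and the helper weighted_kmeans_1d is an assumption of this sketch, not SqueezeLLM's actual implementation.

import numpy as np

def weighted_kmeans_1d(values, importance, k=8, iters=25, seed=0):
    """Importance-weighted 1-D k-means: returns a k-entry codebook and codes."""
    rng = np.random.default_rng(seed)
    centroids = rng.choice(values, size=k, replace=False)
    for _ in range(iters):
        codes = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            members = codes == j
            if members.any():
                # Weighted mean pulls centroids toward high-sensitivity weights.
                centroids[j] = np.average(values[members], weights=importance[members])
    codes = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return centroids, codes

# Toy data: weights plus a random stand-in for per-weight second-order sensitivity.
w = np.random.randn(4096).astype(np.float32)
sens = np.square(np.random.randn(4096)).astype(np.float32) + 1e-6
codebook, codes = weighted_kmeans_1d(w, sens, k=8)   # 8 entries = a 3-bit codebook
w_hat = codebook[codes]                              # dequantized weights
print("sensitivity-weighted MSE:", float(np.average((w - w_hat) ** 2, weights=sens)))

Weighting the centroid updates biases the codebook toward the values of sensitive weights, which is the core of the non-uniform assignment the entry describes; the Dense-and-Sparse decomposition it also mentions is analogous to the outlier split sketched after the SpQR abstract above.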