QuIP: 2-Bit Quantization of Large Language Models With Guarantees
- URL: http://arxiv.org/abs/2307.13304v2
- Date: Mon, 15 Jan 2024 21:54:28 GMT
- Title: QuIP: 2-Bit Quantization of Large Language Models With Guarantees
- Authors: Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, Christopher De Sa
- Abstract summary: This work studies post-training parameter quantization in large language models (LLMs)
We introduce quantization with incoherence processing (QuIP), a new method based on the insight that quantization benefits from $\textit{incoherent}$ weight and Hessian matrices.
- Score: 44.212441764241
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work studies post-training parameter quantization in large language
models (LLMs). We introduce quantization with incoherence processing (QuIP), a
new method based on the insight that quantization benefits from
$\textit{incoherent}$ weight and Hessian matrices, i.e., from the weights being
even in magnitude and the directions in which it is important to round them
accurately being unaligned with the coordinate axes. QuIP consists of two
steps: (1) an adaptive rounding procedure minimizing a quadratic proxy
objective; (2) efficient pre- and post-processing that ensures weight and
Hessian incoherence via multiplication by random orthogonal matrices. We
complement QuIP with the first theoretical analysis for an LLM-scale
quantization algorithm, and show that our theory also applies to an existing
method, OPTQ. Empirically, we find that our incoherence preprocessing improves
several existing quantization algorithms and yields the first LLM quantization
methods that produce viable results using only two bits per weight. Our code
can be found at https://github.com/Cornell-RelaxML/QuIP.
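The two steps above can be illustrated with a minimal NumPy sketch. This is not the released implementation at the repository linked above; the nearest-rounding quantizer below is a stand-in for the paper's adaptive rounding procedure, and the proxy Hessian is built from random calibration inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n):
    """Random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

def proxy_loss(W_hat, W, H):
    """Quadratic proxy objective tr((W_hat - W) H (W_hat - W)^T)."""
    D = W_hat - W
    return float(np.trace(D @ H @ D.T))

def quantize_nearest(W, bits=2):
    """Per-row uniform grid with 2**bits levels; a stand-in for the
    paper's adaptive rounding step."""
    lo = W.min(axis=1, keepdims=True)
    hi = W.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2 ** bits - 1) + 1e-12
    return np.clip(np.round((W - lo) / scale), 0, 2 ** bits - 1) * scale + lo

# Toy layer: W is (out, in); H is a PSD proxy Hessian from calibration inputs X.
m, n, n_samples = 32, 64, 256
W = rng.standard_normal((m, n))
X = rng.standard_normal((n_samples, n))
H = X.T @ X / n_samples

# Step (2): incoherence processing by random orthogonal matrices U, V.
U, V = random_orthogonal(m), random_orthogonal(n)
W_t, H_t = U @ W @ V.T, V @ H @ V.T

# Step (1): round in the incoherent basis; map back only to evaluate the loss
# (in practice the transforms are folded into the surrounding computation).
W_hat = U.T @ quantize_nearest(W_t, bits=2) @ V

print("proxy loss, direct rounding    :", proxy_loss(quantize_nearest(W, 2), W, H))
print("proxy loss, incoherent rounding:", proxy_loss(W_hat, W, H))
```

Because U and V are orthogonal, the proxy loss measured in the transformed basis equals the loss of the mapped-back weights, which is what makes rounding in the incoherent basis legitimate.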
Related papers
- Pyramid Vector Quantization for LLMs [8.779688608449902]
This work applies Pyramid Vector Quantization (PVQ) to large language models.
PVQ uses a fixed integer lattice on the sphere by projecting points onto the 1-sphere, which allows for efficient encoding and decoding without requiring an explicit codebook in memory.
We achieve state-of-the-art quantization performance, with a Pareto-optimal trade-off between performance and bits per weight and per activation, compared to existing methods.
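A minimal sketch of the encoding idea described above, where codewords are integer points y with sum(|y|) = K and therefore never need to be stored explicitly; the pulse count K and the greedy pulse distribution are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def pvq_encode(x, K):
    """Map x to an integer point on the L1 pyramid sum(|y|) == K."""
    a = np.abs(x)
    s = a.sum()
    if s == 0:
        y = np.zeros(len(x), dtype=int)
        y[0] = K
        return y
    y = np.floor(a * K / s).astype(int)
    # Hand out the remaining pulses to the largest fractional residuals.
    residual = a * K / s - y
    deficit = K - int(y.sum())
    for i in np.argsort(-residual)[:deficit]:
        y[i] += 1
    return np.where(x < 0, -1, 1) * y

def pvq_decode(y, gain):
    """Scale the lattice point back to a vector with L2 norm `gain`."""
    return gain * y / (np.linalg.norm(y) + 1e-12)

w = np.random.default_rng(1).standard_normal(16)      # one weight group
code = pvq_encode(w, K=8)                              # small integers, sum(|code|) == 8
w_hat = pvq_decode(code, gain=np.linalg.norm(w))       # dequantized weights
```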
arXiv Detail & Related papers (2024-10-22T11:57:32Z)
- GPTQT: Quantize Large Language Models Twice to Push the Efficiency [1.3149617027696827]
This paper introduces a new post-training quantization method, GPTQT, to reduce memory usage and enhance processing speed.
Practice has shown that directly minimizing the quantization error of weights is ineffective and leads to overfitting.
GPTQT employs a progressive two-step approach: it first quantizes the weights to a relatively high bit-width with linear quantization, then converts the resulting integer weights to a lower-bit binary coding.
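A minimal sketch of this two-step structure, using a 4-bit linear quantizer followed by greedy residual binarization; the intermediate bit-width and the coding scheme are illustrative assumptions, not GPTQT's exact design.

```python
import numpy as np

def linear_quantize(w, bits=4):
    """Step 1: uniform (linear) quantization to signed integers."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax + 1e-12
    return np.clip(np.round(w / scale), -qmax, qmax), scale

def to_binary_codes(q, n_codes=2):
    """Step 2 (illustrative): greedily re-express the integers as a sum of
    {-1, +1} codes with least-squares scales."""
    r = q.astype(float)
    codes, alphas = [], []
    for _ in range(n_codes):
        b = np.where(r >= 0, 1.0, -1.0)
        a = np.abs(r).mean()          # least-squares scale for a {-1,+1} code
        codes.append(b)
        alphas.append(a)
        r -= a * b
    return codes, alphas

w = np.random.default_rng(2).standard_normal(128)
q, scale = linear_quantize(w, bits=4)
codes, alphas = to_binary_codes(q, n_codes=2)
w_hat = scale * sum(a * b for a, b in zip(alphas, codes))   # dequantized weights
```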
arXiv Detail & Related papers (2024-07-03T08:08:01Z)
- QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks [37.66253003964376]
Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing their weights to low-precision.
We introduce QuIP#, a weight-only PTQ method that achieves state-of-the-art results in extreme compression regimes.
Our experiments show that QuIP# outperforms existing PTQ methods, enables new behaviors in PTQ scaling, and supports fast inference.
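The Hadamard incoherence named in the title swaps dense random orthogonal matrices for a randomized Walsh-Hadamard transform, which is cheaper to apply; a minimal sketch of that general construction follows (not QuIP#'s full pipeline, which also involves lattice codebooks).

```python
import numpy as np

def fwht(x):
    """Iterative Walsh-Hadamard transform of a length-2**k vector,
    normalized so the overall map is orthogonal."""
    y = np.asarray(x, dtype=float).copy()
    n, h = len(y), 1
    while h < n:
        for i in range(0, n, 2 * h):
            a, b = y[i:i + h].copy(), y[i + h:i + 2 * h].copy()
            y[i:i + h], y[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return y / np.sqrt(n)

rng = np.random.default_rng(0)
n = 64                                     # length must be a power of two
signs = rng.choice([-1.0, 1.0], size=n)    # random sign flips (orthogonal diagonal)
w = rng.standard_normal(n)
w_incoherent = fwht(signs * w)             # rotate a weight row into an incoherent basis
print(np.allclose(np.linalg.norm(w), np.linalg.norm(w_incoherent)))  # norm preserved
```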
arXiv Detail & Related papers (2024-02-06T20:52:12Z)
- BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves for the first time high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach from information retrieval to the compression of LLM weights.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
- End-to-end resource analysis for quantum interior point methods and portfolio optimization [63.4863637315163]
We provide a complete quantum circuit-level description of the algorithm from problem input to problem output.
We report the number of logical qubits and the quantity/depth of non-Clifford T-gates needed to run the algorithm.
arXiv Detail & Related papers (2022-11-22T18:54:48Z)
- Quantum Sparse Coding [5.130440339897477]
We develop a quantum-inspired algorithm for sparse coding.
The emergence of quantum computers and Ising machines can potentially lead to more accurate estimations.
We conduct numerical experiments with simulated data on LightSolver's quantum-inspired digital platform.
arXiv Detail & Related papers (2022-09-08T13:00:30Z)
- Gradient-descent quantum process tomography by learning Kraus operators [63.69764116066747]
We perform quantum process tomography (QPT) for both discrete- and continuous-variable quantum systems.
We use a constrained gradient-descent (GD) approach on the so-called Stiefel manifold during optimization to obtain the Kraus operators.
The GD-QPT matches the performance of both compressed-sensing (CS) and projected least-squares (PLS) QPT in benchmarks with two-qubit random processes.
arXiv Detail & Related papers (2022-08-01T12:48:48Z)
- Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition [67.95996816744251]
State-of-the-art language models (LMs), represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers, are becoming increasingly complex and expensive for practical applications.
Current quantization methods are based on uniform precision and fail to account for the varying sensitivity of different parts of LMs to quantization errors.
Novel mixed precision neural network LM quantization methods are proposed in this paper.
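A generic illustration of the mixed-precision idea, allocating per-layer bit-widths from a sensitivity score under an average bit budget; the sensitivity measure and the greedy allocation below are assumptions for illustration, not the specific methods proposed in the paper.

```python
import numpy as np

def allocate_bits(sensitivities, choices=(2, 4, 8), avg_budget=3.5):
    """Start every layer at the lowest bit-width, then upgrade the most
    sensitive layers while the average bit-width stays within budget."""
    bits = np.full(len(sensitivities), choices[0], dtype=float)
    order = np.argsort(-np.asarray(sensitivities))   # most sensitive first
    for level in choices[1:]:
        for i in order:
            if bits[i] < level:
                trial = bits.copy()
                trial[i] = level
                if trial.mean() <= avg_budget:
                    bits = trial
    return bits.astype(int)

# e.g. sensitivity = loss increase when only that layer is quantized to the lowest width
print(allocate_bits([0.9, 0.1, 0.5, 0.05]))   # -> [4 4 4 2]
```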
arXiv Detail & Related papers (2021-11-29T12:24:02Z)
- Least squares binary quantization of neural networks [19.818087225770967]
We focus on binary quantization, in which values are mapped to -1 and 1.
Inspired by the Pareto-optimality of 2-bit versus 1-bit quantization, we introduce a novel 2-bit quantization with provably least-squares error.
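A minimal sketch of the 1-bit case described above: approximating w by alpha * b with b in {-1, +1}^n, the least-squares optimum is b = sign(w) and alpha = mean(|w|); the grid search is only a numerical sanity check.

```python
import numpy as np

w = np.random.default_rng(3).standard_normal(1000)

# Closed-form least-squares binary quantization: b = sign(w), alpha = mean(|w|).
b = np.where(w >= 0, 1.0, -1.0)
alpha = np.abs(w).mean()

# Quick numerical check: a scale found by grid search lands next to alpha.
grid = np.linspace(0.1, 2.0, 400)
best = grid[np.argmin([np.sum((w - a * b) ** 2) for a in grid])]
print(alpha, best)
```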
arXiv Detail & Related papers (2020-01-09T00:01:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.