QuIP: 2-Bit Quantization of Large Language Models With Guarantees
- URL: http://arxiv.org/abs/2307.13304v2
- Date: Mon, 15 Jan 2024 21:54:28 GMT
- Title: QuIP: 2-Bit Quantization of Large Language Models With Guarantees
- Authors: Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, Christopher De Sa
- Abstract summary: This work studies post-training parameter quantization in large language models (LLMs)
We introduce quantization with incoherence processing (QuIP), a new method based on the insight that quantization benefits from $\textit{incoherent}$ weight and Hessian matrices.
- Score: 44.212441764241
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work studies post-training parameter quantization in large language
models (LLMs). We introduce quantization with incoherence processing (QuIP), a
new method based on the insight that quantization benefits from
$\textit{incoherent}$ weight and Hessian matrices, i.e., from the weights being
even in magnitude and the directions in which it is important to round them
accurately being unaligned with the coordinate axes. QuIP consists of two
steps: (1) an adaptive rounding procedure minimizing a quadratic proxy
objective; (2) efficient pre- and post-processing that ensures weight and
Hessian incoherence via multiplication by random orthogonal matrices. We
complement QuIP with the first theoretical analysis for an LLM-scale
quantization algorithm, and show that our theory also applies to an existing
method, OPTQ. Empirically, we find that our incoherence preprocessing improves
several existing quantization algorithms and yields the first LLM quantization
methods that produce viable results using only two bits per weight. Our code
can be found at https://github.com/Cornell-RelaxML/QuIP.
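The two steps can be pictured with a small numerical sketch (Python/NumPy, not the authors' implementation): the layer sizes and calibration data are toy values, plain nearest rounding stands in for QuIP's adaptive rounding, and dense Haar-random orthogonal matrices are used where follow-up work such as QuIP# prefers fast Hadamard-based transforms. On i.i.d. Gaussian toy weights the two proxy losses can be close; the gains reported in the paper come from real LLM weights and Hessians combined with the adaptive rounding step.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n):
    """Haar-random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

def uniform_round(w, bits=2):
    """Plain nearest rounding to a symmetric 2^bits grid -- a stand-in for
    QuIP's adaptive rounding, which minimizes the proxy objective below."""
    scale = np.abs(w).max() / (2 ** (bits - 1)) + 1e-12
    q = np.clip(np.round(w / scale), -2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return q * scale

# Toy linear layer: weights W (out x in) and proxy Hessian H = E[x x^T]
# estimated from calibration inputs X.
m, n, n_samples = 8, 16, 256
W = rng.standard_normal((m, n))
X = rng.standard_normal((n_samples, n))
H = X.T @ X / n_samples

def proxy_loss(W_hat):
    """Quadratic proxy objective tr((W_hat - W) H (W_hat - W)^T)."""
    D = W_hat - W
    return float(np.trace(D @ H @ D.T))

# Step (2): incoherence processing -- rotate W and H with random orthogonal
# matrices so that no single coordinate direction dominates.
U, V = random_orthogonal(m), random_orthogonal(n)
W_inc, H_inc = U @ W @ V.T, V @ H @ V.T

# Step (1): round in the incoherent basis. Roughly, the quantized W_inc is what
# gets stored, with the rotations applied as pre-/post-processing at inference.
W_hat_incoherent = U.T @ uniform_round(W_inc) @ V
W_hat_naive = uniform_round(W)

print("proxy loss, direct rounding     :", proxy_loss(W_hat_naive))
print("proxy loss, incoherent rounding :", proxy_loss(W_hat_incoherent))
```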
Related papers
- PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models [64.84734437930362]
Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub 2-bit) quantization.
We propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61-bit for the first time.
Experiments indicate our PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization.
arXiv Detail & Related papers (2025-02-18T08:04:58Z)
- Pyramid Vector Quantization for LLMs [8.779688608449902]
This work proposes Pyramid Vector Quantization (PVQ) for large language models.
PVQ uses a fixed integer lattice on the sphere by projecting points onto the 1-sphere, which allows for efficient encoding and decoding without requiring an explicit codebook in memory.
We achieve state-of-the-art quantization performance, with a Pareto-optimal trade-off between performance and bits per weight and per activation, compared to existing methods.
arXiv Detail & Related papers (2024-10-22T11:57:32Z)
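A minimal sketch of the pyramid-codebook idea described in the entry above, under simplifying assumptions: it only finds an integer point with a fixed L1 norm K for a single vector and stores the gain as a plain float; the paper's exact projection/search rule and the enumeration of lattice points into compact bit indices (which makes the codebook implicit) are not reproduced.

```python
import numpy as np

def pvq_quantize(x, K):
    """Find an integer point q with sum(|q|) == K close to the direction of x:
    project onto the scaled L1 sphere, truncate toward zero, then hand the
    remaining L1 mass to the coordinates with the largest fractional parts."""
    v = K * x / (np.abs(x).sum() + 1e-12)
    q = (np.floor(np.abs(v)) * np.sign(v)).astype(int)
    frac = np.abs(v) - np.abs(q)
    deficit = K - int(np.abs(q).sum())        # unit steps of L1 mass still to assign
    for i in np.argsort(-frac)[:deficit]:
        q[i] += 1 if v[i] >= 0 else -1
    return q

def pvq_dequantize(q, gain):
    """Reconstruct: scale the unit-L1 'shape' q / ||q||_1 by the stored gain."""
    return gain * q / np.abs(q).sum()

x = np.random.default_rng(1).standard_normal(16)
q = pvq_quantize(x, K=32)
x_hat = pvq_dequantize(q, gain=np.abs(x).sum())
print("L1 norm of code:", np.abs(q).sum())    # exactly K
print("relative error :", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```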
- GPTQT: Quantize Large Language Models Twice to Push the Efficiency [1.3149617027696827]
This paper introduces a new post-training quantization method, GPTQT, to reduce memory usage and enhance processing speed.
Practice has shown that minimizing the quantization error of the weights alone is ineffective and leads to overfitting.
GPTQT employs a progressive two-step approach: it first quantizes the weights with linear quantization at a relatively high bit width, then converts the resulting integer weights to a lower-bit binary coding.
arXiv Detail & Related papers (2024-07-03T08:08:01Z)
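A hedged sketch of the two-step structure described in the GPTQT entry above, not GPTQT itself: the second step here is a simple greedy residual binary coding, and the paper's exact coding scheme and calibration-based error compensation are not reproduced.

```python
import numpy as np

def linear_quantize(w, bits):
    """Step 1: plain symmetric linear quantization to a 2^bits integer grid."""
    scale = np.abs(w).max() / (2 ** (bits - 1)) + 1e-12
    q = np.clip(np.round(w / scale), -2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return q.astype(int), scale

def binary_code(w, num_bits):
    """Step 2 (illustrative): greedy residual binary coding,
    w ~= sum_k alpha_k * b_k with b_k in {-1, +1}."""
    residual = w.astype(float).copy()
    alphas, codes = [], []
    for _ in range(num_bits):
        b = np.where(residual >= 0, 1.0, -1.0)
        a = np.abs(residual).mean()   # least-squares scale for b = sign(residual)
        alphas.append(a)
        codes.append(b)
        residual -= a * b
    return np.array(alphas), np.stack(codes)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))

q_int, scale = linear_quantize(W, bits=4)               # high-bit intermediate
alphas, codes = binary_code(q_int * scale, num_bits=2)  # re-encode at 2 bits
W_hat = np.tensordot(alphas, codes, axes=1)
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```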
- QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks [37.66253003964376]
Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing their weights to low precision.
We introduce QuIP#, a weight-only PTQ method that achieves state-of-the-art results in extreme compression regimes.
Our experiments show that QuIP# outperforms existing PTQ methods, enables new behaviors in PTQ scaling, and supports fast inference.
arXiv Detail & Related papers (2024-02-06T20:52:12Z)
- BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves, for the first time, high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
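A minimal sketch of the additive-quantization representation the AQLM entry above builds on, not AQLM itself: the codebooks here are random rather than learned, the group size and codebook count are toy values, and a greedy residual search replaces the beam search and calibration-based optimization used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, codebook_size = 8, 2, 16      # group size, number of codebooks, entries per codebook

# Toy codebooks; in practice these (and the assignments) are learned.
codebooks = rng.standard_normal((M, codebook_size, d))

def aq_decode(indices):
    """Reconstruct a weight group as the SUM of one codeword per codebook."""
    return sum(codebooks[m, indices[m]] for m in range(M))

def aq_encode_greedy(w):
    """Greedy residual encoding -- a simple stand-in for beam search."""
    residual, indices = w.copy(), []
    for m in range(M):
        errs = ((residual[None, :] - codebooks[m]) ** 2).sum(axis=1)
        j = int(np.argmin(errs))
        indices.append(j)
        residual = residual - codebooks[m, j]
    return indices

w = rng.standard_normal(d)
idx = aq_encode_greedy(w)
print("indices:", idx, "reconstruction error:", np.linalg.norm(w - aq_decode(idx)))
```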
- End-to-end resource analysis for quantum interior point methods and portfolio optimization [63.4863637315163]
We provide a complete quantum circuit-level description of the algorithm from problem input to problem output.
We report the number of logical qubits and the quantity/depth of non-Clifford T-gates needed to run the algorithm.
arXiv Detail & Related papers (2022-11-22T18:54:48Z)
- Quantum Sparse Coding [5.130440339897477]
We develop a quantum-inspired algorithm for sparse coding.
The emergence of quantum computers and Ising machines can potentially lead to more accurate estimations.
We conduct numerical experiments with simulated data on LightSolver's quantum-inspired digital platform.
arXiv Detail & Related papers (2022-09-08T13:00:30Z)
- Gradient-descent quantum process tomography by learning Kraus operators [63.69764116066747]
We perform quantum process tomography (QPT) for both discrete- and continuous-variable quantum systems.
We use a constrained gradient-descent (GD) approach on the so-called Stiefel manifold during optimization to obtain the Kraus operators.
The GD-QPT matches the performance of both compressed-sensing (CS) and projected least-squares (PLS) QPT in benchmarks with two-qubit random processes.
arXiv Detail & Related papers (2022-08-01T12:48:48Z)
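A minimal sketch of the constraint handling described in the entry above, under simplifying assumptions: stacking the Kraus operators into one tall matrix turns the trace-preservation condition into a Stiefel-manifold constraint, and a QR retraction restores it after each gradient step. The actual GD-QPT loss and gradient are not reproduced; a random placeholder gradient is used here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 2, 4                          # system dimension, number of Kraus operators

def qr_retract(V):
    """Map a tall complex matrix back onto the Stiefel manifold (V^dagger V = I)."""
    q, _ = np.linalg.qr(V)
    return q

# Stacking Kraus operators K_1..K_r vertically gives V of shape (r*d, d);
# V^dagger V = I is exactly the condition sum_i K_i^dagger K_i = I.
V = qr_retract(rng.standard_normal((r * d, d)) + 1j * rng.standard_normal((r * d, d)))

def apply_channel(V, rho):
    """Apply the channel rho -> sum_i K_i rho K_i^dagger."""
    return sum(K @ rho @ K.conj().T for K in V.reshape(r, d, d))

# One constrained gradient-descent step: Euclidean step, then QR retraction.
grad = rng.standard_normal(V.shape) + 1j * rng.standard_normal(V.shape)  # placeholder
V = qr_retract(V - 0.1 * grad)

rho = np.array([[1.0, 0.0], [0.0, 0.0]], dtype=complex)
print("trace preserved    :", np.isclose(np.trace(apply_channel(V, rho)).real, 1.0))
print("on Stiefel manifold:", np.allclose(V.conj().T @ V, np.eye(d)))
```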
- Least squares binary quantization of neural networks [19.818087225770967]
We focus on binary quantization, in which values are mapped to -1 and 1.
Inspired by the Pareto-optimality of 2-bit versus 1-bit quantization, we introduce a novel 2-bit quantization with provably least-squares error.
arXiv Detail & Related papers (2020-01-09T00:01:14Z)
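A minimal sketch of the least-squares binary idea from the entry above: the 1-bit scale alpha = mean(|w|) is the closed-form least-squares solution for b = sign(w). The 2-bit pass below is only a greedy residual illustration and is not the provably least-squares 2-bit construction of the paper.

```python
import numpy as np

def binary_lsq_1bit(w):
    """Closed-form least-squares 1-bit quantization: w ~= alpha * sign(w),
    with alpha = mean(|w|) minimizing ||w - alpha*b||^2 over b in {-1, +1}."""
    b = np.where(w >= 0, 1.0, -1.0)
    alpha = np.abs(w).mean()
    return alpha, b

def binary_2bit_greedy(w):
    """Greedy residual 2-bit variant: quantize, then quantize the residual.
    Illustrative only; not guaranteed to match the paper's optimal solution."""
    a1, b1 = binary_lsq_1bit(w)
    a2, b2 = binary_lsq_1bit(w - a1 * b1)
    return (a1, b1), (a2, b2)

w = np.random.default_rng(0).standard_normal(1024)
(a1, b1), (a2, b2) = binary_2bit_greedy(w)
w_hat = a1 * b1 + a2 * b2
print("relative L2 error:", np.linalg.norm(w - w_hat) / np.linalg.norm(w))
```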
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.