QuIP: 2-Bit Quantization of Large Language Models With Guarantees
- URL: http://arxiv.org/abs/2307.13304v2
- Date: Mon, 15 Jan 2024 21:54:28 GMT
- Title: QuIP: 2-Bit Quantization of Large Language Models With Guarantees
- Authors: Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, Christopher De Sa
- Abstract summary: This work studies post-training parameter quantization in large language models (LLMs)
We introduce quantization with incoherence processing (QuIP), a new method based on the insight that quantization benefits from $\textit{incoherent}$ weight and Hessian matrices.
- Score: 44.212441764241
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work studies post-training parameter quantization in large language
models (LLMs). We introduce quantization with incoherence processing (QuIP), a
new method based on the insight that quantization benefits from
$\textit{incoherent}$ weight and Hessian matrices, i.e., from the weights being
even in magnitude and the directions in which it is important to round them
accurately being unaligned with the coordinate axes. QuIP consists of two
steps: (1) an adaptive rounding procedure minimizing a quadratic proxy
objective; (2) efficient pre- and post-processing that ensures weight and
Hessian incoherence via multiplication by random orthogonal matrices. We
complement QuIP with the first theoretical analysis for an LLM-scale
quantization algorithm, and show that our theory also applies to an existing
method, OPTQ. Empirically, we find that our incoherence preprocessing improves
several existing quantization algorithms and yields the first LLM quantization
methods that produce viable results using only two bits per weight. Our code
can be found at https://github.com/Cornell-RelaxML/QuIP.
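The two steps above can be illustrated with a minimal NumPy sketch. This is not the released implementation at the repository linked above; the nearest-rounding quantizer below is a stand-in for the paper's adaptive rounding procedure, and the proxy Hessian is built from random calibration inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n):
    """Random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

def proxy_loss(W_hat, W, H):
    """Quadratic proxy objective tr((W_hat - W) H (W_hat - W)^T)."""
    D = W_hat - W
    return float(np.trace(D @ H @ D.T))

def quantize_nearest(W, bits=2):
    """Per-row uniform grid with 2**bits levels; a stand-in for the
    paper's adaptive rounding step."""
    lo = W.min(axis=1, keepdims=True)
    hi = W.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2 ** bits - 1) + 1e-12
    return np.clip(np.round((W - lo) / scale), 0, 2 ** bits - 1) * scale + lo

# Toy layer: W is (out, in); H is a PSD proxy Hessian from calibration inputs X.
m, n, n_samples = 32, 64, 256
W = rng.standard_normal((m, n))
X = rng.standard_normal((n_samples, n))
H = X.T @ X / n_samples

# Step (2): incoherence processing by random orthogonal matrices U, V.
U, V = random_orthogonal(m), random_orthogonal(n)
W_t, H_t = U @ W @ V.T, V @ H @ V.T

# Step (1): round in the incoherent basis; map back only to evaluate the loss
# (in practice the transforms are folded into the surrounding computation).
W_hat = U.T @ quantize_nearest(W_t, bits=2) @ V

print("proxy loss, direct rounding    :", proxy_loss(quantize_nearest(W, 2), W, H))
print("proxy loss, incoherent rounding:", proxy_loss(W_hat, W, H))
```

Because U and V are orthogonal, the proxy loss measured in the transformed basis equals the loss of the mapped-back weights, which is what makes rounding in the incoherent basis legitimate.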
Related papers
- Pyramid Vector Quantization for LLMs [8.779688608449902]
This work applies Pyramid Vector Quantization (PVQ) to large language models.
PVQ uses a fixed integer lattice on the sphere by projecting points onto the 1-sphere, which allows for efficient encoding and decoding without requiring an explicit codebook in memory.
We achieve state-of-the-art quantization performance, with a Pareto-optimal trade-off between performance and bits per weight and per activation, compared to existing methods.
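A minimal sketch of the encoding idea described above, where codewords are integer points y with sum(|y|) = K and therefore never need to be stored explicitly; the pulse count K and the greedy pulse distribution are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def pvq_encode(x, K):
    """Map x to an integer point on the L1 pyramid sum(|y|) == K."""
    a = np.abs(x)
    s = a.sum()
    if s == 0:
        y = np.zeros(len(x), dtype=int)
        y[0] = K
        return y
    y = np.floor(a * K / s).astype(int)
    # Hand out the remaining pulses to the largest fractional residuals.
    residual = a * K / s - y
    deficit = K - int(y.sum())
    for i in np.argsort(-residual)[:deficit]:
        y[i] += 1
    return np.where(x < 0, -1, 1) * y

def pvq_decode(y, gain):
    """Scale the lattice point back to a vector with L2 norm `gain`."""
    return gain * y / (np.linalg.norm(y) + 1e-12)

w = np.random.default_rng(1).standard_normal(16)      # one weight group
code = pvq_encode(w, K=8)                              # small integers, sum(|code|) == 8
w_hat = pvq_decode(code, gain=np.linalg.norm(w))       # dequantized weights
```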
arXiv Detail & Related papers (2024-10-22T11:57:32Z)
- GPTQT: Quantize Large Language Models Twice to Push the Efficiency [1.3149617027696827]
This paper introduces a new post-training quantization method, GPTQT, to reduce memory usage and enhance processing speed.
Practice has shown that directly minimizing the quantization error of weights is ineffective and leads to overfitting.
GPTQT employs a progressive two-step approach: it first quantizes the weights to a relatively high bit-width with linear quantization, then converts the resulting integer weights to a lower-bit binary coding.
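A minimal sketch of this two-step structure, using a 4-bit linear quantizer followed by greedy residual binarization; the intermediate bit-width and the coding scheme are illustrative assumptions, not GPTQT's exact design.

```python
import numpy as np

def linear_quantize(w, bits=4):
    """Step 1: uniform (linear) quantization to signed integers."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax + 1e-12
    return np.clip(np.round(w / scale), -qmax, qmax), scale

def to_binary_codes(q, n_codes=2):
    """Step 2 (illustrative): greedily re-express the integers as a sum of
    {-1, +1} codes with least-squares scales."""
    r = q.astype(float)
    codes, alphas = [], []
    for _ in range(n_codes):
        b = np.where(r >= 0, 1.0, -1.0)
        a = np.abs(r).mean()          # least-squares scale for a {-1,+1} code
        codes.append(b)
        alphas.append(a)
        r -= a * b
    return codes, alphas

w = np.random.default_rng(2).standard_normal(128)
q, scale = linear_quantize(w, bits=4)
codes, alphas = to_binary_codes(q, n_codes=2)
w_hat = scale * sum(a * b for a, b in zip(alphas, codes))   # dequantized weights
```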
arXiv Detail & Related papers (2024-07-03T08:08:01Z)
- QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks [37.66253003964376]
Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing their weights to low-precision.
We introduce QuIP#, a weight-only PTQ method that achieves state-of-the-art results in extreme compression regimes.
Our experiments show that QuIP# outperforms existing PTQ methods, enables new behaviors in PTQ scaling, and supports fast inference.
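The Hadamard incoherence named in the title swaps dense random orthogonal matrices for a randomized Walsh-Hadamard transform, which is cheaper to apply; a minimal sketch of that general construction follows (not QuIP#'s full pipeline, which also involves lattice codebooks).

```python
import numpy as np

def fwht(x):
    """Iterative Walsh-Hadamard transform of a length-2**k vector,
    normalized so the overall map is orthogonal."""
    y = np.asarray(x, dtype=float).copy()
    n, h = len(y), 1
    while h < n:
        for i in range(0, n, 2 * h):
            a, b = y[i:i + h].copy(), y[i + h:i + 2 * h].copy()
            y[i:i + h], y[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return y / np.sqrt(n)

rng = np.random.default_rng(0)
n = 64                                     # length must be a power of two
signs = rng.choice([-1.0, 1.0], size=n)    # random sign flips (orthogonal diagonal)
w = rng.standard_normal(n)
w_incoherent = fwht(signs * w)             # rotate a weight row into an incoherent basis
print(np.allclose(np.linalg.norm(w), np.linalg.norm(w_incoherent)))  # norm preserved
```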
arXiv Detail & Related papers (2024-02-06T20:52:12Z)
- BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves for the first time high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach from information retrieval to the compression of LLM weights.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
- End-to-end resource analysis for quantum interior point methods and portfolio optimization [63.4863637315163]
We provide a complete quantum circuit-level description of the algorithm from problem input to problem output.
We report the number of logical qubits and the quantity/depth of non-Clifford T-gates needed to run the algorithm.
arXiv Detail & Related papers (2022-11-22T18:54:48Z)
- Quantum Sparse Coding [5.130440339897477]
We develop a quantum-inspired algorithm for sparse coding.
The emergence of quantum computers and Ising machines can potentially lead to more accurate estimations.
We conduct numerical experiments with simulated data on LightSolver's quantum-inspired digital platform.
arXiv Detail & Related papers (2022-09-08T13:00:30Z)
- Gradient-descent quantum process tomography by learning Kraus operators [63.69764116066747]
We perform quantum process tomography (QPT) for both discrete- and continuous-variable quantum systems.
We use a constrained gradient-descent (GD) approach on the so-called Stiefel manifold during optimization to obtain the Kraus operators.
The GD-QPT matches the performance of both compressed-sensing (CS) and projected least-squares (PLS) QPT in benchmarks with two-qubit random processes.
arXiv Detail & Related papers (2022-08-01T12:48:48Z)
- Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition [67.95996816744251]
State-of-the-art language models (LMs), represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers, are becoming increasingly complex and expensive for practical applications.
Current quantization methods are based on uniform precision and fail to account for the varying sensitivity of different parts of LMs to quantization errors.
Novel mixed precision neural network LM quantization methods are proposed in this paper.
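A generic illustration of the mixed-precision idea, allocating per-layer bit-widths from a sensitivity score under an average bit budget; the sensitivity measure and the greedy allocation below are assumptions for illustration, not the specific methods proposed in the paper.

```python
import numpy as np

def allocate_bits(sensitivities, choices=(2, 4, 8), avg_budget=3.5):
    """Start every layer at the lowest bit-width, then upgrade the most
    sensitive layers while the average bit-width stays within budget."""
    bits = np.full(len(sensitivities), choices[0], dtype=float)
    order = np.argsort(-np.asarray(sensitivities))   # most sensitive first
    for level in choices[1:]:
        for i in order:
            if bits[i] < level:
                trial = bits.copy()
                trial[i] = level
                if trial.mean() <= avg_budget:
                    bits = trial
    return bits.astype(int)

# e.g. sensitivity = loss increase when only that layer is quantized to the lowest width
print(allocate_bits([0.9, 0.1, 0.5, 0.05]))   # -> [4 4 4 2]
```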
arXiv Detail & Related papers (2021-11-29T12:24:02Z)
- Least squares binary quantization of neural networks [19.818087225770967]
We focus on binary quantization, in which values are mapped to -1 and 1.
Inspired by the Pareto-optimality of 2-bit versus 1-bit quantization, we introduce a novel 2-bit quantization with provably least-squares error.
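A minimal sketch of the 1-bit case described above: approximating w by alpha * b with b in {-1, +1}^n, the least-squares optimum is b = sign(w) and alpha = mean(|w|); the grid search is only a numerical sanity check.

```python
import numpy as np

w = np.random.default_rng(3).standard_normal(1000)

# Closed-form least-squares binary quantization: b = sign(w), alpha = mean(|w|).
b = np.where(w >= 0, 1.0, -1.0)
alpha = np.abs(w).mean()

# Quick numerical check: a scale found by grid search lands next to alpha.
grid = np.linspace(0.1, 2.0, 400)
best = grid[np.argmin([np.sum((w - a * b) ** 2) for a in grid])]
print(alpha, best)
```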
arXiv Detail & Related papers (2020-01-09T00:01:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.