The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm
- URL: http://arxiv.org/abs/2507.18553v2
- Date: Wed, 01 Oct 2025 13:25:12 GMT
- Title: The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm
- Authors: Jiale Chen, Yalda Shabanzadeh, Elvir Crnčević, Torsten Hoefler, Dan Alistarh
- Abstract summary: We show that GPTQ is mathematically identical to Babai's nearest plane algorithm for the classical closest vector problem. We design post-training quantization methods that avoid clipping, and outperform the original GPTQ.
- Score: 46.167267094420644
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Quantizing the weights of large language models (LLMs) from 16-bit to lower bitwidth is the de facto approach to deploy massive transformers onto more affordable accelerators. While GPTQ emerged as one of the standard methods for one-shot post-training quantization at LLM scale, its inner workings are described as a sequence of ad-hoc algebraic updates that obscure geometric meaning or worst-case guarantees. In this work, we show that, when executed back-to-front (from the last to first dimension) for a linear layer, GPTQ is mathematically identical to Babai's nearest plane algorithm for the classical closest vector problem (CVP) on a lattice defined by the Hessian matrix of the layer's inputs. This equivalence is based on a sophisticated mathematical argument, and has two analytical consequences: first, the GPTQ error propagation step gains an intuitive geometric interpretation; second, GPTQ inherits the error upper bound of Babai's algorithm under the assumption that no weights are clipped. Leveraging this bound, we design post-training quantization methods that avoid clipping, and outperform the original GPTQ. In addition, we provide efficient GPU inference kernels for the resulting representation. Taken together, these results place GPTQ on a firm theoretical footing and open the door to importing decades of progress in lattice algorithms towards the design of future quantization algorithms for billion-parameter models.
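The back-to-front greedy structure described in the abstract can be sketched directly. The following is a minimal NumPy implementation of Babai's nearest plane algorithm itself (function and variable names are our own, and this is an illustrative sketch, not the paper's GPTQ formulation): the target vector is projected onto successive Gram-Schmidt directions, rounded, and the rounding error is propagated to the remaining coordinates.

```python
import numpy as np

def babai_nearest_plane(B, t):
    """Babai's nearest plane algorithm: greedily round the target t onto
    successive hyperplanes of the lattice spanned by the columns of B,
    working from the last basis vector back to the first.

    Returns integer coefficients c such that B @ c approximates the
    closest lattice vector to t. The per-step subtraction is the same
    error-propagation structure the paper identifies in GPTQ's
    column-by-column quantization."""
    n = B.shape[1]
    # QR factorization gives the Gram-Schmidt directions as columns of Q,
    # with R[i, i] the (signed) length of the i-th Gram-Schmidt vector.
    Q, R = np.linalg.qr(B)
    residual = np.array(t, dtype=float)
    c = np.zeros(n, dtype=int)
    for i in reversed(range(n)):
        # Round the residual's coefficient along the i-th Gram-Schmidt
        # direction to the nearest integer (the "nearest plane" choice).
        c[i] = round(residual @ Q[:, i] / R[i, i])
        # Error propagation: subtract the chosen lattice component so the
        # earlier coordinates can compensate for this rounding error.
        residual -= c[i] * B[:, i]
    return c
```

On the basis with columns (2, 0) and (1, 2) and target (2.1, 3.9), the algorithm picks coefficients (0, 2), i.e. the lattice point (2, 4), leaving a residual of (0.1, -0.1) well within Babai's worst-case bound.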
Related papers
- Revisiting Weighted Strategy for Non-stationary Parametric Bandits and MDPs [56.246783503873225]
This paper revisits the weighted strategy for non-stationary parametric bandits. We propose a simpler weight-based algorithm that is as efficient as window/restart-based algorithms. Our framework can be used to improve regret bounds of other parametric bandits.
arXiv Detail & Related papers (2026-01-03T04:50:21Z) - The Lattice Geometry of Neural Network Quantization -- A Short Equivalence Proof of GPTQ and Babai's algorithm [0.0]
We show how data-driven quantization of a linear unit in a neural network corresponds to solving the closest vector problem for a certain lattice generated by input data. We prove that the GPTQ algorithm is equivalent to Babai's well-known nearest-plane algorithm.
arXiv Detail & Related papers (2025-08-01T21:20:58Z) - GPTAQ: Efficient Finetuning-Free Quantization for Asymmetric Calibration [21.474315621757594]
We introduce GPTAQ, a novel finetuning-free quantization method for compressing large-scale transformer architectures. Unlike the previous GPTQ method, which independently calibrates each layer, we always match the quantized layer's output to the exact output in the full-precision model. GPTAQ is easy to implement, requiring only 20 more lines of code than GPTQ while improving its performance under low-bit quantization.
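The asymmetric-calibration idea above can be illustrated with a small sketch (the function names and the least-squares step are our own simplification, not the GPTAQ procedure itself): the calibration target is always the full-precision model's output, even though the quantized model's inputs have already drifted from it.

```python
import numpy as np

def asymmetric_calibration_target(W_full, X_full, X_quant):
    """Illustrative sketch of asymmetric calibration: instead of fitting
    the layer against its own (drifted) quantized-model inputs X_quant
    with the layer-local target, solve for weights that map X_quant to
    the *full-precision* model's exact output W_full @ X_full, and only
    then quantize. The target output always comes from the
    full-precision model."""
    Y_target = W_full @ X_full                # exact full-precision output
    # Least-squares fit so that W_adjusted @ X_quant approximates Y_target.
    W_adjusted = Y_target @ np.linalg.pinv(X_quant)
    return W_adjusted

def round_to_grid(W, step=0.05):
    """Naive uniform rounding, standing in for a real low-bit quantizer."""
    return np.round(W / step) * step
```

When the quantized model's inputs have not drifted at all (X_quant equals X_full), the fit simply recovers the original weights; the correction only kicks in when earlier quantized layers have perturbed the inputs.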
arXiv Detail & Related papers (2025-04-03T15:30:43Z) - Graphical Stabilizer Decompositions for Multi-Control Toffoli Gate Dense Quantum Circuits [0.0]
We study concepts in quantum computing using graphical languages, specifically the ZX-calculus. The first major focus is on the decomposition of non-stabilizer states created from star edges. The second major focus is on weighting algorithms, applied to the special class of multi-control Toffoli gate dense quantum circuits.
arXiv Detail & Related papers (2025-03-05T16:07:21Z) - Halving the Cost of Quantum Algorithms with Randomization [0.138120109831448]
Quantum signal processing (QSP) provides a systematic framework for implementing a transformation of a linear operator. Recent works have developed randomized compiling, a technique that promotes a unitary gate to a quantum channel. Our algorithm implements a probabilistic mixture of randomized circuits, strategically chosen so that the average evolution converges to that of a target function, with an error quadratically smaller than that of an equivalent individual circuit.
arXiv Detail & Related papers (2024-09-05T17:56:51Z) - CDQuant: Greedy Coordinate Descent for Accurate LLM Quantization [8.92409376299856]
Large language models (LLMs) have recently demonstrated remarkable performance across diverse language tasks.
Quantization has emerged as a key technique for enabling the compression of large models with minimal impact on performance.
The GPTQ algorithm, a post-training quantization (PTQ) method, has proven highly effective for compressing LLMs.
We introduce CDQuant, a simple and scalable alternative to GPTQ with improved performance.
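A toy sketch of coordinate descent quantization conveys the idea behind this line of work (this is our own illustrative version, not the CDQuant algorithm, whose coordinate selection and scaling differ): start from nearest-grid rounding, then cyclically revisit each coordinate and reset it to the grid value that minimizes a Hessian-weighted proxy loss, holding the other coordinates fixed.

```python
import numpy as np

def quad_loss(q, w, H):
    """Quadratic proxy loss (q - w)^T H (q - w) between the quantized
    vector q and the original weights w."""
    d = q - w
    return d @ H @ d

def cd_quantize(w, H, grid, n_sweeps=5):
    """Toy coordinate descent quantizer: initialize with nearest-grid
    rounding, then sweep over coordinates, exhaustively trying each grid
    value per coordinate and keeping the one with the lowest loss."""
    q = np.array([min(grid, key=lambda g: abs(g - wi)) for wi in w],
                 dtype=float)
    for _ in range(n_sweeps):
        for i in range(len(w)):
            losses = []
            for g in grid:
                q_try = q.copy()
                q_try[i] = g           # try this grid value at coordinate i
                losses.append((quad_loss(q_try, w, H), g))
            q[i] = min(losses)[1]      # keep the loss-minimizing value
    return q
```

With an identity Hessian the sweeps reduce to independent nearest-grid rounding; an informative (correlated) Hessian is what lets later coordinates trade off against earlier rounding errors.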
arXiv Detail & Related papers (2024-06-25T13:29:14Z) - Quantization of Large Language Models with an Overdetermined Basis [73.79368761182998]
We introduce an algorithm for data quantization based on the principles of Kashin representation.
Our findings demonstrate that Kashin Quantization achieves competitive or superior quality in model performance.
arXiv Detail & Related papers (2024-04-15T12:38:46Z) - Rethinking Model Selection and Decoding for Keyphrase Generation with
Pre-trained Sequence-to-Sequence Models [76.52997424694767]
Keyphrase Generation (KPG) is a longstanding task in NLP with widespread applications.
Seq2seq pre-trained language models (PLMs) have ushered in a transformative era for KPG, yielding promising performance improvements.
This paper undertakes a systematic analysis of the influence of model selection and decoding strategies on PLM-based KPG.
arXiv Detail & Related papers (2023-10-10T07:34:45Z) - QuIP: 2-Bit Quantization of Large Language Models With Guarantees [44.212441764241]
This work studies post-training parameter quantization in large language models (LLMs).
We introduce quantization with incoherence processing (QuIP), a new method based on the insight that quantization benefits from incoherent weight and Hessian matrices.
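The incoherence-processing idea can be sketched as follows (an illustrative simplification in the spirit of QuIP; the paper uses structured random orthogonal matrices rather than the dense Gaussian-QR ones below, and the function names are our own): conjugating the weight and Hessian matrices by random orthogonal matrices spreads out large entries, making both matrices friendlier to per-entry rounding, and the transformation is undone after quantization.

```python
import numpy as np

def incoherence_preprocess(W, H, seed=0):
    """Transform W and H by random orthogonal matrices U, V so that
    their entries are more evenly spread ('incoherent'). Quantization
    is then performed on W_tilde and inverted afterwards."""
    rng = np.random.default_rng(seed)
    # Random orthogonal matrices via QR of Gaussian matrices.
    U, _ = np.linalg.qr(rng.standard_normal((W.shape[0],) * 2))
    V, _ = np.linalg.qr(rng.standard_normal((W.shape[1],) * 2))
    W_tilde = U @ W @ V          # transformed weights to be quantized
    # Transform H consistently so the proxy loss
    # tr(dW @ H @ dW.T) is preserved under the change of basis.
    H_tilde = V.T @ H @ V
    return W_tilde, H_tilde, U, V

def incoherence_undo(W_tilde_q, U, V):
    """Map (quantized) transformed weights back: W ~ U.T @ W_q @ V.T."""
    return U.T @ W_tilde_q @ V.T
```

Without any quantization in between, the round trip recovers the original weights exactly, since U and V are orthogonal; the quantization error is what the incoherent basis is designed to shrink.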
arXiv Detail & Related papers (2023-07-25T07:44:06Z) - Global optimization of MPS in quantum-inspired numerical analysis [0.0]
The study focuses on the search for the lowest eigenstates of a Hamiltonian equation.
Five algorithms are introduced: imaginary-time evolution, steepest gradient descent, an improved descent, an implicitly restarted Arnoldi method, and density matrix renormalization group (DMRG) optimization.
arXiv Detail & Related papers (2023-03-16T16:03:51Z) - Orthogonal Polynomials Approximation Algorithm (OPAA): a functional
analytic approach to estimating probability densities [0.0]
We present the new Orthogonal Polynomials Approximation Algorithm (OPAA).
OPAA estimates probability distributions using a functional analytic approach.
It can be applied to estimating the normalizing weight of the posterior.
arXiv Detail & Related papers (2022-11-16T00:51:00Z) - Finite-Time Error Bounds for Greedy-GQ [20.51105692499517]
We establish finite-time error bounds for the Greedy-GQ algorithm.
Our analysis provides guidance on choosing step-sizes to achieve faster convergence.
arXiv Detail & Related papers (2022-09-06T15:04:57Z) - Unfolding Projection-free SDP Relaxation of Binary Graph Classifier via
GDPA Linearization [59.87663954467815]
Algorithm unfolding creates an interpretable and parsimonious neural network architecture by implementing each iteration of a model-based algorithm as a neural layer.
In this paper, leveraging a recent linear algebraic theorem called Gershgorin disc perfect alignment (GDPA), we unroll a projection-free algorithm for the semi-definite programming relaxation (SDR) of a binary graph classifier.
Experimental results show that our unrolled network outperformed pure model-based graph classifiers, and achieved comparable performance to pure data-driven networks but using far fewer parameters.
arXiv Detail & Related papers (2021-09-10T07:01:15Z) - A Parallelizable Lattice Rescoring Strategy with Neural Language Models [62.20538383769179]
A posterior-based lattice expansion algorithm is proposed for efficient lattice rescoring with neural language models (LMs) for automatic speech recognition.
Experiments on the Switchboard dataset show that the proposed rescoring strategy obtains comparable recognition performance.
The parallel rescoring method offers more flexibility by simplifying the integration of PyTorch-trained neural LMs for lattice rescoring with Kaldi.
arXiv Detail & Related papers (2021-03-08T21:23:12Z) - Accelerated Message Passing for Entropy-Regularized MAP Inference [89.15658822319928]
Maximum a posteriori (MAP) inference in discrete-valued random fields is a fundamental problem in machine learning.
Due to the difficulty of this problem, linear programming (LP) relaxations are commonly used to derive specialized message passing algorithms.
We present randomized methods for accelerating these algorithms by leveraging techniques that underlie classical accelerated gradient.
arXiv Detail & Related papers (2020-07-01T18:43:32Z) - Non-asymptotic Convergence of Adam-type Reinforcement Learning
Algorithms under Markovian Sampling [56.394284787780364]
This paper provides the first theoretical convergence analysis for two fundamental RL algorithms of policy gradient (PG) and temporal difference (TD) learning.
Under general nonlinear function approximation, PG-AMSGrad with a constant stepsize converges to a neighborhood of a stationary point at the rate of $\mathcal{O}(\log T/\sqrt{T})$.
Under linear function approximation, TD-AMSGrad with a constant stepsize converges to a neighborhood of the global optimum at the rate of $\mathcal{O}(\log T/\sqrt{T})$.
arXiv Detail & Related papers (2020-02-15T00:26:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.