The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm
- URL: http://arxiv.org/abs/2507.18553v1
- Date: Thu, 24 Jul 2025 16:22:18 GMT
- Title: The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm
- Authors: Jiale Chen, Torsten Hoefler, Dan Alistarh
- Abstract summary: GPTQ emerged as one of the standard methods for one-shot post-training quantization at LLM scale. We show that GPTQ is mathematically identical to Babai's nearest plane algorithm for the classical closest vector problem.
- Score: 52.89358421626026
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Quantizing the weights of large language models (LLMs) from 16-bit to a lower bitwidth is the de facto approach for deploying massive transformers onto more affordable accelerators. GPTQ emerged as one of the standard methods for one-shot post-training quantization at LLM scale. Yet, its inner workings are described as a sequence of ad-hoc algebraic updates that obscure any geometric meaning or worst-case guarantees. In this work, we show that, when executed back-to-front (from the last to the first dimension) for a linear layer, GPTQ is mathematically identical to Babai's nearest plane algorithm for the classical closest vector problem (CVP) on a lattice defined by the Hessian matrix of the layer's inputs. This equivalence is based on a sophisticated mathematical argument, and has two analytical consequences: (i) the GPTQ error propagation step gains an intuitive geometric interpretation; (ii) GPTQ inherits the error upper bound of Babai's algorithm under the no-clipping condition. Taken together, these results place GPTQ on firm theoretical footing and open the door to importing decades of progress in lattice algorithms into the design of future quantization algorithms for billion-parameter models.
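To make the equivalence concrete, below is a minimal sketch of Babai's nearest plane algorithm on the lattice the abstract describes, using the Cholesky factor of the input Hessian as the lattice basis. It assumes an unscaled integer grid with no clipping and a positive-definite Hessian; the helper name `babai_nearest_plane` and the toy calibration data are illustrative, not code from the paper.

```python
import numpy as np

def babai_nearest_plane(w: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Quantize a weight row `w` on the lattice whose Gram matrix is the
    input Hessian H = X X^T, via Babai's nearest plane algorithm.

    Sketch only: assumes an integer grid (no scale, no clipping) and H
    positive definite, matching the abstract's no-clipping condition.
    """
    L = np.linalg.cholesky(H)    # H = L L^T, with L lower triangular
    B = L.T                      # upper-triangular lattice basis (columns b_i)
    t = B @ w                    # target vector: w mapped into the lattice frame
    q = np.zeros_like(w)
    for i in reversed(range(w.shape[0])):  # back to front, as in the equivalence
        q[i] = np.round(t[i] / B[i, i])    # pick the nearest lattice hyperplane
        t -= q[i] * B[:, i]      # residual spills onto dimensions 0..i-1:
                                 # the geometric reading of GPTQ's error propagation
    return q

# Toy usage: Hessian from random calibration inputs (d features, m samples).
rng = np.random.default_rng(0)
d, m = 8, 256
X = rng.standard_normal((d, m))
H = X @ X.T
w = rng.standard_normal(d)
q = babai_nearest_plane(w, H)
print(q, (w - q) @ H @ (w - q))  # integer weights and the proxy loss they incur
```

Since $B^\top B = H$, the proxy loss printed above equals $\|B(w-q)\|^2$, and Babai's classical analysis bounds it by $\frac{1}{4}\sum_i B_{ii}^2$; this is the worst-case guarantee the abstract says GPTQ inherits when no clipping occurs.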
Related papers
- The Lattice Geometry of Neural Network Quantization -- A Short Equivalence Proof of GPTQ and Babai's algorithm [0.0]
We show how data-driven quantization of a linear unit in a neural network corresponds to solving the closest vector problem for a certain lattice generated by input data. We prove that the GPTQ algorithm is equivalent to Babai's well-known nearest-plane algorithm.
arXiv Detail & Related papers (2025-08-01T21:20:58Z)
- Halving the Cost of Quantum Algorithms with Randomization [0.138120109831448]
Quantum signal processing (QSP) provides a systematic framework for implementing a transformation of a linear operator. Recent works have developed randomized compiling, a technique that promotes a unitary gate to a quantum channel. Our algorithm implements a probabilistic mixture of randomized polynomials, strategically chosen so that the average evolution converges to that of a target function, with an error quadratically smaller than that of an equivalent individual polynomial.
arXiv Detail & Related papers (2024-09-05T17:56:51Z)
- CDQuant: Greedy Coordinate Descent for Accurate LLM Quantization [8.92409376299856]
Large language models (LLMs) have recently demonstrated remarkable performance across diverse language tasks.
Quantization has emerged as a key technique for enabling the compression of large models with minimal impact on performance.
The GPTQ algorithm, a post-training quantization (PTQ) method, has proven highly effective for compressing LLMs.
We introduce CDQuant, a simple and scalable alternative to GPTQ with improved performance.
arXiv Detail & Related papers (2024-06-25T13:29:14Z)
- Rethinking Model Selection and Decoding for Keyphrase Generation with Pre-trained Sequence-to-Sequence Models [76.52997424694767]
Keyphrase Generation (KPG) is a longstanding task in NLP with widespread applications.
Seq2seq pre-trained language models (PLMs) have ushered in a transformative era for KPG, yielding promising performance improvements.
This paper undertakes a systematic analysis of the influence of model selection and decoding strategies on PLM-based KPG.
arXiv Detail & Related papers (2023-10-10T07:34:45Z)
- QuIP: 2-Bit Quantization of Large Language Models With Guarantees [44.212441764241]
This work studies post-training parameter quantization in large language models (LLMs).
We introduce quantization with incoherence processing (QuIP), a new method based on the insight that quantization benefits from $\textit{incoherent}$ weight and Hessian matrices.
arXiv Detail & Related papers (2023-07-25T07:44:06Z)
- Global optimization of MPS in quantum-inspired numerical analysis [0.0]
The study focuses on the search for the lowest eigenstates of a Hamiltonian.
Five algorithms are introduced: imaginary-time evolution, steepest gradient descent, an improved descent, an implicitly restarted Arnoldi method, and density matrix renormalization group (DMRG) optimization.
arXiv Detail & Related papers (2023-03-16T16:03:51Z)
- Orthogonal Polynomials Approximation Algorithm (OPAA): a functional analytic approach to estimating probability densities [0.0]
We present the new Orthogonal Polynomials Approximation Algorithm (OPAA).
OPAA estimates probability distributions using a functional analytic approach.
It can be applied to estimating the normalizing weight of the posterior.
arXiv Detail & Related papers (2022-11-16T00:51:00Z)
- Finite-Time Error Bounds for Greedy-GQ [20.51105692499517]
We show that the Greedy-GQ algorithm converges with a finite-time error bound.
Our analysis provides guidance on choosing step-sizes for faster convergence.
arXiv Detail & Related papers (2022-09-06T15:04:57Z)
- Unfolding Projection-free SDP Relaxation of Binary Graph Classifier via GDPA Linearization [59.87663954467815]
Algorithm unfolding creates an interpretable and parsimonious neural network architecture by implementing each iteration of a model-based algorithm as a neural layer.
In this paper, leveraging a recent linear algebraic theorem called Gershgorin disc perfect alignment (GDPA), we unroll a projection-free algorithm for the semi-definite programming relaxation (SDR) of a binary graph classifier.
Experimental results show that our unrolled network outperformed pure model-based graph classifiers, and achieved comparable performance to pure data-driven networks but using far fewer parameters.
arXiv Detail & Related papers (2021-09-10T07:01:15Z)
- A Parallelizable Lattice Rescoring Strategy with Neural Language Models [62.20538383769179]
A posterior-based lattice expansion algorithm is proposed for efficient lattice rescoring with neural language models (LMs) for automatic speech recognition.
Experiments on the Switchboard dataset show that the proposed rescoring strategy obtains comparable recognition performance.
The parallel rescoring method offers more flexibility by simplifying the integration of PyTorch-trained neural LMs for lattice rescoring with Kaldi.
arXiv Detail & Related papers (2021-03-08T21:23:12Z)
- Accelerated Message Passing for Entropy-Regularized MAP Inference [89.15658822319928]
Maximum a posteriori (MAP) inference in discrete-valued random fields is a fundamental problem in machine learning.
Due to the difficulty of this problem, linear programming (LP) relaxations are commonly used to derive specialized message passing algorithms.
We present randomized methods for accelerating these algorithms by leveraging techniques that underlie classical accelerated gradient methods.
arXiv Detail & Related papers (2020-07-01T18:43:32Z)
- Non-asymptotic Convergence of Adam-type Reinforcement Learning Algorithms under Markovian Sampling [56.394284787780364]
This paper provides the first theoretical convergence analysis for two fundamental RL algorithms of policy gradient (PG) and temporal difference (TD) learning.
Under general nonlinear function approximation, PG-AMSGrad with a constant stepsize converges to a neighborhood of a stationary point at the rate of $\mathcal{O}(\log T/\sqrt{T})$.
Under linear function approximation, TD-AMSGrad with a constant stepsize converges to a neighborhood of the global optimum at the rate of $\mathcal{O}(\log T/\sqrt{T})$.
arXiv Detail & Related papers (2020-02-15T00:26:49Z)