Related papers: High-Rate Quantized Matrix Multiplication: Theory and Practice

URL: http://arxiv.org/abs/2601.17187v1
Date: Fri, 23 Jan 2026 21:32:44 GMT
Title: High-Rate Quantized Matrix Multiplication: Theory and Practice
Authors: Or Ordentlich, Yury Polyanskiy
Abstract summary: This work investigates the problem of quantized matrix multiplication (MatMul). We consider two settings: 1) Generic MatMul, where both matrices must be quantized (weight+activation quantization); and 2) weight-only quantization, where the second matrix is only known through covariance matrix $Σ_X$ of its columns.
Score: 29.75700570685703
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This work investigates the problem of quantized matrix multiplication (MatMul), which has become crucial for the efficient deployment of large language models (LLMs). We consider two settings: 1) Generic MatMul, where both matrices must be quantized (weight+activation quantization); and 2) weight-only quantization, where the second matrix is only known through covariance matrix $Σ_X$ of its columns. For each setting, we first review the fundamental information-theoretic tradeoff between quantization rate and distortion (high-rate theory), and then analyze the performance of several popular quantization schemes, comparing them to these fundamental limits. Specifically, we discuss rate loss (compared to information-theoretic optima) of absmax INT and floating-point (FP) quantization, for which we also derive remarkably accurate heuristic approximations. Weight-only quantization is related to the problem of weighted mean squared error (WMSE) source coding, whose classical (reverse) waterfilling solution dictates how one should distribute rate between coordinates of the vector. We show how waterfilling can be used to improve practical LLM quantization algorithms (GPTQ), which at present allocate rate equally. This new scheme (termed "WaterSIC") only uses scalar INT quantizers, but its high-rate performance is basis-free (it depends only on the determinant of $Σ_X$ and, thus, unlike existing schemes, is immune to applying random rotations) and is within a multiplicative factor of $\frac{2πe}{12}$ (or 0.25 bit/entry) of the information-theoretic distortion limit (!). GPTQ's performance is affected by the choice of basis, but for a random rotation and actual $Σ_X$ from Llama-3-8B we find GPTQ to be within 0.1 bit (depending on the layer type) of WaterSIC, suggesting that GPTQ with random rotation is also near optimal (for high-rate quantization).
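
The reverse waterfilling rule mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes the textbook setting of independent Gaussian components with known variances and a total MSE budget; it is not the paper's WaterSIC algorithm, and the names reverse_waterfill and shaping_gap_bits are hypothetical. The second function only checks the arithmetic behind the quoted 0.25 bit/entry figure, reading the distortion factor $\frac{2πe}{12}$ as a rate of half its base-2 logarithm.

```python
import math


def reverse_waterfill(variances, total_distortion, iters=200):
    """Textbook reverse waterfilling for independent Gaussian components.

    Illustrative sketch only (assumed setting, not the paper's WaterSIC scheme).
    Returns (water_level, per-component distortions, per-component rates in bits).
    """
    if min(variances) <= 0:
        raise ValueError("variances must be positive")
    if not 0 < total_distortion <= sum(variances):
        raise ValueError("total_distortion must lie in (0, sum(variances)]")

    # Bisect on the water level theta so that sum_i min(theta, var_i)
    # matches the distortion budget.
    lo, hi = 0.0, max(variances)
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        if sum(min(theta, v) for v in variances) < total_distortion:
            lo = theta
        else:
            hi = theta
    theta = 0.5 * (lo + hi)

    # Components with variance below the water level receive zero rate.
    distortions = [min(theta, v) for v in variances]
    rates = [max(0.0, 0.5 * math.log2(v / theta)) for v in variances]
    return theta, distortions, rates


def shaping_gap_bits():
    """Rate equivalent of a distortion factor of 2*pi*e/12, in bits per entry."""
    return 0.5 * math.log2(2 * math.pi * math.e / 12)


if __name__ == "__main__":
    theta, dist, rates = reverse_waterfill([4.0, 1.0, 0.25], 0.75)
    print("water level:", theta)
    print("rates (bits):", rates)
    print("shaping gap (bits):", shaping_gap_bits())
```

For variances 4, 1, and 0.25 with a total distortion of 0.75, the water level comes out at about 0.25 and the rates at about 2, 1, and 0 bits: the low-variance coordinate gets no rate, in contrast to an equal allocation across coordinates. The gap function evaluates to roughly 0.2546 bits.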

Related papers

WaterSIC: information-theoretically (near) optimal linear layer quantization [24.236435814099707]
It is shown that a popular GPTQ algorithm may have an arbitrarily large gap to the IT limit. A novel algorithm, termed "WaterSIC", is proposed and is shown to be within a rate gap of 0.255 bits to the IT limit.
arXiv Detail & Related papers (2026-03-05T08:50:58Z)
Block encoding of sparse matrices with a periodic diagonal structure [67.45502291821956]
We provide an explicit quantum circuit for block encoding a sparse matrix with a periodic diagonal structure. Various applications for the presented methodology are discussed in the context of solving differential problems.
arXiv Detail & Related papers (2026-02-11T07:24:33Z)
Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression [57.54335545892155]
We introduce a Grouped Lattice Vector Quantization (GLVQ) framework that assigns each group of weights a customized lattice codebook. Our approach achieves a better trade-off between model size and accuracy compared to existing post-training quantization baselines.
arXiv Detail & Related papers (2025-10-23T20:19:48Z)
Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment [15.802372921412198]
We study weight-only post-training quantization (PTQ), which quantizes the weights of a large language model (LLM) without retraining, using little or no calibration data. We first derive the information-theoretically optimal bit allocation for Gaussianized weights under given bit budgets, revealing that fine-grained fractional-bit quantizers approaching the Gaussian distortion-rate bound are essential to achieve near-optimal quantization performance.
arXiv Detail & Related papers (2025-09-24T15:10:44Z)
TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate [13.14434628836727]
Vector quantization aims to quantize high-dimensional Euclidean vectors while minimizing distortion in their geometric structure. We propose TurboQuant to address both mean-squared error (MSE) and inner product distortion. Our data-oblivious algorithms, suitable for online applications, achieve near-optimal distortion rates.
arXiv Detail & Related papers (2025-04-28T15:05:35Z)
Matrix encoding method in variational quantum singular value decomposition [49.494595696663524]
We propose the variational quantum singular value decomposition based on encoding the elements of the considered $N\times N$ matrix into the state of a quantum system of appropriate dimension. Controlled measurement is involved to avoid small success in ancilla measurement.
arXiv Detail & Related papers (2025-03-19T07:01:38Z)
PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models [64.84734437930362]
Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub 2-bit) quantization. We propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61-bit for the first time. Experiments indicate our PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization.
arXiv Detail & Related papers (2025-02-18T08:04:58Z)
SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [63.118592279833656]
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). We propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths group-wise. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
QuIP: 2-Bit Quantization of Large Language Models With Guarantees [44.212441764241]
This work studies post-training parameter quantization in large language models (LLMs). We introduce quantization with incoherence processing (QuIP), a new method based on the insight that quantization benefits from incoherent weight and Hessian matrices.
arXiv Detail & Related papers (2023-07-25T07:44:06Z)
Randomized semi-quantum matrix processing [0.0]
We present a hybrid quantum-classical framework for simulating generic matrix functions. The method is based on randomization over the Chebyshev approximation of the target function. We prove advantages on average depths, including quadratic speed-ups on costly parameters.
arXiv Detail & Related papers (2023-07-21T18:00:28Z)
Huber-energy measure quantization [0.0]
We describe an algorithm which finds the best approximation of a target probability law by a sum of $Q$ Dirac masses. The procedure is implemented by minimizing the statistical distance between the original measure and its quantized version.
arXiv Detail & Related papers (2022-12-15T21:50:54Z)
End-to-end resource analysis for quantum interior point methods and portfolio optimization [63.4863637315163]
We provide a complete quantum circuit-level description of the algorithm from problem input to problem output. We report the number of logical qubits and the quantity/depth of non-Clifford T-gates needed to run the algorithm.
arXiv Detail & Related papers (2022-11-22T18:54:48Z)
Quantum algorithms for grid-based variational time evolution [36.136619420474766]
We propose a variational quantum algorithm for performing quantum dynamics in first quantization. Our simulations exhibit the previously observed numerical instabilities of variational time propagation approaches.
arXiv Detail & Related papers (2022-03-04T19:00:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.