WaterSIC: information-theoretically (near) optimal linear layer quantization
- URL: http://arxiv.org/abs/2603.04956v1
- Date: Thu, 05 Mar 2026 08:50:58 GMT
- Title: WaterSIC: information-theoretically (near) optimal linear layer quantization
- Authors: Egor Lifar, Semyon Savkin, Or Ordentlich, Yury Polyanskiy
- Abstract summary: It is shown that a popular GPTQ algorithm may have an arbitrarily large gap to the IT limit. A novel algorithm, termed "WaterSIC", is proposed and is shown to be within a rate gap of 0.255 bits to the IT limit.
- Score: 24.236435814099707
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper considers the problem of converting a given dense linear layer to low precision. The tradeoff between compressed length and output discrepancy is analyzed information-theoretically (IT). It is shown that the popular GPTQ algorithm may have an arbitrarily large gap to the IT limit. To alleviate this problem, a novel algorithm, termed "WaterSIC", is proposed and shown to be within a rate gap of 0.255 bits of the IT limit, uniformly over all possible covariance matrices of input activations. The key innovation of WaterSIC is to allocate different quantization rates to different columns (in-features) of the weight matrix, mimicking the classical IT solution known as "waterfilling". Applying WaterSIC to the Llama and Qwen families of LLMs establishes new state-of-the-art performance for all quantization rates from 1 to 4 bits.
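The waterfilling-style per-column rate allocation at the heart of WaterSIC can be illustrated with a minimal sketch. This is not the paper's algorithm; the function name, the use of per-column energies as the "gains", and the bisection on the water level are illustrative assumptions:

```python
import numpy as np

def waterfill_bits(col_energies, total_bits, iters=60):
    """Allocate per-column rates r_i = max(0, mu + 0.5*log2(e_i)),
    with the water level mu found by bisection so that sum(r_i)
    meets the total bit budget. Energies must be positive."""
    e = np.asarray(col_energies, dtype=float)
    lo, hi = -60.0, 60.0
    for _ in range(iters):
        mu = 0.5 * (lo + hi)
        r = np.maximum(0.0, mu + 0.5 * np.log2(e))
        if r.sum() > total_bits:
            hi = mu  # spent too many bits: lower the water level
        else:
            lo = mu  # budget left over: raise the water level
    mu = 0.5 * (lo + hi)
    return np.maximum(0.0, mu + 0.5 * np.log2(e))
```

Columns with larger energy (stronger influence on the layer output) receive proportionally more bits, and columns below the water level receive none, mirroring the classical waterfilling solution.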
Related papers
- High-Rate Quantized Matrix Multiplication: Theory and Practice [29.75700570685703]
This work investigates the problem of quantized matrix multiplication (MatMul). We consider two settings: 1) generic MatMul, where both matrices must be quantized (weight + activation quantization); and 2) weight-only quantization, where the second matrix is only known through the covariance matrix $\Sigma_X$ of its columns.
arXiv Detail & Related papers (2026-01-23T21:32:44Z) - Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression [57.54335545892155]
We introduce a Grouped Lattice Vector Quantization (GLVQ) framework that assigns each group of weights a customized lattice codebook. Our approach achieves a better trade-off between model size and accuracy compared to existing post-training quantization baselines.
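As a rough illustration of per-group quantization with a tuned lattice scale, the toy sketch below rounds each group of weights to a scaled integer lattice, picking the scale per group by grid search. This is a stand-in for GLVQ's learned lattice codebooks, not the paper's method; the group size and scale grid are assumptions:

```python
import numpy as np

def quantize_groups(w, group_size=4, n_scales=32):
    """Toy per-group quantizer: round each group to a scaled integer
    lattice, choosing the scale that minimizes reconstruction MSE."""
    w = np.asarray(w, dtype=float)
    out = np.empty_like(w)
    for i in range(0, len(w), group_size):
        g = w[i:i + group_size]
        best, best_err = g, np.inf
        for s in np.linspace(0.05, 1.0, n_scales) * (np.abs(g).max() + 1e-12):
            q = s * np.round(g / s)       # nearest point of the scaled lattice
            err = np.sum((q - g) ** 2)
            if err < best_err:
                best, best_err = q, err
        out[i:i + group_size] = best
    return out
```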
arXiv Detail & Related papers (2025-10-23T20:19:48Z) - InfoQ: Mixed-Precision Quantization via Global Information Flow [3.4096951613673068]
Mixed-precision quantization (MPQ) is crucial for deploying deep neural networks on resource-constrained devices. We introduce InfoQ, a novel framework for MPQ that is training-free in the bit-width search phase.
arXiv Detail & Related papers (2025-08-06T11:07:49Z) - BAQ: Efficient Bit Allocation Quantization for Large Language Models [8.427223431012454]
Post-training model quantization is a widely adopted technique for reducing the memory and computational costs of large language models. Most existing methods rely on uniform or heuristic bitwidth assignments, failing to account for the nonuniform sensitivity of weights to quantization noise. We propose a novel framework for allocating quantization bitwidths based on sensitivity metrics derived from a Hessian proxy.
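The idea of sensitivity-driven bit allocation can be sketched with a greedy loop, assuming a standard 4^-b distortion model for b-bit uniform quantization (i.e. error halves in amplitude per extra bit). This is an illustration of the general principle, not BAQ's actual procedure:

```python
import numpy as np

def allocate_bits(sensitivity, total_bits):
    """Greedily assign bits one at a time to the group whose error
    (modeled as sensitivity_i * 4**-b_i) would drop the most."""
    s = np.asarray(sensitivity, dtype=float)
    b = np.zeros_like(s, dtype=int)
    for _ in range(total_bits):
        # one extra bit reduces group i's error by s_i * 4**-b_i * (1 - 1/4)
        gain = s * (4.0 ** (-b.astype(float))) * 0.75
        i = int(np.argmax(gain))
        b[i] += 1
    return b
```

Groups with higher sensitivity soak up bits first, but with diminishing returns, so less sensitive groups still receive bits once the sensitive ones are well covered.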
arXiv Detail & Related papers (2025-06-06T01:27:01Z) - GPTAQ: Efficient Finetuning-Free Quantization for Asymmetric Calibration [21.474315621757594]
We introduce GPTAQ, a novel finetuning-free quantization method for compressing large-scale transformer architectures. Unlike the previous GPTQ method, which independently calibrates each layer, we always match the quantized layer's output to the exact output in the full-precision model. GPTAQ is easy to implement, using only 20 more lines of code than GPTQ while improving its performance under low-bit quantization.
arXiv Detail & Related papers (2025-04-03T15:30:43Z) - PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models [64.84734437930362]
Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub-2-bit) quantization. We propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61 bits for the first time. Experiments indicate our PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization.
arXiv Detail & Related papers (2025-02-18T08:04:58Z) - Pushing the Limits of Large Language Model Quantization via the Linearity Theorem [71.3332971315821]
We present a "linearity theorem" establishing a direct relationship between the layer-wise $\ell_2$ reconstruction error and the model perplexity increase due to quantization.
This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels.
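The rotate-then-quantize idea behind HIGGS (a Hadamard rotation that spreads weight energy evenly across coordinates, followed by rounding to a grid) can be sketched as below. The uniform grid and its step size are simplifying assumptions here, not the paper's MSE-optimal grids, and the Hadamard construction requires power-of-two dimensions:

```python
import numpy as np

def hadamard(n):
    """Normalized Hadamard matrix via the Sylvester construction
    (n must be a power of two); the result is orthogonal."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_rotated(w, bits=4):
    """Rotate with a Hadamard transform, round to a uniform grid
    spanning the rotated range, then rotate back."""
    H = hadamard(len(w))
    z = H @ w
    step = (2.0 * np.abs(z).max() + 1e-12) / (2 ** bits)
    zq = step * np.round(z / step)
    return H.T @ zq  # H is orthogonal, so H.T inverts the rotation
```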
arXiv Detail & Related papers (2024-11-26T15:35:44Z) - Reweighted Solutions for Weighted Low Rank Approximation [47.790126028106734]
Weighted low rank approximation (WLRA) is an important yet challenging primitive with applications ranging from statistical analysis to signal processing.
In this work, we introduce a new relaxed solution to WLRA which outputs a matrix that is not necessarily low rank, but can be stored using very few parameters.
Our central idea is to use the weight matrix itself to reweight a low rank solution, which gives an extremely simple algorithm with remarkable empirical performance.
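The weighted low-rank approximation objective, together with plain truncated SVD as the unweighted baseline, can be written down in a few lines. This is only the problem setup and baseline, not the paper's reweighting algorithm:

```python
import numpy as np

def truncated_svd(A, k):
    """Best rank-k approximation of A in unweighted Frobenius norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def weighted_err(W, A, A_hat):
    """WLRA objective: Frobenius norm of the entrywise-weighted residual."""
    return np.linalg.norm(W * (A - A_hat))
```

With nonuniform W the truncated SVD is no longer optimal, which is what makes WLRA hard; the paper's relaxation trades exact low rank for a compactly parameterized solution with low weighted error.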
arXiv Detail & Related papers (2024-06-04T15:50:35Z) - Training quantum neural networks using the Quantum Information Bottleneck method [0.6768558752130311]
We provide a concrete method for training a quantum neural network to maximize the relevant information about a property that is transmitted through the network.
This is significant because it gives an operationally well founded quantity to optimize when training autoencoders for problems where the inputs and outputs are fully quantum.
arXiv Detail & Related papers (2022-12-05T21:11:32Z) - Taming Hyperparameter Tuning in Continuous Normalizing Flows Using the JKO Scheme [60.79981399724534]
A normalizing flow (NF) is a mapping that transforms a chosen probability distribution to a normal distribution.
We present JKO-Flow, an algorithm that solves OT-based CNF without the need to tune $\alpha$.
arXiv Detail & Related papers (2022-11-30T05:53:21Z) - Automatic and effective discovery of quantum kernels [41.61572387137452]
Quantum computing can empower machine learning models by enabling kernel machines to leverage quantum kernels for representing similarity measures between data. We present an approach to this problem, which employs optimization techniques, similar to those used in neural architecture search and AutoML. The results obtained by testing our approach on a high-energy physics problem demonstrate that, in the best-case scenario, we can either match or improve testing accuracy with respect to the manual design approach.
arXiv Detail & Related papers (2022-09-22T16:42:14Z) - RMSMP: A Novel Deep Neural Network Quantization Framework with Row-wise Mixed Schemes and Multiple Precisions [43.27226390407956]
This work proposes a novel Deep Neural Network (DNN) quantization framework, namely RMSMP, with a Row-wise Mixed-Scheme and Multi-Precision approach.
The proposed RMSMP is tested for the image classification and natural language processing (BERT) applications.
It achieves the best accuracy among state-of-the-art methods under the same equivalent precisions.
arXiv Detail & Related papers (2021-10-30T02:53:35Z) - An Information Theory-inspired Strategy for Automatic Network Pruning [97.03772272417599]
Deep convolutional neural networks often need to be compressed for deployment on devices with resource constraints. Most existing network pruning methods require laborious human effort and prohibitive computational resources. We propose an information theory-inspired strategy for automatic model compression.
arXiv Detail & Related papers (2021-08-19T07:03:22Z) - Fast algorithm for quantum polar decomposition, pretty-good measurements, and the Procrustes problem [0.0]
We show that the problem of quantum polar decomposition has a simple and concise implementation via the quantum singular value transformation (QSVT).
We focus on the applications to pretty-good measurements, a close-to-optimal measurement to distinguish quantum states, and the quantum Procrustes problem.
arXiv Detail & Related papers (2021-06-14T17:50:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.