Optimal Formats for Weight Quantisation
- URL: http://arxiv.org/abs/2505.12988v1
- Date: Mon, 19 May 2025 11:26:54 GMT
- Title: Optimal Formats for Weight Quantisation
- Authors: Douglas Orr, Luka Ribar, Carlo Luschi
- Abstract summary: We propose a framework for systematic design and analysis of quantisation formats. We show that the strong practical performance of popular formats comes from their ability to represent values using variable-length codes.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Weight quantisation is an essential technique for enabling efficient training and deployment of modern deep learning models. However, the recipe book of quantisation formats is large and the formats are often chosen empirically. In this paper, we propose a framework for systematic design and analysis of quantisation formats. By connecting the question of format design with the classical quantisation theory, we show that the strong practical performance of popular formats comes from their ability to represent values using variable-length codes. Framing the optimisation problem as minimising the KL divergence between the original and quantised model outputs, the objective is aligned with minimising the squared quantisation error of the model parameters. We therefore develop and evaluate squared-error-optimal formats for known distributions, observing significant improvement of variable-length codes over fixed-length codes. Uniform quantisation followed by lossless compression with a variable-length code is shown to be optimal. However, we find that commonly used block formats and sparse outlier formats also outperform fixed-length codes, implying they also exploit variable-length encoding. Finally, by using the relationship between the Fisher information and KL divergence, we derive the optimal allocation of bit-widths to individual parameter tensors across the model's layers, saving up to 0.25 bits per parameter when tested with direct-cast quantisation of language models.
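The abstract's central claim can be illustrated numerically. The following sketch (an assumption-laden toy, not the paper's code) uniformly quantises Gaussian "weights" and then measures the Shannon entropy of the resulting bin indices, i.e. the average rate an ideal lossless variable-length code would achieve. The entropy falls well below the fixed-length bit budget, which is the gap the paper attributes to variable-length encoding:

```python
# Toy illustration (not the paper's implementation): uniform quantisation
# of Gaussian weights, then the entropy of the bin indices as the rate of
# an ideal lossless variable-length code over those indices.
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(100_000)        # stand-in for a weight tensor

bits = 4                                # fixed-length budget
levels = 2 ** bits
lo, hi = w.min(), w.max()
step = (hi - lo) / levels               # uniform quantisation step
idx = np.clip(((w - lo) / step).astype(int), 0, levels - 1)
w_hat = lo + (idx + 0.5) * step         # dequantised weights (bin centres)

mse = float(np.mean((w - w_hat) ** 2))  # squared quantisation error

# Entropy of the index distribution = optimal variable-length code rate
p = np.bincount(idx, minlength=levels) / idx.size
entropy = float(-(p[p > 0] * np.log2(p[p > 0])).sum())

print(f"fixed-length: {bits} bits/param")
print(f"entropy-coded: {entropy:.2f} bits/param, MSE {mse:.5f}")
```

Because a Gaussian concentrates mass in the central bins, the index entropy sits well under 4 bits per parameter at the same squared error, consistent with the paper's finding that uniform quantisation followed by lossless compression is optimal.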
Related papers
- Unified Scaling Laws for Compressed Representations [69.72517034565467]
We investigate whether a unified scaling framework can accurately predict model performance when training occurs over various compressed representations. Our main finding, demonstrated both theoretically and empirically, is that there exists a simple "capacity" metric. We extend our formulation to directly compare the accuracy potential of different compressed formats, and to derive better algorithms for training over sparse-quantized formats.
arXiv Detail & Related papers (2025-06-02T16:52:51Z) - Flexible Mixed Precision Quantization for Learned Image Compression [4.847449762378203]
We propose a Flexible Mixed Precision Quantization (FMPQ) method that assigns different bit-widths to different layers of the quantized network. We also introduce an adaptive search algorithm which reduces the time complexity of searching for the desired distribution of quantization bit-widths.
arXiv Detail & Related papers (2025-06-02T00:12:50Z) - Quantize What Counts: Bit Allocation Insights Informed by Spectral Gaps in Keys and Values [57.54443445583921]
We provide two novel theorems aimed at enhancing KV quantization methods. Our first theorem, termed Key-Value Norm Disparity, states that the key weight matrices by nature carry richer information. Our second theorem, Key-Driven Quantization, posits that prioritizing the quantization precision of keys over values induces significant improvements to the overall quantization performance.
arXiv Detail & Related papers (2025-02-20T22:24:27Z) - RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE). RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z) - Pushing the Limits of Large Language Model Quantization via the Linearity Theorem [71.3332971315821]
We present a "linearity theorem" establishing a direct relationship between the layer-wise $\ell_2$ reconstruction error and the model perplexity increase due to quantization.
This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels.
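The rotation idea behind HIGGS can be sketched in a few lines. The example below is an illustrative toy under my own assumptions (random sign flips, a Sylvester-construction Hadamard matrix, a simple min-max grid rather than an MSE-optimal one): rotating a weight vector with outliers spreads their energy across all coordinates, making the entries approximately Gaussian and shrinking the quantisation step needed to cover the range:

```python
# Toy sketch of Hadamard-rotation quantisation (assumptions: random sign
# flips, min-max uniform grid; NOT the HIGGS reference implementation).
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix via Sylvester construction; n a power of two."""
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantise(x: np.ndarray, bits: int = 4) -> np.ndarray:
    """Uniform min-max quantisation to 2**bits levels."""
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (2 ** bits)
    idx = np.clip(((x - lo) / step).astype(int), 0, 2 ** bits - 1)
    return lo + (idx + 0.5) * step

rng = np.random.default_rng(1)
n = 256
w = rng.standard_normal(n)
w[:4] *= 50.0                         # inject outliers

H = hadamard(n)
s = rng.choice([-1.0, 1.0], size=n)   # random sign flips decorrelate signs
w_rot = H @ (s * w)                   # rotated weights: outliers spread out

err_plain = float(np.mean((w - quantise(w)) ** 2))
q_rot = quantise(w_rot)
w_back = s * (H.T @ q_rot)            # invert the orthogonal rotation
err_rot = float(np.mean((w - w_back) ** 2))
print(f"plain MSE {err_plain:.4f}, rotated MSE {err_rot:.4f}")
```

Because the rotation is orthogonal, the squared error in the rotated space equals the error on the original weights, so any reduction from the better-conditioned distribution carries over directly.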
arXiv Detail & Related papers (2024-11-26T15:35:44Z) - Diffusion Product Quantization [18.32568431229839]
We explore the quantization of diffusion models in extreme compression regimes to reduce model size while maintaining performance.
We apply our compression method to the DiT model on ImageNet and consistently outperform other quantization approaches.
arXiv Detail & Related papers (2024-11-19T07:47:37Z) - Error Diffusion: Post Training Quantization with Block-Scaled Number Formats for Neural Networks [1.042733720689638]
Quantization reduces the model's hardware costs, such as data movement, storage, and operations like multiply and addition.
More exotic numerical encodings, such as block-scaled number formats, have shown advantages for utilizing a fixed bit budget to encode model parameters.
This paper presents error diffusion (ED) for post-training quantization with support for block-scaled data formats.
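A minimal sketch of the block-scaled data layout the abstract refers to (the layout here is my assumption, and this is not the paper's ED algorithm): each block of 32 weights shares one floating-point scale and stores 4-bit integers, so an outlier in one block cannot inflate the quantisation step of any other block:

```python
# Minimal sketch of a block-scaled number format (assumed layout, not the
# paper's error-diffusion method): per-block absmax scale + symmetric int4.
import numpy as np

def block_quantise(w: np.ndarray, block: int = 32, bits: int = 4) -> np.ndarray:
    """Quantise-dequantise w with one shared scale per block of `block` values."""
    w2 = w.reshape(-1, block)
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for int4
    scale = np.abs(w2).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                          # avoid divide-by-zero
    q = np.clip(np.round(w2 / scale), -qmax - 1, qmax)
    return (q * scale).ravel()

rng = np.random.default_rng(2)
w = rng.standard_normal(1024)
w[7] = 40.0                                          # a single outlier

err_block = float(np.mean((w - block_quantise(w)) ** 2))
err_tensor = float(np.mean((w - block_quantise(w, block=w.size)) ** 2))
print(f"per-block MSE {err_block:.4f}, per-tensor MSE {err_tensor:.4f}")
```

The per-tensor variant is the degenerate case with a single block, so the comparison isolates the effect of localising the scale: the outlier degrades only its own block instead of the whole tensor.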
arXiv Detail & Related papers (2024-10-15T02:40:50Z) - LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid [36.33062038680275]
Large language models (LLMs) have shown immense potential across various domains.
Post-training quantization has emerged as a promising technique to reduce memory requirements and decoding latency.
We propose LeanQuant, a novel quantization method that is accurate, versatile, and scalable.
arXiv Detail & Related papers (2024-07-14T00:23:51Z) - WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
arXiv Detail & Related papers (2024-02-19T11:33:21Z) - NUPES : Non-Uniform Post-Training Quantization via Power Exponent Search [7.971065005161565]
Quantization is a technique to convert floating-point representations to low-bit-width fixed-point representations.
We show how to learn new quantized weights over the entire quantized space.
We show the ability of the method to achieve state-of-the-art compression rates in both, data-free and data-driven configurations.
arXiv Detail & Related papers (2023-08-10T14:19:58Z) - Disentanglement via Latent Quantization [60.37109712033694]
In this work, we construct an inductive bias towards encoding to and decoding from an organized latent space.
We demonstrate the broad applicability of this approach by adding it to both basic data-reconstructing (vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models.
arXiv Detail & Related papers (2023-05-28T06:30:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.