CafeQ: Calibration-free Quantization via Learned Transformations and Adaptive Rounding
- URL: http://arxiv.org/abs/2511.19705v1
- Date: Mon, 24 Nov 2025 21:15:16 GMT
- Title: CafeQ: Calibration-free Quantization via Learned Transformations and Adaptive Rounding
- Authors: Ziteng Sun, Adrian Benton, Samuel Kushnir, Asher Trockman, Vikas Singh, Suhas Diggavi, Ananda Theertha Suresh
- Abstract summary: Post-training quantization is an effective method for reducing the serving cost of large language models. Proposed mitigation mechanisms include applying adaptive rounding, random rotation transformations, or committing to a post-training target using calibration data. We propose algorithms to optimize transformations and adaptive rounding without access to any calibration data.
- Score: 32.70843445702854
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Post-training quantization is an effective method for reducing the serving cost of large language models, where the standard approach is a round-to-nearest quantization scheme. However, this often introduces large errors due to outliers in the weights. Proposed mitigation mechanisms include applying adaptive rounding, random rotation transformations, or committing to a post-training target using calibration data. Unfortunately, this reliance on calibration data can be severely limiting in some real-world scenarios, as such data may be unavailable or subject to privacy regulations. In this paper, we propose algorithms to optimize transformations and adaptive rounding without access to any calibration data. The optimization is achieved by designing a suitable proxy function for the quantization loss in the absence of calibration data. To maintain inference efficiency, we perform structured matrix transformations for single matrices. For paired weights that interact directly in the computation graph, we use dual matrix transformations and adaptive rounding methods. We conduct experiments on Gemma 2 models and observe consistent improvement over the baselines. For Gemma 2 9B quantization, our method improves the average benchmark score from 61.9 to 62.4 for 4-bit quantization and from 52.0 to 60.6 for 3-bit quantization, while adding less than 3% computation overhead. Furthermore, our method achieves performance comparable to the commonly used GPTQ method, which requires calibration data.
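The abstract names three ingredients: round-to-nearest quantization, orthogonal transformations that spread weight outliers, and a calibration-free proxy for the quantization loss. Below is a minimal numpy sketch of the single-matrix case; the Hadamard rotation and the weight-space Frobenius error are illustrative assumptions, not the paper's exact transform family or proxy function.
```python
import numpy as np

def round_to_nearest(w, bits=4):
    """Symmetric per-tensor round-to-nearest quantization."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    return np.round(w / scale).clip(-levels - 1, levels) * scale

def hadamard(n):
    """Normalized Hadamard matrix (n must be a power of two)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def proxy_loss(w, bits=4):
    """Calibration-free proxy: squared weight-space quantization error.
    (Illustrative stand-in for the paper's proxy function.)"""
    return np.linalg.norm(w - round_to_nearest(w, bits)) ** 2

# Rotating an outlier-heavy weight matrix spreads extreme entries
# across coordinates, which typically shrinks the proxy loss.
w = np.random.standard_cauchy((128, 128))  # heavy-tailed weights
H = hadamard(128)
print(proxy_loss(w), proxy_loss(H @ w))
```
Because H is orthogonal, the rotation can be folded out exactly at inference time, so the quantized model computes the same function up to rounding error.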
Related papers
- Correctness-Optimized Residual Activation Lens (CORAL): Transferrable and Calibration-Aware Inference-Time Steering [3.7758197704962835]
We introduce CORAL, an inference-time steering method that captures correctness signals from model-internal activations using weight-decay probes. CORAL consistently improves accuracy by 10% and reduces expected calibration error (ECE) by 50% on average. Our results support the hypothesis that distributed information in model internals can be extracted using regularized probes when individual neurons are insufficient.
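As a concrete picture of a weight-decay (ridge) probe on internal activations, here is a minimal numpy sketch; the activation shapes, the closed-form ridge solution, and the additive steering step are assumptions for illustration, not CORAL's actual training or steering procedure.
```python
import numpy as np

def fit_probe(acts, correct, l2=1e-2):
    """Ridge (weight-decay) probe mapping hidden activations to a
    correctness signal. acts: (n, d); correct: (n,) in {0, 1}."""
    d = acts.shape[1]
    targets = 2.0 * correct - 1.0              # map {0,1} -> {-1,+1}
    A = acts.T @ acts + l2 * np.eye(d)         # L2 penalty == weight decay
    return np.linalg.solve(A, acts.T @ targets)

def steer(h, w, alpha=1.0):
    """Inference-time steering: nudge activations along the probe."""
    return h + alpha * w / np.linalg.norm(w)

acts = np.random.randn(1000, 64)               # stand-in activations
correct = (acts @ np.random.randn(64) > 0).astype(float)
w = fit_probe(acts, correct)
steered = steer(acts, w)
```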
arXiv Detail & Related papers (2026-02-05T18:55:56Z)
- BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models [56.504879072674015]
We propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients. BPDQ enables serving Qwen2.5-72B on a single RTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit).
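One plausible reading of a bit-plane decomposition with scalar coefficients is a greedy sign-plane expansion W ≈ Σ_k a_k B_k with B_k ∈ {-1, +1}; the numpy sketch below is that reading, and the paper's variable-grid construction may differ.
```python
import numpy as np

def bitplane_decompose(w, planes=3):
    """Greedy sign-plane expansion: w ~= sum_k a_k * B_k with
    B_k in {-1, +1}; a_k = mean|residual| is the least-squares
    optimal coefficient for a sign plane."""
    residual = w.copy()
    coeffs, bitplanes = [], []
    for _ in range(planes):
        B = np.where(residual >= 0, 1.0, -1.0)
        a = np.abs(residual).mean()
        coeffs.append(a)
        bitplanes.append(B)
        residual = residual - a * B
    return coeffs, bitplanes

w = np.random.randn(64, 64)
coeffs, planes_ = bitplane_decompose(w)
w_hat = sum(a * B for a, B in zip(coeffs, planes_))
print(np.linalg.norm(w - w_hat) / np.linalg.norm(w))  # relative error
```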
arXiv Detail & Related papers (2026-02-04T02:54:37Z)
- WUSH: Near-Optimal Adaptive Transforms for LLM Quantization [52.77441224845925]
Quantization to low bitwidth is a standard approach for deploying large language models. A few extreme weights and activations stretch the dynamic range and reduce the effective resolution of the quantizer. We derive, for the first time, closed-form optimal linear blockwise transforms for joint weight-activation quantization.
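To see why blockwise transforms help joint weight-activation quantization, the sketch below folds a blockwise Hadamard into the activations and its inverse into the weights, leaving the matmul unchanged while spreading outliers; WUSH derives closed-form optimal transforms, so the Hadamard here is only a stand-in.
```python
import numpy as np

def hadamard(n):
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def blockwise_transform(x, block=32):
    """Apply an orthogonal transform to each block of the last axis,
    spreading outliers within the block and shrinking its range."""
    H = hadamard(block)
    out = x.reshape(*x.shape[:-1], -1, block) @ H.T
    return out.reshape(x.shape)

x = np.random.standard_cauchy((8, 128))   # outlier-prone activations
W = np.random.randn(128, 64)
xT = blockwise_transform(x)               # transform the activations...
WT = blockwise_transform(W.T).T           # ...and fold H into the weights
print(np.allclose(x @ W, xT @ WT))        # the matmul is unchanged
```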
arXiv Detail & Related papers (2025-11-30T16:17:34Z)
- CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training [73.46600457802693]
We introduce a new method that counteracts the loss induced by quantization. CAGE significantly improves accuracy over state-of-the-art methods at similar computational cost. For QAT pre-training of Llama models, CAGE matches the accuracy of the prior best method at 4 bits (W4A4).
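A hedged torch sketch of the setting CAGE operates in: quantization-aware training with a straight-through estimator, plus a crude curvature-style penalty on the rounding error standing in for CAGE's actual gradient correction, which the summary does not spell out.
```python
import torch

def ste_quantize(w, bits=4):
    """Fake quantization with a straight-through estimator: the
    forward pass uses rounded weights, the backward pass sees the
    identity through the rounding."""
    levels = 2 ** (bits - 1) - 1
    scale = w.abs().max() / levels
    w_q = (w / scale).round().clamp(-levels - 1, levels) * scale
    return w + (w_q - w).detach()

w = torch.randn(256, 128, requires_grad=True)
x = torch.randn(32, 256)
task_loss = (x @ ste_quantize(w)).pow(2).mean()
# A curvature-aware scheme adds a term that counteracts the loss
# induced by quantization; this squared-rounding-error penalty is a
# crude proxy, not CAGE's estimator:
loss = task_loss + 1e-3 * (w - ste_quantize(w).detach()).pow(2).sum()
loss.backward()
```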
arXiv Detail & Related papers (2025-10-21T16:33:57Z)
- DLBAcalib: Robust Extrinsic Calibration for Non-Overlapping LiDARs Based on Dual LBA [11.721420447780032]
This paper presents a novel targetless extrinsic calibration framework for multi-LiDAR systems. The proposed method constructs an accurate reference point-cloud map via continuous scanning from the target LiDAR. Our framework achieves an average translational error of 5 mm and a rotational error of 0.2°, with an initial error tolerance of up to 0.4 m / 30°.
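For reference, the two error metrics quoted in the summary are standard SE(3) quantities; here is a small numpy sketch (the function names are ours, not the paper's API).
```python
import numpy as np

def apply_extrinsic(points, R, t):
    """Map target-LiDAR points (n, 3) into the reference frame."""
    return points @ R.T + t

def calib_errors(R_est, t_est, R_gt, t_gt):
    """Translation error (m) and rotation error (deg) of an
    estimated extrinsic against ground truth."""
    t_err = np.linalg.norm(t_est - t_gt)
    dR = R_est @ R_gt.T
    angle = np.arccos(np.clip((np.trace(dR) - 1.0) / 2.0, -1.0, 1.0))
    return t_err, np.degrees(angle)
```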
arXiv Detail & Related papers (2025-07-12T07:48:02Z)
- Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression [55.323397702682506]
Post-training quantization (PTQ) reduces a model's memory footprint by mapping full-precision weights to low-bit weights without costly retraining. We develop a new mixed-precision PTQ approach, Task-Circuit Quantization (TaCQ), that draws parallels to automated circuit discovery.
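A simplified numpy reading of mixed-precision PTQ in the TaCQ spirit: score each weight's task saliency, keep the top fraction in full precision, and round-to-nearest the rest. The |weight × gradient| saliency and the keep fraction are assumptions; TaCQ localizes salient weights via circuit-discovery-style attribution.
```python
import numpy as np

def round_to_nearest(w, bits):
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    return np.round(w / scale).clip(-levels - 1, levels) * scale

def mixed_precision_quantize(w, saliency, keep_frac=0.01, bits=2):
    """Keep the top keep_frac most salient weights in full precision
    and quantize the rest to low bit-width."""
    k = max(1, int(keep_frac * w.size))
    thresh = np.partition(saliency.ravel(), -k)[-k]
    keep = saliency >= thresh
    return np.where(keep, w, round_to_nearest(w, bits))

w = np.random.randn(512, 512)
grads = np.random.randn(*w.shape)          # stand-in task gradients
saliency = np.abs(w * grads)               # |weight x gradient| saliency
w_mixed = mixed_precision_quantize(w, saliency)
```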
arXiv Detail & Related papers (2025-04-10T02:19:03Z)
- GPTAQ: Efficient Finetuning-Free Quantization for Asymmetric Calibration [21.474315621757594]
We introduce GPTAQ, a novel finetuning-free quantization method for compressing large-scale transformer architectures. Unlike the previous GPTQ method, which calibrates each layer independently, we always match the quantized layer's output to the exact output of the full-precision model. GPTAQ is easy to implement, using just 20 more lines of code than GPTQ, yet improves performance under low-bit quantization.
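The asymmetric-calibration idea is easiest to state as two objectives side by side; the snippet below writes both down and shows the least-squares compensation the asymmetric target implies before rounding. Variable names are ours.
```python
import numpy as np

# Layer-local (GPTQ-style) calibration reconstructs the layer's own
# float output from the quantized model's drifted inputs X_q:
#     min_Wq || X_q @ Wq - X_q @ W ||_F
# Asymmetric calibration instead targets the full-precision model's
# exact output, computed from undrifted inputs X_f:
#     min_Wq || X_q @ Wq - X_f @ W ||_F
# so error accumulated in earlier quantized layers is compensated.

X_f = np.random.randn(256, 64)                 # full-precision inputs
X_q = X_f + 0.05 * np.random.randn(256, 64)    # drifted quantized inputs
W = np.random.randn(64, 32)

# Before rounding, the asymmetric target has a least-squares optimum:
W_star = np.linalg.pinv(X_q) @ (X_f @ W)
print(np.linalg.norm(X_q @ W_star - X_f @ W))  # residual of the match
```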
arXiv Detail & Related papers (2025-04-03T15:30:43Z)
- OstQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting [20.944120156871108]
Post-training quantization (PTQ) has emerged as a widely adopted technique for compressing and accelerating Large Language Models (LLMs). The major challenge in LLM quantization is that uneven and heavy-tailed data distributions can expand the quantization range, thereby reducing bit precision for most values. We introduce the Quantization Space Utilization Rate (QSUR), a novel metric that effectively assesses the quantizability of transformed data by measuring the space utilization of the data in the quantization space.
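As a toy illustration of utilization-style metrics, here is a scalar, 1-D stand-in: the fraction of available quantization codes the data actually occupies. The paper's QSUR is a multivariate, volume-based definition, so treat this purely as intuition.
```python
import numpy as np

def utilization_1d(x, bits=4):
    """Fraction of available quantization codes the data occupies.
    A scalar toy, not the paper's multivariate QSUR."""
    levels = 2 ** bits
    scale = (x.max() - x.min()) / (levels - 1)
    codes = np.round((x - x.min()) / scale)
    return np.unique(codes).size / levels

x = np.random.standard_cauchy(10_000)   # heavy tails waste the grid
# Any range-flattening transform raises utilization; OstQuant uses
# orthogonal and scaling transformations, tanh is just for contrast:
print(utilization_1d(x), utilization_1d(np.tanh(x)))
```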
arXiv Detail & Related papers (2025-01-23T08:24:25Z)
- FlatQuant: Flatness Matters for LLM Quantization [58.28221892035609]
We propose FlatQuant, a new post-training quantization approach that enhances the flatness of weights and activations. Our approach identifies optimal affine transformations for each linear layer, calibrated in hours via a lightweight objective. It achieves less than 1% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5%.
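A minimal torch sketch of calibrating a per-layer transform against a lightweight objective, shrunk down to a learnable per-channel scaling; FlatQuant learns full affine transforms with an efficient decomposition, and the calibration batch and hyperparameters here are assumptions. Note that, unlike CafeQ, this kind of calibration needs sample inputs.
```python
import torch

def rtn_ste(w, bits=4):
    """Round-to-nearest with a straight-through gradient."""
    levels = 2 ** (bits - 1) - 1
    scale = w.abs().max() / levels
    w_q = (w / scale).round().clamp(-levels - 1, levels) * scale
    return w + (w_q - w).detach()

W = torch.randn(256, 256)
X = torch.randn(512, 256)                # stand-in calibration batch
log_s = torch.zeros(256, requires_grad=True)
opt = torch.optim.Adam([log_s], lr=1e-2)
for _ in range(200):
    s = log_s.exp()
    W_t = W * s.unsqueeze(1)             # fold the transform into W...
    X_t = X / s                          # ...and its inverse into X
    err = (X_t @ rtn_ste(W_t) - X @ W).pow(2).mean()
    opt.zero_grad(); err.backward(); opt.step()
```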
arXiv Detail & Related papers (2024-10-12T08:10:28Z)
- Norm Tweaking: High-performance Low-bit Quantization of Large Language Models [21.855106896725598]
We introduce a technique called norm tweaking, which can be used as a plugin in current PTQ methods to achieve high precision.
Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations.
This simple and effective approach makes low-bit quantization more practical for real-world applications.
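A heavily simplified numpy reading of the idea: rescale a LayerNorm's affine parameters so the quantized model's per-channel activation statistics match the float model's. Norm tweaking itself iterates updates with generated calibration text, so this one-shot moment matching is only a sketch under a linear-response assumption.
```python
import numpy as np

def tweak_norm(gamma, beta, act_float, act_quant, eps=1e-6):
    """One-shot moment matching: adjust LayerNorm affine parameters
    so quantized per-channel activation statistics match the float
    model's. act_*: (n_tokens, d) post-LayerNorm activations."""
    mu_f, sd_f = act_float.mean(0), act_float.std(0)
    mu_q, sd_q = act_quant.mean(0), act_quant.std(0)
    ratio = sd_f / (sd_q + eps)
    # assumes activations respond linearly to (gamma, beta)
    return gamma * ratio, beta * ratio + (mu_f - mu_q * ratio)
```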
arXiv Detail & Related papers (2023-09-06T06:51:15Z)
- Sharp Calibrated Gaussian Processes [58.94710279601622]
State-of-the-art approaches for designing calibrated models rely on inflating the Gaussian process posterior variance.
We present a calibration approach that generates predictive quantiles using a computation inspired by the vanilla Gaussian process posterior variance.
Our approach is shown to yield a calibrated model under reasonable assumptions.
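A compact numpy sketch of the vanilla GP posterior the summary references, plus a recalibrated quantile with a learned scale; the RBF kernel, the scale value, and the one-sided 95% quantile are illustrative assumptions rather than the paper's procedure.
```python
import numpy as np

def rbf(A, B, ls=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def gp_posterior(X, y, Xs, kernel, noise=1e-2):
    """Vanilla GP posterior mean and variance at test points Xs."""
    K = kernel(X, X) + noise * np.eye(len(X))
    Ks, Kss = kernel(X, Xs), kernel(Xs, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, Ks)
    return Ks.T @ alpha, np.diag(Kss) - (v * v).sum(0)

X = np.random.rand(50, 1); y = np.sin(6 * X[:, 0])
Xs = np.random.rand(20, 1)
mean, var = gp_posterior(X, y, Xs, rbf)
beta = 0.8   # sharpening scale; fit on held-out data in practice
upper95 = mean + 1.645 * beta * np.sqrt(np.clip(var, 0.0, None))
```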
arXiv Detail & Related papers (2023-02-23T12:17:36Z)