Related papers: CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training

URL: http://arxiv.org/abs/2510.18784v2
Date: Mon, 10 Nov 2025 17:53:51 GMT
Title: CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training
Authors: Soroush Tabesh, Mher Safaryan, Andrei Panferov, Alexandra Volkova, Dan Alistarh
Abstract summary: We introduce a new method that counteracts the loss induced by quantization. CAGE significantly improves upon the state-of-the-art methods in terms of accuracy, for similar computational cost. For QAT pre-training of Llama models, CAGE with 3-bit weights and activations (W3A3) matches the accuracy achieved at 4 bits (W4A4) with the prior best method.
Score: 73.46600457802693
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite significant work on low-bit quantization-aware training (QAT), there is still an accuracy gap between such techniques and native training. To address this, we introduce CAGE (Curvature-Aware Gradient Estimation), a new QAT method that augments the straight-through estimator (STE) gradient with a curvature-aware correction designed to counteract the loss increase induced by quantization. CAGE is derived from a multi-objective view of QAT that balances loss minimization with the quantization constraints, yielding a principled correction term that depends on local curvature information. On the theoretical side, we introduce the notion of Pareto-optimal solutions for quantized optimization, and establish that CAGE yields strong convergence guarantees in the smooth non-convex setting. In terms of implementation, our approach is optimizer-agnostic, but we provide a highly efficient implementation that leverages Adam statistics. CAGE significantly improves upon the prior state-of-the-art methods in terms of accuracy, for similar computational cost: for QAT fine-tuning, it halves the compression accuracy loss relative to the prior best method, while for QAT pre-training of Llama models, its accuracy for 3-bit weights-and-activations (W3A3) matches the accuracy achieved at 4 bits (W4A4) with the prior best method. The official implementation can be found at https://github.com/IST-DASLab/CAGE .
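
Illustrative sketch (not from the paper): the abstract refers to the straight-through estimator (STE) that CAGE augments. The toy example below shows only the plain STE baseline on a one-dimensional quadratic loss; it does not implement the curvature-aware correction. The names `quantize`, `loss_grad`, and `ste_train`, along with the grid spacing, target, and learning rate, are assumptions chosen for illustration.

```python
import math


def quantize(w, step=0.5):
    """Round-to-nearest uniform quantizer on a grid with spacing `step`."""
    return step * math.floor(w / step + 0.5)


def loss_grad(q, target=0.8):
    """Gradient of the toy loss 0.5 * (q - target)**2 with respect to q."""
    return q - target


def ste_train(w0=0.0, lr=0.1, steps=50, step=0.5, target=0.8):
    """Plain STE training loop on a single latent weight (hypothetical toy setup)."""
    w = w0
    for _ in range(steps):
        q = quantize(w, step)      # forward pass uses the quantized weight
        g = loss_grad(q, target)   # gradient is evaluated at the quantized point
        w -= lr * g                # STE: pretend d(quantize)/dw == 1
    return w, quantize(w, step)


if __name__ == "__main__":
    latent, quantized = ste_train()
    print("latent weight:", latent, "quantized weight:", quantized)
```

In this toy setting the latent weight drifts toward the rounding boundary between the two grid points nearest the target and then hovers there, which is one way to picture the mismatch between the quantized forward pass and the identity backward pass that correction terms of this kind aim to reduce.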

Related papers

What Makes Low-Bit Quantization-Aware Training Work for Reasoning LLMs? A Systematic Study [59.44848132298657]
Post-training quantization (PTQ) usually comes with the cost of large accuracy drops, especially for reasoning tasks under low-bit settings. In this study, we present a systematic empirical study of quantization-aware training (QAT) for reasoning models.
arXiv Detail & Related papers (2026-01-21T11:22:29Z)
Beyond Outliers: A Study of Optimizers Under Quantization [82.75879062804955]
We study the impact of optimizer choice on model robustness under quantization. We evaluate how model performance degrades when trained with different baselines. We derive scaling laws for quantization-aware training under different parameters.
arXiv Detail & Related papers (2025-09-27T21:15:22Z)
Compute-Optimal Quantization-Aware Training [50.98555000360485]
Quantization-aware training (QAT) is a leading technique for improving the accuracy of quantized neural networks. Previous work has shown that decomposing training into a full-precision (FP) phase followed by a QAT phase yields superior accuracy. We investigate how different QAT durations impact final performance.
arXiv Detail & Related papers (2025-09-26T21:09:54Z)
Progressive Element-wise Gradient Estimation for Neural Network Quantization [2.1413624861650358]
Quantization-Aware Training (QAT) methods rely on the Straight-Through Estimator (STE) to address the non-differentiability of discretization functions. We propose Progressive Element-wise Gradient Estimation (PEGE) to address discretization errors between continuous and quantized values. PEGE consistently outperforms existing backpropagation methods and enables low-precision models to match or even outperform the accuracy of their full-precision counterparts.
arXiv Detail & Related papers (2025-08-27T15:59:36Z)
End-to-End On-Device Quantization-Aware Training for LLMs at Inference Cost [53.25965863436039]
Quantization-aware training (QAT) provides a more principled solution, but its reliance on backpropagation incurs prohibitive memory costs. We propose ZeroQAT, a zeroth-order optimization-based QAT framework that supports both weight and activation quantization. Experiments show that ZeroQAT consistently outperforms representative PTQ and QAT baselines while requiring significantly less memory.
arXiv Detail & Related papers (2025-08-21T01:18:27Z)
First-Order Error Matters: Accurate Compensation for Quantized Large Language Models [32.69069234109942]
Post-training quantization (PTQ) offers an efficient approach to compressing large language models (LLMs). Existing compensation-based weight calibration methods often rely on a second-order Taylor expansion to model quantization error. We propose FOEM, a novel PTQ method that explicitly incorporates first-order gradient terms to improve quantization error compensation.
arXiv Detail & Related papers (2025-07-15T06:18:46Z)
Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics [64.62231094774211]
Stateful optimizers (e.g., Adam) maintain auxiliary information of up to 2x the model size in order to achieve optimal convergence. SOLO enables Adam-style optimizers to maintain quantized states with precision as low as 3 bits, or even 2 bits. SOLO can thus be seamlessly applied to Adam-style optimizers, leading to substantial memory savings with minimal accuracy loss.
arXiv Detail & Related papers (2025-05-01T06:47:45Z)
GAQAT: gradient-adaptive quantization-aware training for domain generalization [54.31450550793485]
We propose a novel Gradient-Adaptive Quantization-Aware Training (GAQAT) framework for domain generalization (DG). Our approach begins by identifying the scale-gradient conflict problem in low-precision quantization. Extensive experiments validate the effectiveness of the proposed GAQAT framework.
arXiv Detail & Related papers (2024-12-07T06:07:21Z)
Error-aware Quantization through Noise Tempering [43.049102196902844]
Quantization-aware training (QAT) optimizes model parameters with respect to the end task while simulating quantization error. In this work, we incorporate exponentially decaying quantization-error-aware noise together with a learnable scale of task loss gradient to approximate the effect of a quantization operator. Our method obtains state-of-the-art top-1 classification accuracy for uniform (non-mixed-precision) quantization, outperforming previous methods by 0.5-1.2% absolute.
arXiv Detail & Related papers (2022-12-11T20:37:50Z)
SQuAT: Sharpness- and Quantization-Aware Training for BERT [43.049102196902844]
We propose sharpness- and quantization-aware training (SQuAT). Our method can consistently outperform state-of-the-art quantized BERT models under 2, 3, and 4-bit settings by 1%. Our experiments on empirical measurement of sharpness also suggest that our method would lead to flatter minima compared to other quantization methods.
arXiv Detail & Related papers (2022-10-13T16:52:19Z)
Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training [8.106641866299377]
Current practices rely on heuristics to set clipping threshold scalars and cannot be shown to be optimal. We propose Optimally Clipped Tensors And Vectors (OCTAV), an algorithm to determine MSE-optimal clipping scalars. OCTAV finds optimal clipping scalars on the fly, for every tensor, at every iteration of the quantization-aware training (QAT) routine.
arXiv Detail & Related papers (2022-06-13T22:15:21Z)
Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose. We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)

This list is automatically generated from the titles and abstracts of the papers on this site.