CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training
- URL: http://arxiv.org/abs/2510.18784v2
- Date: Mon, 10 Nov 2025 17:53:51 GMT
- Title: CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training
- Authors: Soroush Tabesh, Mher Safaryan, Andrei Panferov, Alexandra Volkova, Dan Alistarh,
- Abstract summary: We introduce a new method that counteracts the loss induced by quantization.<n>CAGE significantly improves upon the state-of-theart methods in terms of accuracy, for similar computational cost.<n>For QAT pre-training of Llama models, CAGE matches the accuracy achieved at 4-bits (W4A4) with the prior best method.
- Score: 73.46600457802693
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite significant work on low-bit quantization-aware training (QAT), there is still an accuracy gap between such techniques and native training. To address this, we introduce CAGE (Curvature-Aware Gradient Estimation), a new QAT method that augments the straight-through estimator (STE) gradient with a curvature-aware correction designed to counteract the loss increase induced by quantization. CAGE is derived from a multi-objective view of QAT that balances loss minimization with the quantization constraints, yielding a principled correction term that depends on local curvature information. On the theoretical side, we introduce the notion of Pareto-optimal solutions for quantized optimization, and establish that CAGE yields strong convergence guarantees in the smooth non-convex setting. In terms of implementation, our approach is optimizer-agnostic, but we provide a highly-efficient implementation that leverages Adam statistics. CAGE significantly improves upon the prior state-of-the-art methods in terms of accuracy, for similar computational cost: for QAT fine-tuning, it halves the compression accuracy loss relative to the prior best method, while for QAT pre-training of Llama models, its accuracy for 3-bit weights-and-activations (W3A3) matches the accuracy achieved at 4-bits (W4A4) with the prior best method. The official implementation can be found over https://github.com/IST-DASLab/CAGE .
Related papers
- What Makes Low-Bit Quantization-Aware Training Work for Reasoning LLMs? A Systematic Study [59.44848132298657]
Post-training quantization (PTQ) usually comes with the cost of large accuracy drops, especially for reasoning tasks under low-bit settings.<n>In this study, we present a systematic empirical study of quantization-aware training (QAT) for reasoning models.
arXiv Detail & Related papers (2026-01-21T11:22:29Z) - Beyond Outliers: A Study of Optimizers Under Quantization [82.75879062804955]
We study impact of choice on model robustness under quantization.<n>We evaluate how model performance degrades when trained with different baselines.<n>We derive scaling laws for quantization-aware training under different parameters.
arXiv Detail & Related papers (2025-09-27T21:15:22Z) - Compute-Optimal Quantization-Aware Training [50.98555000360485]
Quantization-aware training (QAT) is a leading technique for improving the accuracy of quantized neural networks.<n>Previous work has shown that decomposing training into a full-precision (FP) phase followed by a QAT phase yields superior accuracy.<n>We investigate how different QAT durations impact final performance.
arXiv Detail & Related papers (2025-09-26T21:09:54Z) - Progressive Element-wise Gradient Estimation for Neural Network Quantization [2.1413624861650358]
Quantization-Aware Training (QAT) methods rely on the Straight-Through Estimator (STE) to address the non-differentiability of discretization functions.<n>We propose Progressive Element-wise Gradient Estimation (PEGE) to address discretization errors between continuous and quantized values.<n>PEGE consistently outperforms existing backpropagation methods and enables low-precision models to match or even outperform the accuracy of their full-precision counterparts.
arXiv Detail & Related papers (2025-08-27T15:59:36Z) - End-to-End On-Device Quantization-Aware Training for LLMs at Inference Cost [53.25965863436039]
Quantization-aware training (QAT) provides a more principled solution, but its reliance on backpropagation incurs prohibitive memory costs.<n>We propose ZeroQAT, a zeroth-order optimization-based QAT framework that supports both weight and activation quantization.<n>Experiments show that ZeroQAT consistently outperforms representative PTQ and QAT baselines while requiring significantly less memory.
arXiv Detail & Related papers (2025-08-21T01:18:27Z) - First-Order Error Matters: Accurate Compensation for Quantized Large Language Models [32.69069234109942]
Post-training quantization (PTQ) offers an efficient approach to compressing large language models (LLMs)<n>Existing compensation-based weight calibration methods often rely on a second-order Taylor expansion to model quantization error.<n>We propose FOEM, a novel PTQ method that explicitly incorporates first-order gradient terms to improve quantization error compensation.
arXiv Detail & Related papers (2025-07-15T06:18:46Z) - Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics [64.62231094774211]
Statefuls (e.g., Adam) maintain auxiliary information even 2x the model size in order to achieve optimal convergence.<n>SOLO enables Adam-styles to maintain quantized states with precision as low as 3 bits, or even 2 bits.<n>SOLO can thus be seamlessly applied to Adam-styles, leading to substantial memory savings with minimal accuracy loss.
arXiv Detail & Related papers (2025-05-01T06:47:45Z) - GAQAT: gradient-adaptive quantization-aware training for domain generalization [54.31450550793485]
We propose a novel Gradient-Adaptive Quantization-Aware Training (GAQAT) framework for DG.<n>Our approach begins by identifying the scale-gradient conflict problem in low-precision quantization.<n>Extensive experiments validate the effectiveness of the proposed GAQAT framework.
arXiv Detail & Related papers (2024-12-07T06:07:21Z) - Error-aware Quantization through Noise Tempering [43.049102196902844]
Quantization-aware training (QAT) optimize model parameters with respect to the end task while simulating quantization error.
In this work, we incorporate exponentially decaying quantization-error-aware noise together with a learnable scale of task loss gradient to approximate the effect of a quantization operator.
Our method obtains state-of-the-art top-1 classification accuracy for uniform (non mixed-precision) quantization, out-performing previous methods by 0.5-1.2% absolute.
arXiv Detail & Related papers (2022-12-11T20:37:50Z) - SQuAT: Sharpness- and Quantization-Aware Training for BERT [43.049102196902844]
We propose sharpness- and quantization-aware training (SQuAT)
Our method can consistently outperform state-of-the-art quantized BERT models under 2, 3, and 4-bit settings by 1%.
Our experiments on empirical measurement of sharpness also suggest that our method would lead to flatter minima compared to other quantization methods.
arXiv Detail & Related papers (2022-10-13T16:52:19Z) - Optimal Clipping and Magnitude-aware Differentiation for Improved
Quantization-aware Training [8.106641866299377]
Current practices rely on scalars to set clipping threshold scalars and cannot be shown to be optimal.
We propose Optimally Clippeds And Vectors ( OCTAV), a algorithm to determine MSE-optimal clipping scalars.
OCTAV finds optimal clipping scalars on the fly, for every tensor, at every iteration of the quantization-aware training (QAT) routine.
arXiv Detail & Related papers (2022-06-13T22:15:21Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.