First-Order Error Matters: Accurate Compensation for Quantized Large Language Models
- URL: http://arxiv.org/abs/2507.11017v1
- Date: Tue, 15 Jul 2025 06:18:46 GMT
- Title: First-Order Error Matters: Accurate Compensation for Quantized Large Language Models
- Authors: Xingyu Zheng, Haotong Qin, Yuye Li, Jiakai Wang, Jinyang Guo, Michele Magno, Xianglong Liu
- Abstract summary: Post-training quantization (PTQ) offers an efficient approach to compressing large language models (LLMs). Existing compensation-based weight calibration methods often rely on a second-order Taylor expansion to model quantization error. We propose FOEM, a novel PTQ method that explicitly incorporates first-order gradient terms to improve quantization error compensation.
- Score: 32.69069234109942
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Post-training quantization (PTQ) offers an efficient approach to compressing large language models (LLMs), significantly reducing memory access and computational costs. Existing compensation-based weight calibration methods often rely on a second-order Taylor expansion to model quantization error, under the assumption that the first-order term is negligible in well-trained full-precision models. However, we reveal that the progressive compensation process introduces accumulated first-order deviations between latent weights and their full-precision counterparts, making this assumption fundamentally flawed. To address this, we propose FOEM, a novel PTQ method that explicitly incorporates first-order gradient terms to improve quantization error compensation. FOEM approximates gradients by directly computing the difference between latent and full-precision weights, avoiding the high cost and limited generalization of backpropagation-based gradient computation. This approach introduces minimal additional computational overhead. Moreover, FOEM leverages precomputed Cholesky factors to efficiently recover the inverse of Hessian submatrices in real time. Extensive experiments across a wide range of models and benchmarks demonstrate that FOEM consistently outperforms the classical GPTQ method. In 3-bit weight-only quantization, FOEM reduces the perplexity of Llama3-8B by 89.6%, and improves the 5-shot MMLU accuracy of Llama3-70B from 51.7% to 74.9%, approaching the full-precision performance of 78.6%. Furthermore, FOEM can be seamlessly integrated with advanced techniques such as GPTAQ and SpinQuant, yielding additional improvements under the challenging W4A4KV4 setting, and further narrowing the accuracy gap with full-precision baselines beyond what current state-of-the-art methods achieve. The code is available at https://github.com/Xingyu-Zheng/FOEM.
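For intuition, here is a minimal NumPy sketch of a GPTQ-style column-wise compensation loop extended with a first-order term in the spirit of FOEM, where the gradient is approximated by the drift between the latent (already compensated) weights and the full-precision ones. The function names, the `alpha` knob, and the exact form of the combined correction are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import numpy as np

def quantize_col(w, scale, n_bits=4):
    """Round-to-nearest symmetric quantization of one weight column (illustrative)."""
    qmax = 2 ** (n_bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def first_order_aware_gptq(W_fp, X, scale, damp=0.01, alpha=1.0):
    """
    Hypothetical sketch: GPTQ-style sequential compensation with an added
    first-order correction. The first-order "gradient" is approximated by the
    difference between latent and full-precision weights, so it costs only a
    subtraction (no backpropagation), as the FOEM abstract describes.

    W_fp : (rows, cols) full-precision weight matrix of one linear layer
    X    : (samples, cols) calibration inputs to that layer
    """
    n_cols = W_fp.shape[1]
    H = X.T @ X                                        # layer-wise Hessian proxy
    H += damp * np.mean(np.diag(H)) * np.eye(n_cols)   # damping, as in GPTQ
    U = np.linalg.cholesky(np.linalg.inv(H)).T         # upper factor: H^-1 = U^T U

    W = W_fp.copy()        # latent weights, updated in place as columns are quantized
    Q = np.zeros_like(W)
    for j in range(n_cols):
        Q[:, j] = quantize_col(W[:, j], scale)
        # Second-order (GPTQ) compensation for the quantization error of column j.
        err = (W[:, j] - Q[:, j]) / U[j, j]
        # First-order correction: the accumulated latent-vs-full-precision drift
        # stands in for the gradient; alpha and this combined form are illustrative,
        # not the paper's closed-form update.
        drift = (W[:, j] - W_fp[:, j]) / U[j, j]
        W[:, j + 1:] -= np.outer(err + alpha * drift, U[j, j + 1:])
    return Q

# Toy usage on random data:
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X = rng.normal(size=(256, 16))
Q = first_order_aware_gptq(W, X, scale=np.abs(W).max() / 7)
```

The abstract also mentions reusing precomputed Cholesky factors to recover inverses of Hessian submatrices in real time; the sketch simply inverts the damped Hessian once for clarity.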
Related papers
- MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models.
arXiv Detail & Related papers (2025-07-06T08:16:50Z)
- Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics [64.62231094774211]
Stateful optimizers (e.g., Adam) maintain auxiliary information that can be as large as 2x the model size in order to achieve optimal convergence. SOLO enables Adam-style optimizers to maintain quantized states with precision as low as 3 bits, or even 2 bits. SOLO can thus be seamlessly applied to Adam-style optimizers, leading to substantial memory savings with minimal accuracy loss.
arXiv Detail & Related papers (2025-05-01T06:47:45Z)
- PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models [64.84734437930362]
Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub 2-bit) quantization. We propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61-bit for the first time. Experiments indicate our PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization.
arXiv Detail & Related papers (2025-02-18T08:04:58Z)
- GAQAT: gradient-adaptive quantization-aware training for domain generalization [54.31450550793485]
We propose a novel Gradient-Adaptive Quantization-Aware Training (GAQAT) framework for DG. Our approach begins by identifying the scale-gradient conflict problem in low-precision quantization. Extensive experiments validate the effectiveness of the proposed GAQAT framework.
arXiv Detail & Related papers (2024-12-07T06:07:21Z)
- SLiM: One-shot Quantization and Sparsity with Low-rank Approximation for LLM Weight Compression [7.6131620435684875]
SLIM is a new one-shot compression framework that holistically integrates hardware-friendly quantization, sparsity, and low-rank approximation. SLIM improves model accuracy by up to 5.66% (LLaMA-2-7B) for 2:4 sparsity with 4-bit weight quantization, outperforming prior methods.
arXiv Detail & Related papers (2024-10-12T18:36:07Z)
- QERA: an Analytical Framework for Quantization Error Reconstruction [12.110441045050223]
There is increasing interest in quantizing weights to extremely low precision while offsetting the resulting error with low-rank, high-precision error-reconstruction terms. The combination of quantization and low-rank approximation is now popular in adapter-based, parameter-efficient fine-tuning methods. We formulate an analytical framework, named Quantization Error Reconstruction Analysis (QERA), and offer a closed-form solution to the problem.
arXiv Detail & Related papers (2024-10-08T13:37:34Z)
- Continuous Approximations for Improving Quantization Aware Training of LLMs [4.435218424434634]
Quantization Aware Training (QAT), an effective model compression method, is proposed to reduce performance degradation after quantization.
We introduce two continuous approximations to the QAT process on the rounding function, traditionally approximated by the Straight-Through Estimator (STE) and the clamping function.
By applying both approximations, the quantized model reaches a perplexity (PPL) of 9.0815 on the WikiText-v2 dataset, compared with the baseline's 9.9621. An illustrative sketch of such a rounding surrogate appears after this list.
arXiv Detail & Related papers (2024-10-06T04:33:06Z)
- Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners [51.32182730502002]
We introduce Singular-value Diagonal Expansion to refine weight distributions to achieve better quantization alignment. Our plug-and-play weight-quantization methods demonstrate substantial performance improvements over state-of-the-art approaches.
arXiv Detail & Related papers (2024-07-22T09:45:16Z)
- Towards Accurate Post-Training Quantization of Vision Transformers via Error Reduction [48.740630807085566]
Post-training quantization (PTQ) for vision transformers (ViTs) has received increasing attention from both academic and industrial communities. Current methods fail to account for the complex interactions between quantized weights and activations, resulting in significant quantization errors and suboptimal performance. This paper presents ERQ, an innovative two-step PTQ method specifically crafted to reduce quantization errors arising from activation and weight quantization sequentially.
arXiv Detail & Related papers (2024-07-09T12:06:03Z)
- Minimize Quantization Output Error with Bias Compensation [35.43358597502087]
Quantization is a promising method that reduces the memory usage and computational intensity of Deep Neural Networks (DNNs).
In this paper, we propose a bias-compensation method that improves accuracy by minimizing the quantization output error rather than optimizing the quantization process itself.
We conduct experiments on Vision models and Large Language Models.
arXiv Detail & Related papers (2024-04-02T12:29:31Z)
- AffineQuant: Affine Transformation Quantization for Large Language Models [58.45460102764]
Post-Training Quantization (PTQ) has emerged as a subject of considerable interest due to its compression efficiency and cost-effectiveness in the context of training.
Existing PTQ methods for Large-scale Language Models (LLMs) limit the optimization scope to scaling transformations between pre- and post-quantization weights.
In this paper, we advocate for direct optimization using equivalent affine transformations in PTQ (AffineQuant).
arXiv Detail & Related papers (2024-03-19T08:40:21Z)
- APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models [12.006605064782567]
We propose APTQ (Attention-aware Post-Training Mixed-Precision Quantization) for Large Language Models.
We leverage the Hessian trace as a sensitivity metric for mixed-precision quantization, ensuring an informed precision reduction.
Experiments show APTQ surpasses previous quantization methods, achieving a perplexity of 5.22 at an average bit width of 4.
arXiv Detail & Related papers (2024-02-21T07:45:22Z)
- Norm Tweaking: High-performance Low-bit Quantization of Large Language Models [21.855106896725598]
We introduce a technique called norm tweaking, which can be used as a plugin in current PTQ methods to achieve high precision.
Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations.
Our simple and effective approach makes it more practical for real-world applications.
arXiv Detail & Related papers (2023-09-06T06:51:15Z)
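As a side note on the "Continuous Approximations for Improving Quantization Aware Training of LLMs" entry above: the snippet below illustrates, generically, how a differentiable rounding surrogate differs from the Straight-Through Estimator during QAT. The sigmoid-based `soft_round` and its temperature `t` are our own illustrative choices, not necessarily the paper's exact approximations.

```python
import numpy as np

def hard_round(x):
    """Non-differentiable rounding used at inference time."""
    return np.round(x)

def ste_grad(x):
    """Straight-Through Estimator: pretend d round(x)/dx == 1 everywhere."""
    return np.ones_like(x)

def soft_round(x, t=0.3):
    """A sigmoid-based continuous surrogate for rounding; approaches hard rounding as t -> 0."""
    frac = x - np.floor(x)
    return np.floor(x) + 1.0 / (1.0 + np.exp(-(frac - 0.5) / t))

def soft_round_grad(x, t=0.3):
    """Analytic gradient of soft_round, usable in backpropagation instead of the STE."""
    s = 1.0 / (1.0 + np.exp(-((x - np.floor(x)) - 0.5) / t))
    return s * (1.0 - s) / t

x = np.linspace(-2.0, 2.0, 11)
print(hard_round(x))
print(soft_round(x, t=0.01))    # essentially hard rounding at very low temperature
print(soft_round_grad(x))       # smooth, input-dependent gradient vs. the STE's constant 1
```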