HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning
- URL: http://arxiv.org/abs/2601.21626v1
- Date: Thu, 29 Jan 2026 12:27:05 GMT
- Title: HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning
- Authors: Jinhao Zhang, Yunquan Zhang, Zicheng Yan, Boyang Zhang, Jun Sun, Daning Cheng
- Abstract summary: We propose the Hessian Robust Quantization (HeRo-Q) algorithm, which applies a lightweight, learnable rotation-compression matrix to the weight space prior to quantization. Experiments on Llama and Qwen models show that HeRo-Q consistently outperforms state-of-the-art methods including GPTQ, AWQ, and SpinQuant.
- Score: 5.407724832457912
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Post-Training Quantization (PTQ), a mainstream model compression technique, often leads to the paradoxical 'low error, high loss' phenomenon because it focuses solely on minimizing quantization error. The root cause lies in the Hessian matrix of the LLM loss landscape: a few high-curvature directions are extremely sensitive to perturbations. To address this, we propose the Hessian Robust Quantization (HeRo-Q) algorithm, which applies a lightweight, learnable rotation-compression matrix to the weight space prior to quantization. This joint framework reshapes the loss landscape by reducing the largest Hessian eigenvalue, thereby significantly enhancing robustness to quantization noise. HeRo-Q requires no architectural modifications, incurs negligible computational overhead, and integrates seamlessly into existing PTQ pipelines. Experiments on Llama and Qwen models show that HeRo-Q consistently outperforms state-of-the-art methods including GPTQ, AWQ, and SpinQuant, not only achieving superior performance under standard W4A8 settings but also excelling in the highly challenging W3A16 ultra-low-bit regime, where it boosts GSM8K accuracy on Llama3-8B to 70.15% and effectively avoids the logical collapse commonly seen in aggressive quantization.
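A second-order view makes the 'low error, high loss' phenomenon concrete. The expansion below is standard Taylor analysis around the full-precision weights, not an equation taken from the paper:

```latex
\Delta L \;\approx\; \tfrac{1}{2}\,\delta^{\top} H\, \delta
\;\le\; \tfrac{1}{2}\,\lambda_{\max}(H)\,\lVert\delta\rVert_2^{2},
\qquad \delta = W_q - W^{\ast},
% gradient term dropped since g \approx 0 at a trained optimum, H taken PSD
```

so a quantization perturbation with a small norm (low error) can still cause a large loss increase (high loss) if it aligns with a high-curvature eigendirection, which is why the abstract targets the largest Hessian eigenvalue rather than the error norm alone.

Below is a minimal, hedged sketch of the generic rotate-then-quantize idea that rotation-based PTQ methods (SpinQuant, QuaRot, and, per the abstract, HeRo-Q) build on. The random orthogonal matrix, the round-to-nearest quantizer, and all function names are illustrative assumptions; the paper's learnable rotation-compression matrix and its training objective are not reproduced here.

```python
import torch


def random_orthogonal(n: int, seed: int = 0) -> torch.Tensor:
    """Illustrative stand-in for a learnable rotation: a random orthogonal matrix from QR."""
    g = torch.Generator().manual_seed(seed)
    a = torch.randn(n, n, generator=g)
    q, r = torch.linalg.qr(a)
    # Fix column signs so Q is drawn uniformly from the orthogonal group.
    q = q * torch.sign(torch.diag(r))
    return q


def quantize_symmetric(w: torch.Tensor, bits: int = 3) -> torch.Tensor:
    """Naive per-tensor symmetric round-to-nearest quantization (not the paper's quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale


def rotate_then_quantize(w: torch.Tensor, bits: int = 3) -> torch.Tensor:
    """Quantize in a rotated basis, then rotate back.

    Because R is orthogonal (R @ R.T = I), the layer's function is preserved up to
    quantization noise; rotating first spreads outliers across coordinates before rounding.
    """
    r = random_orthogonal(w.shape[1])
    w_rot = w @ r                        # move weights into the rotated space
    w_rot_q = quantize_symmetric(w_rot, bits)
    return w_rot_q @ r.T                 # map the quantized weights back


if __name__ == "__main__":
    w = torch.randn(16, 64)
    w[0, 0] = 25.0                       # inject an outlier weight
    err_plain = (quantize_symmetric(w, 3) - w).pow(2).mean()
    err_rot = (rotate_then_quantize(w, 3) - w).pow(2).mean()
    print(f"RTN MSE: {err_plain:.4f}  rotated RTN MSE: {err_rot:.4f}")
```

On a weight matrix with an injected outlier, the rotated round-to-nearest error is typically lower, which is the intuition behind rotating before quantizing; HeRo-Q additionally learns the rotation-compression matrix to shrink the top Hessian eigenvalue, which this sketch does not attempt.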
Related papers
- Astro: Activation-guided Structured Regularization for Outlier-Robust LLM Post-Training Quantization [56.5199302532159]
We propose an Activation-guided Structured Regularization framework to suppress the negative effects of outliers. Astro actively reconstructs intrinsically robust weights, aggressively suppressing weight outliers corresponding to high-magnitude activations. Astro achieves highly competitive performance; notably, on LLaMA-2-7B, it achieves better performance than complex learning-based rotation methods with almost 1/3 of the quantization time.
arXiv Detail & Related papers (2026-02-07T15:50:18Z)
- LoPRo: Enhancing Low-Rank Quantization via Permuted Block-Wise Rotation [6.797237769820339]
Post-training quantization (PTQ) enables effective model compression while preserving relatively high accuracy. We propose LoPRo, a novel fine-tuning-free PTQ algorithm that enhances residual matrix quantization. Experiments demonstrate that LoPRo outperforms existing fine-tuning-free PTQ methods at both 2-bit and 3-bit quantization.
arXiv Detail & Related papers (2026-01-27T14:56:04Z)
- End-to-End On-Device Quantization-Aware Training for LLMs at Inference Cost [53.25965863436039]
Quantization-aware training (QAT) provides a more principled solution, but its reliance on backpropagation incurs prohibitive memory costs. We propose ZeroQAT, a zeroth-order optimization-based QAT framework that supports both weight and activation quantization. Experiments show that ZeroQAT consistently outperforms representative PTQ and QAT baselines while requiring significantly less memory.
arXiv Detail & Related papers (2025-08-21T01:18:27Z)
- FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation [55.12070409045766]
Post-training quantization (PTQ) has stood out as a cost-effective and promising model compression paradigm in recent years. Current PTQ methods for Vision Transformers (ViTs) still suffer from significant accuracy degradation, especially under low-bit quantization.
arXiv Detail & Related papers (2025-06-13T07:57:38Z)
- Quantization Meets Reasoning: Exploring and Mitigating Degradation of Low-Bit LLMs in Mathematical Reasoning [39.56908863102256]
Low-bit post-training quantization impairs mathematical reasoning by up to 69.81% in harder settings. We address two deployment-critical questions with process-level precision. In our settings, as few as 332 curated examples and 3-5 minutes of compute on a single GPU recover 4-bit weight math reasoning toward the full-precision baseline.
arXiv Detail & Related papers (2025-05-16T12:11:40Z)
- PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models [64.84734437930362]
Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub-2-bit) quantization. We propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61-bit for the first time. Experiments indicate our PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization.
arXiv Detail & Related papers (2025-02-18T08:04:58Z)
- ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals [10.860081994662645]
Post-training quantization of large language models (LLMs) holds promise for reducing the prohibitive computational cost at inference time. We propose ResQ, a PTQ method that further pushes the state of the art. We demonstrate that ResQ outperforms recent uniform- and mixed-precision PTQ methods on a variety of benchmarks.
arXiv Detail & Related papers (2024-12-18T22:01:55Z)
- QERA: an Analytical Framework for Quantization Error Reconstruction [12.110441045050223]
There is increasing interest in quantizing weights to extremely low precision while offsetting the resulting error with low-rank, high-precision error reconstruction terms. The combination of quantization and low-rank approximation is now popular in adapter-based, parameter-efficient fine-tuning methods. We formulate an analytical framework, named Quantization Error Reconstruction Analysis (QERA), and offer a closed-form solution to the problem.
arXiv Detail & Related papers (2024-10-08T13:37:34Z)
- QT-DoG: Quantization-aware Training for Domain Generalization [58.439816306817306]
We propose Quantization-aware Training for Domain Generalization (QT-DoG). We demonstrate that weight quantization effectively leads to flatter minima in the loss landscape. QT-DoG exploits quantization as an implicit regularizer by inducing noise in model weights.
arXiv Detail & Related papers (2024-10-08T13:21:48Z)
- Towards Accurate Post-Training Quantization of Vision Transformers via Error Reduction [48.740630807085566]
Post-training quantization (PTQ) for vision transformers (ViTs) has received increasing attention from both academic and industrial communities. Current methods fail to account for the complex interactions between quantized weights and activations, resulting in significant quantization errors and suboptimal performance. This paper presents ERQ, an innovative two-step PTQ method specifically crafted to reduce quantization errors arising from activation and weight quantization sequentially.
arXiv Detail & Related papers (2024-07-09T12:06:03Z)