HESTIA: A Hessian-Guided Differentiable Quantization-Aware Training Framework for Extremely Low-Bit LLMs
- URL: http://arxiv.org/abs/2601.20745v1
- Date: Wed, 28 Jan 2026 16:22:42 GMT
- Title: HESTIA: A Hessian-Guided Differentiable Quantization-Aware Training Framework for Extremely Low-Bit LLMs
- Authors: Guoan Wang, Feiyu Wang, Zongwei Lv, Yikun Zong, Tong Yang,
- Abstract summary: Hestia is a Hessian-guided differentiable QAT framework for extremely low-bit models.<n>Hestia consistently outperforms existing ternary QAT baselines.
- Score: 7.58074574044232
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models (LLMs) continue to scale, deployment is increasingly bottlenecked by the memory wall, motivating a shift toward extremely low-bit quantization. However, most quantization-aware training (QAT) methods apply hard rounding and the straight-through estimator (STE) from the beginning of the training, which prematurely discretizes the optimization landscape and induces persistent gradient mismatch between latent weights and quantized weights, hindering effective optimization of quantized models. To address this, we propose Hestia, a Hessian-guided differentiable QAT framework for extremely low-bit LLMs, which replaces the rigid step function with a temperature-controlled softmax relaxation to maintain gradient flow early in training while progressively hardening quantization. Furthermore, Hestia leverages a tensor-wise Hessian trace metric as a lightweight curvature signal to drive fine-grained temperature annealing, enabling sensitivity-aware discretization across the model. Evaluations on Llama-3.2 show that Hestia consistently outperforms existing ternary QAT baselines, yielding average zero-shot improvements of 5.39% and 4.34% for the 1B and 3B models. These results indicate that Hessian-guided relaxation effectively recovers representational capacity, establishing a more robust training path for 1.58-bit LLMs. The code is available at https://github.com/hestia2026/Hestia.
Related papers
- Stabilizing Native Low-Rank LLM Pretraining [24.2079184778031]
Low-rank factorization offers a promising route to reduce training and inference costs.<n>We demonstrate that Large Language Models (LLMs) can be trained from scratch using exclusively low-rank factorized weights.<n>Our method enables stable, end-to-end factorized training with negligible overhead.
arXiv Detail & Related papers (2026-02-12T21:33:14Z) - CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training [73.46600457802693]
We introduce a new method that counteracts the loss induced by quantization.<n>CAGE significantly improves upon the state-of-theart methods in terms of accuracy, for similar computational cost.<n>For QAT pre-training of Llama models, CAGE matches the accuracy achieved at 4-bits (W4A4) with the prior best method.
arXiv Detail & Related papers (2025-10-21T16:33:57Z) - End-to-End On-Device Quantization-Aware Training for LLMs at Inference Cost [53.25965863436039]
Quantization-aware training (QAT) provides a more principled solution, but its reliance on backpropagation incurs prohibitive memory costs.<n>We propose ZeroQAT, a zeroth-order optimization-based QAT framework that supports both weight and activation quantization.<n>Experiments show that ZeroQAT consistently outperforms representative PTQ and QAT baselines while requiring significantly less memory.
arXiv Detail & Related papers (2025-08-21T01:18:27Z) - Fine-tuning Quantized Neural Networks with Zeroth-order Optimization [21.0540879091664]
We propose Quantized Zeroth-order Optimization (QZO), a simple yet effective approach that perturbs the continuous quantization scale for gradient estimation.<n>QZO can reduce the total memory cost by more than 18$times$ for 4-bit LLMs, and enables fine-tuning Llama-2-13B within a single 24GB GPU.
arXiv Detail & Related papers (2025-05-19T17:55:15Z) - Hyperspherical Normalization for Scalable Deep Reinforcement Learning [57.016639036237315]
SimbaV2 is a novel reinforcement learning architecture designed to stabilize optimization.<n>It scales up effectively with larger models and greater compute, achieving state-of-the-art performance on 57 continuous control tasks.
arXiv Detail & Related papers (2025-02-21T08:17:24Z) - RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE)<n>RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers.<n>Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z) - QuEST: Stable Training of LLMs with 1-Bit Weights and Activations [27.644652093888745]
QuEST is a new method for training sparse or quantized language models.<n>We show optimality at 4-bits and stable convergence as low as 1-bit weights and activations.<n> Experiments on Llama-type architectures show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions.
arXiv Detail & Related papers (2025-02-07T15:23:34Z) - GAQAT: gradient-adaptive quantization-aware training for domain generalization [54.31450550793485]
We propose a novel Gradient-Adaptive Quantization-Aware Training (GAQAT) framework for DG.<n>Our approach begins by identifying the scale-gradient conflict problem in low-precision quantization.<n>Extensive experiments validate the effectiveness of the proposed GAQAT framework.
arXiv Detail & Related papers (2024-12-07T06:07:21Z) - SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [63.118592279833656]
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs)<n>We propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths at the group-wise.<n> Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths.
arXiv Detail & Related papers (2024-05-23T16:21:48Z) - Error-aware Quantization through Noise Tempering [43.049102196902844]
Quantization-aware training (QAT) optimize model parameters with respect to the end task while simulating quantization error.
In this work, we incorporate exponentially decaying quantization-error-aware noise together with a learnable scale of task loss gradient to approximate the effect of a quantization operator.
Our method obtains state-of-the-art top-1 classification accuracy for uniform (non mixed-precision) quantization, out-performing previous methods by 0.5-1.2% absolute.
arXiv Detail & Related papers (2022-12-11T20:37:50Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.