Stabilizing Quantization-Aware Training by Implicit-Regularization on Hessian Matrix
- URL: http://arxiv.org/abs/2503.11159v1
- Date: Fri, 14 Mar 2025 07:56:20 GMT
- Title: Stabilizing Quantization-Aware Training by Implicit-Regularization on Hessian Matrix
- Authors: Junbiao Pang, Tianyang Cai,
- Abstract summary: We find that the sharp landscape of loss, which leads to a dramatic performance drop, is an essential factor that causes instability.<n>We propose Feature-Perturbed Quantization (FPQ) to generalize and employ the feature distillation method to the quantized model.
- Score: 0.7261171488281837
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quantization-Aware Training (QAT) is one of the prevailing neural network compression solutions. However, its stability has been challenged for yielding deteriorating performances as the quantization error is inevitable. We find that the sharp landscape of loss, which leads to a dramatic performance drop, is an essential factor that causes instability. Theoretically, we have discovered that the perturbations in the feature would bring a flat local minima. However, simply adding perturbations into either weight or feature empirically deteriorates the performance of the Full Precision (FP) model. In this paper, we propose Feature-Perturbed Quantization (FPQ) to stochastically perturb the feature and employ the feature distillation method to the quantized model. Our method generalizes well to different network architectures and various QAT methods. Furthermore, we mathematically show that FPQ implicitly regularizes the Hessian norm, which calibrates the smoothness of a loss landscape. Extensive experiments demonstrate that our approach significantly outperforms the current State-Of-The-Art (SOTA) QAT methods and even the FP counterparts.
Related papers
- RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE)
RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers.
Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z) - Oscillations Make Neural Networks Robust to Quantization [0.16385815610837165]
We show that oscillations in Quantization Aware Training (QAT) are undesirable artifacts caused by the Straight-Through Estimator (STE)<n>We propose a novel regularization method that induces oscillations to improve quantization.
arXiv Detail & Related papers (2025-02-01T16:39:58Z) - GAQAT: gradient-adaptive quantization-aware training for domain generalization [54.31450550793485]
We propose a novel Gradient-Adaptive Quantization-Aware Training (GAQAT) framework for DG.<n>Our approach begins by identifying the scale-gradient conflict problem in low-precision quantization.<n>Extensive experiments validate the effectiveness of the proposed GAQAT framework.
arXiv Detail & Related papers (2024-12-07T06:07:21Z) - DopQ-ViT: Towards Distribution-Friendly and Outlier-Aware Post-Training Quantization for Vision Transformers [2.0862654518798034]
We propose a Distribution-Friendly and Outlier-Aware Post-training Quantization method for Vision Transformers.
DopQ-ViT analyzes the inefficiencies of current quantizers and introduces a distribution-friendly Tan Quantizer called TanQ.
DopQ-ViT has been extensively validated and significantly improves the performance of quantization models.
arXiv Detail & Related papers (2024-08-06T16:40:04Z) - Regulating Model Reliance on Non-Robust Features by Smoothing Input Marginal Density [93.32594873253534]
Trustworthy machine learning requires meticulous regulation of model reliance on non-robust features.
We propose a framework to delineate and regulate such features by attributing model predictions to the input.
arXiv Detail & Related papers (2024-07-05T09:16:56Z) - Improving Parameter Training for VQEs by Sequential Hamiltonian Assembly [4.646930308096446]
A central challenge in quantum machine learning is the design and training of parameterized quantum circuits (PQCs)
We propose a Sequential Hamiltonian Assembly, which iteratively approximates the loss function using local components.
Our approach outperforms conventional parameter training by 29.99% and the empirical state of the art, Layerwise Learning, by 5.12% in the mean accuracy.
arXiv Detail & Related papers (2023-12-09T11:47:32Z) - Knowledge Distillation Performs Partial Variance Reduction [93.6365393721122]
Knowledge distillation is a popular approach for enhancing the performance of ''student'' models.
The underlying mechanics behind knowledge distillation (KD) are still not fully understood.
We show that KD can be interpreted as a novel type of variance reduction mechanism.
arXiv Detail & Related papers (2023-05-27T21:25:55Z) - Sharpness-aware Quantization for Deep Neural Networks [45.150346855368]
Sharpness-Aware Quantization (SAQ) is a novel method to explore the effect of Sharpness-Aware Minimization (SAM) on model compression.
We show that SAQ improves the generalization performance of the quantized models, yielding the SOTA results in uniform quantization.
arXiv Detail & Related papers (2021-11-24T05:16:41Z) - Intriguing Properties of Input-dependent Randomized Smoothing [6.0887051533533265]
We present one concrete design of the smoothing variance function and test it on CIFAR10 and MNIST.
We show that input-dependent smoothing suffers from the curse of dimensionality, forcing the variance function to have low semi-elasticity.
We provide a theoretical and practical framework that enables the usage of input-dependent smoothing even in the presence of the curse of dimensionality.
arXiv Detail & Related papers (2021-10-11T15:50:49Z) - DAQ: Distribution-Aware Quantization for Deep Image Super-Resolution
Networks [49.191062785007006]
Quantizing deep convolutional neural networks for image super-resolution substantially reduces their computational costs.
Existing works either suffer from a severe performance drop in ultra-low precision of 4 or lower bit-widths, or require a heavy fine-tuning process to recover the performance.
We propose a novel distribution-aware quantization scheme (DAQ) which facilitates accurate training-free quantization in ultra-low precision.
arXiv Detail & Related papers (2020-12-21T10:19:42Z) - Role of Orthogonality Constraints in Improving Properties of Deep
Networks for Image Classification [8.756814963313804]
We propose an Orthogonal Sphere (OS) regularizer that emerges from physics-based latent-representations under simplifying assumptions.
Under further simplifying assumptions, the OS constraint can be written in closed-form as a simple orthonormality term and be used along with the cross-entropy loss function.
We demonstrate the effectiveness of the proposed OS regularization by providing quantitative and qualitative results on four benchmark datasets.
arXiv Detail & Related papers (2020-09-22T18:46:05Z) - Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
empirical optimization is central to modern machine learning, but its role in its success is still unclear.
We show that it commonly arises in parameters of discrete multiplicative noise due to variance.
A detailed analysis is conducted in which we describe on key factors, including recent step size, and data, all exhibit similar results on state-of-the-art neural network models.
arXiv Detail & Related papers (2020-06-11T09:58:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.