Improving Quantization with Post-Training Model Expansion
- URL: http://arxiv.org/abs/2503.17513v1
- Date: Fri, 21 Mar 2025 19:56:59 GMT
- Title: Improving Quantization with Post-Training Model Expansion
- Authors: Giuseppe Franco, Pablo Monteagudo-Lago, Ian Colbert, Nicholas Fraser, Michaela Blott, et al.
- Abstract summary: Post-training model expansion is a viable strategy to improve model quality within a quantization co-design space. We show it is possible to progressively and selectively expand the size of a pre-trained large language model (LLM) to improve model quality without end-to-end retraining.
- Score: 0.35377121774178694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The size of a model has been a strong predictor of its quality, as well as its cost. As such, the trade-off between model cost and quality has been well-studied. Post-training optimizations like quantization and pruning have typically focused on reducing the overall volume of pre-trained models to reduce inference costs while maintaining model quality. However, recent advancements have introduced optimization techniques that, interestingly, expand models post-training, increasing model size to improve quality when reducing volume. For instance, to enable 4-bit weight and activation quantization, incoherence processing often necessitates inserting online Hadamard rotations in the compute graph, and preserving highly sensitive weights often calls for additional higher precision computations. However, if application requirements cannot be met, the prevailing solution is to relax quantization constraints. In contrast, we demonstrate post-training model expansion is a viable strategy to improve model quality within a quantization co-design space, and provide theoretical justification. We show it is possible to progressively and selectively expand the size of a pre-trained large language model (LLM) to improve model quality without end-to-end retraining. In particular, when quantizing the weights and activations to 4 bits for Llama3 1B, we reduce the zero-shot accuracy gap to full precision by an average of 3% relative to both QuaRot and SpinQuant with only 5% more parameters, which is still a 3.8% reduction in volume relative to a BF16 reference model.
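The two expansion mechanisms named in the abstract (online Hadamard rotations for incoherence processing, and higher-precision handling of highly sensitive weights) can be illustrated with a small sketch. The snippet below is a hypothetical illustration, not the paper's implementation: the absmax quantizer, the norm-based sensitivity proxy, and the choice to leave roughly 5% of columns in full precision are all assumptions made for the example.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix of size n (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)


def absmax_quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric absmax fake-quantization (an illustrative stand-in, not the paper's scheme)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax - 1, qmax) * scale


# Toy linear layer y = x @ W, with in_features a power of two.
rng = np.random.default_rng(0)
in_features, out_features = 64, 32
W = rng.normal(size=(in_features, out_features))
x = rng.normal(size=(1, in_features))

# Incoherence processing: rotate activations online with H and fold H^T into W.
# Because H is orthonormal, (x @ H) @ (H.T @ W) equals x @ W up to quantization error.
H = hadamard(in_features)
W_rot = H.T @ W

# Selective higher precision: keep the columns with the largest norm (a crude
# sensitivity proxy) unquantized in place; the paper's expansion instead adds parameters.
k = max(1, int(0.05 * out_features))
sensitive = np.argsort(np.linalg.norm(W_rot, axis=0))[-k:]
W_q = absmax_quantize(W_rot, bits=4)
W_q[:, sensitive] = W_rot[:, sensitive]

y_ref = x @ W
y_quant = (x @ H) @ W_q  # online rotation in the compute graph + quantized weights
print("relative error:", np.linalg.norm(y_quant - y_ref) / np.linalg.norm(y_ref))
```

The rotation tends to spread activation and weight outliers across coordinates, which generally makes 4-bit absmax quantization less lossy, while the handful of higher-precision columns preserves what the quantizer would otherwise destroy.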
Related papers
- Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics [65.37942405146232]
We present a novel type of optimizer with extremely lightweight state elements, achieved through ultra-low-precision quantization.
The proposed SOLO achieves substantial memory savings (approximately 45 GB when training a 7B model) with minimal accuracy loss.
arXiv Detail & Related papers (2025-05-01T06:47:45Z)
- Enhancing Ultra-Low-Bit Quantization of Large Language Models Through Saliency-Aware Partial Retraining [0.0]
Post-training quantization reduces model size efficiently at the cost of decreased accuracy.
Quantization-aware training better preserves accuracy but is resource-intensive.
We propose an ultra-low-bit quantization method that builds upon ApiQ and extends its performance without the need for full retraining.
arXiv Detail & Related papers (2025-04-14T19:31:21Z)
- FP=xINT: A Low-Bit Series Expansion Algorithm for Post-Training Quantization [3.560046736432574]
Post-Training Quantization (PTQ) converts pre-trained Full-Precision (FP) models into quantized versions without training.
Existing methods significantly degrade performance and quantization efficiency at extremely low settings due to quantization noise.
We introduce a deep model series expansion framework to address this issue, enabling rapid and accurate approximation of unquantized models without calibration sets or fine-tuning.
arXiv Detail & Related papers (2024-12-09T08:50:28Z)
- GAQAT: Gradient-Adaptive Quantization-Aware Training for Domain Generalization [54.31450550793485]
We propose a novel Gradient-Adaptive Quantization-Aware Training (GAQAT) framework for DG.
Our approach begins by identifying the scale-gradient conflict problem in low-precision quantization.
Extensive experiments validate the effectiveness of the proposed GAQAT framework.
arXiv Detail & Related papers (2024-12-07T06:07:21Z)
- Scaling Laws for Precision [73.24325358259753]
We devise "precision-aware" scaling laws for both training and inference.<n>For inference, we find that the degradation introduced by post-training quantization increases as models are trained on more data.<n>For training, our scaling laws allow us to predict the loss of a model with different parts in different precisions.
arXiv Detail & Related papers (2024-11-07T00:10:10Z)
- Taming 3DGS: High-Quality Radiance Fields with Limited Resources [50.92437599516609]
3D Gaussian Splatting (3DGS) has transformed novel-view synthesis with its fast, interpretable, and high-fidelity rendering.
We tackle the challenges of training and rendering 3DGS models on a budget.
We derive faster, numerically equivalent solutions for gradient computation and attribute updates.
arXiv Detail & Related papers (2024-06-21T20:44:23Z)
- Hyperspherical Quantization: Toward Smaller and More Accurate Models [17.154801913113566]
Vector quantization aims at reducing the model size by indexing model weights with full-precision embeddings.
Binary and other low-precision quantization methods can reduce the model size by up to 32×, however at the cost of a considerable accuracy drop.
We propose an efficient framework for ternary quantization to produce smaller and more accurate compressed models.
arXiv Detail & Related papers (2022-12-24T04:42:15Z)
- SQuAT: Sharpness- and Quantization-Aware Training for BERT [43.049102196902844]
We propose sharpness- and quantization-aware training (SQuAT)
Our method can consistently outperform state-of-the-art quantized BERT models under 2, 3, and 4-bit settings by 1%.
Our experiments on empirical measurement of sharpness also suggest that our method would lead to flatter minima compared to other quantization methods.
arXiv Detail & Related papers (2022-10-13T16:52:19Z)
- HERO: Hessian-Enhanced Robust Optimization for Unifying and Improving Generalization and Quantization Performance [43.478851400266926]
We propose HERO, a Hessian-enhanced robust optimization method, to minimize the Hessian eigenvalues through a gradient-based training process.
HERO enables up to a 3.8% gain on test accuracy, up to 30% higher accuracy under 80% training label perturbation, and the best post-training quantization accuracy across a wide range of precision.
arXiv Detail & Related papers (2021-11-23T16:32:58Z)
- Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech Recognition [65.7040645560855]
We propose Q-ASR, an integer-only, zero-shot quantization scheme for ASR models.
We show negligible WER change as compared to the full-precision baseline models.
Q-ASR exhibits a large compression rate of more than 4x with small WER degradation.
arXiv Detail & Related papers (2021-03-31T06:05:40Z)
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
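Several of the entries above (Training with Quantization Noise, SQuAT, GAQAT) rely on quantization-aware training with the Straight-Through Estimator. As a reference point, a minimal STE sketch in PyTorch might look like the following; the module name, absmax scaling, and toy loss are assumptions for illustration, not taken from any of the listed papers.

```python
import torch
import torch.nn as nn


class STEQuantLinear(nn.Module):
    """Hypothetical linear layer with straight-through fake weight quantization."""

    def __init__(self, in_features: int, out_features: int, bits: int = 4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.qmax = 2 ** (bits - 1) - 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scale = self.weight.abs().max() / self.qmax
        w_q = torch.round(self.weight / scale).clamp(-self.qmax - 1, self.qmax) * scale
        # Straight-through estimator: the forward pass uses quantized weights,
        # the backward pass treats quantization as the identity.
        w_ste = self.weight + (w_q - self.weight).detach()
        return x @ w_ste.t()


# Toy training step: gradients flow to the full-precision weights.
layer = STEQuantLinear(16, 8, bits=4)
x = torch.randn(4, 16)
loss = layer(x).pow(2).mean()
loss.backward()
print(layer.weight.grad.shape)  # torch.Size([8, 16])
```

The detach trick lets the forward pass see quantized weights while gradients update the full-precision copy, which is the core of most QAT recipes.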
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.