Training Dynamics Impact Post-Training Quantization Robustness
- URL: http://arxiv.org/abs/2510.06213v1
- Date: Tue, 07 Oct 2025 17:59:07 GMT
- Title: Training Dynamics Impact Post-Training Quantization Robustness
- Authors: Albert Catalan-Tatjer, Niccolò Ajroldi, Jonas Geiping
- Abstract summary: Post-training quantization is widely adopted for efficient deployment of large language models. We conduct a comprehensive analysis of quantization degradation across open-source language model training trajectories up to 32B parameters and 15T training tokens.
- Score: 31.536101256063684
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While post-training quantization is widely adopted for efficient deployment of large language models, the mechanisms underlying quantization robustness remain unclear. We conduct a comprehensive analysis of quantization degradation across open-source language model training trajectories up to 32B parameters and 15T training tokens to accurately assess the relationship between training dynamics and quantization performance. Our key finding is that quantization errors in large-scale training runs are driven by a complex interplay between learning rate and other training hyperparameters. Specifically, once learning rates decay, validation loss and quantization error diverge, largely independent of training data scale. To investigate interventions on the training dynamics and identify specific configurations that can modulate quantization robustness favorably, we train our own models in controlled experiments up to 100B tokens. Our results challenge the assumption that increasing dataset scale inherently compromises quantization effectiveness, demonstrating instead that strategic training hyperparameter interventions can improve quantization quality at scale.
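The core measurement behind these claims is the gap between a checkpoint's validation loss before and after post-training quantization, tracked along the training trajectory. As a hedged illustration (not the authors' exact protocol), the sketch below quantizes linear-layer weights with simple round-to-nearest int4 and reports the loss gap per checkpoint; `load_checkpoint` and `eval_loss` are hypothetical helpers standing in for a real checkpoint loader and validation-loss evaluator.

```python
# Hedged sketch: track post-training quantization degradation across
# checkpoints of a training run. Round-to-nearest (RTN) weight-only
# quantization is used for simplicity; the paper's exact PTQ scheme,
# checkpoint steps, and evaluation setup may differ.
import copy
import torch

def quantize_rtn(model: torch.nn.Module, bits: int = 4) -> torch.nn.Module:
    """Symmetric per-output-channel RTN quantization of Linear weights."""
    qmodel = copy.deepcopy(model)
    qmax = 2 ** (bits - 1) - 1
    for module in qmodel.modules():
        if isinstance(module, torch.nn.Linear):
            w = module.weight.data
            scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
            module.weight.data = torch.round(w / scale).clamp(-qmax, qmax) * scale
    return qmodel

# load_checkpoint and eval_loss are hypothetical stand-ins.
for step in (1_000, 10_000, 100_000):
    model = load_checkpoint(step)               # full-precision checkpoint
    loss_fp = eval_loss(model)                  # validation loss, full precision
    loss_q = eval_loss(quantize_rtn(model))     # validation loss, int4 RTN
    print(f"step {step}: val loss {loss_fp:.4f}, quant error {loss_q - loss_fp:+.4f}")
```

Plotting the quantization error `loss_q - loss_fp` against training step is what would surface the divergence the abstract describes once the learning rate decays.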
Related papers
- Scaling Laws for Precision in High-Dimensional Linear Regression [38.87908801454087]
We study scaling laws for low-precision training within a high-dimensional sketched linear regression framework. By analyzing multiplicative and additive quantization, we identify a critical dichotomy in their scaling behaviors. Our work provides a theoretical basis for optimizing training protocols under practical hardware constraints.
arXiv Detail & Related papers (2026-02-22T15:51:29Z) - Learning under Quantization for High-Dimensional Linear Regression [34.214978824165236]
Low-bit quantization has emerged as an indispensable technique for enabling the efficient training of large-scale models. Despite its widespread empirical success, a rigorous theoretical understanding of its impact on learning performance remains notably absent. We present the first systematic theoretical study of this fundamental question, analyzing finite-step stochastic gradient descent (SGD) for high-dimensional linear regression.
arXiv Detail & Related papers (2025-10-21T03:30:11Z) - Loss Behavior in Supervised Learning with Entangled States [36.30006416492033]
Entanglement with an auxiliary system was shown to increase the quality of quantum machine learning (QML) models in applications such as supervised learning. Recent works focus on the information that can be extracted from entangled training samples and their effect on the approximation error of the trained model. Results on the trainability of QML models show that the training process itself is affected by various properties of the supervised learning task.
arXiv Detail & Related papers (2025-09-12T11:09:24Z) - Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining [55.262510814326035]
Existing reweighting strategies primarily focus on group-level data importance. We introduce novel algorithms for dynamic, instance-level data reweighting. Our framework allows us to devise reweighting strategies that deprioritize redundant or uninformative data.
arXiv Detail & Related papers (2025-02-10T17:57:15Z) - Transferable Post-training via Inverse Value Learning [83.75002867411263]
We propose modeling changes at the logits level during post-training using a separate neural network (i.e., the value network). After training this network on a small base model using demonstrations, it can be seamlessly integrated with other pre-trained models during inference. We demonstrate that the resulting value network has broad transferability across pre-trained models of different parameter sizes.
arXiv Detail & Related papers (2024-10-28T13:48:43Z) - Supervised learning for robust quantum control in composite-pulse systems [7.474008952791777]
We develop a supervised learning model for implementing robust quantum control in composite-pulse systems.
This model exhibits strong resistance to systematic errors of all kinds, including single, multiple, and time-varying errors.
This work provides a highly efficient learning model for fault-tolerant quantum computation by training various physical parameters.
arXiv Detail & Related papers (2023-08-23T01:37:13Z) - PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models [52.09865918265002]
We propose a novel "quantize before fine-tuning" framework, PreQuant.
PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error.
We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
arXiv Detail & Related papers (2023-05-30T08:41:33Z) - Post-Training Quantization for Vision Transformer [85.57953732941101]
We present an effective post-training quantization algorithm for reducing the memory storage and computational costs of vision transformers.
We obtain 81.29% top-1 accuracy with the DeiT-B model on the ImageNet dataset under roughly 8-bit quantization.
arXiv Detail & Related papers (2021-06-27T06:27:22Z) - Zero-shot Adversarial Quantization [11.722728148523366]
We propose a zero-shot adversarial quantization (ZAQ) framework, facilitating effective discrepancy estimation and knowledge transfer.
This is achieved by a novel two-level discrepancy modeling to drive a generator to synthesize informative and diverse data examples.
We conduct extensive experiments on three fundamental vision tasks, demonstrating the superiority of ZAQ over the strong zero-shot baselines.
arXiv Detail & Related papers (2021-03-29T01:33:34Z) - Adaptive Quantization of Model Updates for Communication-Efficient Federated Learning [75.45968495410047]
Communication of model updates between client nodes and the central aggregating server is a major bottleneck in federated learning.
Gradient quantization is an effective way of reducing the number of bits required to communicate each model update.
We propose an adaptive quantization strategy called AdaFL that aims to achieve communication efficiency as well as a low error floor.
arXiv Detail & Related papers (2021-02-08T19:14:21Z) - Gradient $\ell_1$ Regularization for Quantization Robustness [70.39776106458858]
We derive a simple regularization scheme that improves robustness against post-training quantization (a hedged sketch follows this list).
By training quantization-ready networks, our approach enables storing a single set of weights that can be quantized on demand to different bit-widths.
arXiv Detail & Related papers (2020-02-18T12:31:34Z)
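The last entry above suggests a mechanism that is easy to sketch: penalizing the ℓ1 norm of the loss gradient with respect to the weights bounds the first-order change in loss under a small quantization-induced weight perturbation. Below is a minimal, hedged rendition using double backpropagation; the penalty weight `lam` and the usage outline are illustrative, not the paper's exact recipe.

```python
# Hedged sketch of gradient L1 regularization for quantization robustness.
# Penalizing ||dL/dw||_1 limits how much a small weight perturbation
# (such as quantization noise) can change the loss to first order.
import torch

def regularized_loss(model, loss_fn, x, y, lam=1e-4):
    loss = loss_fn(model(x), y)
    params = [p for p in model.parameters() if p.requires_grad]
    # create_graph=True so the gradient penalty is itself differentiable
    grads = torch.autograd.grad(loss, params, create_graph=True)
    penalty = sum(g.abs().sum() for g in grads)
    return loss + lam * penalty

# Usage inside a standard training step (model, optimizer, data assumed):
# loss = regularized_loss(model, torch.nn.functional.cross_entropy, x, y)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```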