Training with Fewer Bits: Unlocking Edge LLMs Training with Stochastic Rounding
- URL: http://arxiv.org/abs/2511.00874v1
- Date: Sun, 02 Nov 2025 09:49:34 GMT
- Title: Training with Fewer Bits: Unlocking Edge LLMs Training with Stochastic Rounding
- Authors: Taowen Liu, Marta Andronic, Deniz Gündüz, George A. Constantinides
- Abstract summary: Quantized training improves computational and memory efficiency but introduces quantization noise. We show that increased batch sizes can compensate for reduced precision during back-propagation. We also show that quantizing weights and activations impacts gradient variance in distinct ways.
- Score: 37.30928503608494
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: LLM training is resource-intensive. Quantized training improves computational and memory efficiency but introduces quantization noise, which can hinder convergence and degrade model accuracy. Stochastic Rounding (SR) has emerged as a theoretically attractive alternative to deterministic rounding, offering unbiased gradient estimates. However, its interaction with other training factors -- especially batch size -- remains underexplored. In this paper, we present a theoretical and empirical study of mini-batch stochastic gradient descent (SGD) with SR, showing that increased batch sizes can compensate for reduced precision during back-propagation. Furthermore, we show that quantizing weights and activations impacts gradient variance in distinct ways. Our experiments validate these theoretical insights.
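The mechanism at the center of the paper, stochastic rounding, quantizes a value to one of its two neighboring grid points with probability proportional to proximity, so the quantized value is unbiased in expectation. A minimal PyTorch-style sketch of generic SR quantization (our illustration; the function name and the per-tensor scaling choice are ours, not the paper's):

```python
import torch

def stochastic_round(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Quantize x onto a uniform grid using stochastic rounding (SR).

    A value between two grid points is rounded up with probability equal
    to its fractional distance from the lower point, so E[SR(x)] = x.
    """
    levels = 2 ** (num_bits - 1) - 1             # symmetric signed grid
    step = x.abs().max().clamp(min=1e-12) / levels
    scaled = x / step
    lower = torch.floor(scaled)
    prob_up = scaled - lower                     # fractional part = P(round up)
    rounded = lower + (torch.rand_like(x) < prob_up).float()
    return rounded * step

# Unbiasedness check: averaging many SR draws recovers the input.
x = torch.randn(4)
mean = torch.stack([stochastic_round(x, num_bits=4) for _ in range(20_000)]).mean(0)
print(torch.allclose(mean, x, atol=1e-2))        # True with high probability
```

Because the SR error is zero-mean, it adds variance rather than bias to the stochastic gradient, which is exactly why averaging over a larger mini-batch can compensate for reduced precision.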
Related papers
- CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs [53.749193998004166]
Curriculum learning plays a crucial role in enhancing the training efficiency of large language models. We propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead.
arXiv Detail & Related papers (2025-10-01T15:41:27Z)
- Reinforcement learning entangling operations on spin qubits [0.0]
We present a reinforcement learning approach to find entangling protocols for semiconductor-based singlet-triplet qubits in a double quantum dot. We demonstrate that an RL agent can yield performant protocols while avoiding the model biases of traditional gradient-based methods.
arXiv Detail & Related papers (2025-08-20T15:05:38Z)
- RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE), which combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
- Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners [51.32182730502002]
We introduce Singular-value Diagonal Expansion to refine weight distributions to achieve better quantization alignment. Our plug-and-play weight-quantization methods demonstrate substantial performance improvements over state-of-the-art approaches.
arXiv Detail & Related papers (2024-07-22T09:45:16Z) - Improved Quantization Strategies for Managing Heavy-tailed Gradients in
Distributed Learning [20.91559450517002]
It is observed that gradient distributions are heavy-tailed, with outliers significantly influencing the design of compression strategies. Existing parameter quantization methods experience performance degradation when this heavy-tailed feature is ignored. We introduce a novel compression scheme specifically engineered for heavy-tailed gradients, which effectively combines truncation with quantization, as sketched below.
arXiv Detail & Related papers (2024-02-02T06:14:31Z) - A Theoretical Explanation of Activation Sparsity through Flat Minima and
- A Theoretical Explanation of Activation Sparsity through Flat Minima and Adversarial Robustness [29.87592869483743]
A recent empirical observation of activation sparsity in MLP blocks offers an opportunity to drastically reduce computation costs for free. We propose the notion of gradient sparsity as one source of activation sparsity and a theoretical explanation based on it.
arXiv Detail & Related papers (2023-09-06T13:48:40Z) - Noise and Fluctuation of Finite Learning Rate Stochastic Gradient
Descent [3.0079490585515343]
Stochastic gradient descent (SGD) is relatively well understood in the vanishing learning rate regime. We propose to study the basic properties of SGD and its variants in the non-vanishing learning rate regime.
arXiv Detail & Related papers (2020-12-07T12:31:43Z) - Reintroducing Straight-Through Estimators as Principled Methods for
- Reintroducing Straight-Through Estimators as Principled Methods for Stochastic Binary Networks [85.94999581306827]
Training neural networks with binary weights and activations is a challenging problem due to the lack of gradients and the difficulty of optimization over discrete weights. Many successful experimental results have been achieved with empirical straight-through (ST) approaches. At the same time, ST methods can be truly derived as estimators in the stochastic binary network (SBN) model with Bernoulli weights.
arXiv Detail & Related papers (2020-06-11T23:58:18Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.