ECO: Quantized Training without Full-Precision Master Weights
- URL: http://arxiv.org/abs/2601.22101v1
- Date: Thu, 29 Jan 2026 18:35:01 GMT
- Title: ECO: Quantized Training without Full-Precision Master Weights
- Authors: Mahdi Nikdan, Amir Zandieh, Dan Alistarh, Vahab Mirrokni
- Abstract summary: The Error-Compensating Optimizer (ECO) eliminates master weights by applying updates directly to quantized parameters. We show that ECO converges to a constant-radius neighborhood of the optimum, while naive master-weight removal can incur an error that is inversely proportional to the learning rate.
- Score: 58.97082407934466
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quantization has significantly improved the compute and memory efficiency of Large Language Model (LLM) training. However, existing approaches still rely on accumulating updates in high precision: concretely, gradient updates must be applied to a high-precision weight buffer, known as $\textit{master weights}$. This buffer introduces substantial memory overhead, particularly for Sparse Mixture of Experts (SMoE) models, where model parameters and optimizer states dominate memory usage. To address this, we introduce the Error-Compensating Optimizer (ECO), which eliminates master weights by applying updates directly to quantized parameters. ECO quantizes weights after each step and carefully injects the resulting quantization error into the optimizer momentum, forming an error-feedback loop with no additional memory. We prove that, under standard assumptions and a decaying learning rate, ECO converges to a constant-radius neighborhood of the optimum, while naive master-weight removal can incur an error that is inversely proportional to the learning rate. We show empirical results for pretraining small Transformers (30-800M), a Gemma-3 1B model, and a 2.1B-parameter Sparse MoE model with FP8 quantization, and for fine-tuning DeepSeek-MoE-16B in INT4 precision. Throughout, ECO matches baselines with master weights up to near-lossless accuracy, significantly shifting the static-memory versus validation-loss Pareto frontier.
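The abstract describes the mechanism concretely enough to sketch: quantize the weights immediately after each step and fold the resulting quantization error back into the momentum buffer. Below is a minimal, hypothetical PyTorch sketch assuming plain SGD with momentum, a uniform symmetric quantizer as a stand-in for FP8, and a simple additive injection rule scaled by 1/lr; these specifics are illustrative assumptions, not the paper's exact algorithm.

```python
import torch

def fake_quantize(w: torch.Tensor, num_levels: int = 256) -> torch.Tensor:
    # Uniform symmetric quantizer, used here only as a stand-in for FP8.
    scale = w.abs().max().clamp(min=1e-8) / (num_levels / 2 - 1)
    return torch.round(w / scale) * scale

def eco_style_step(q_weight, momentum, grad, lr=1e-2, beta=0.9):
    # One ECO-style update for a single parameter tensor: no persistent
    # full-precision master weights are kept between steps.
    momentum.mul_(beta).add_(grad)      # standard heavy-ball momentum
    w_hp = q_weight - lr * momentum     # high precision exists only transiently
    q_next = fake_quantize(w_hp)        # re-quantize the weights right away
    q_error = q_next - w_hp             # error introduced by the quantizer
    momentum.add_(q_error / lr)         # inject the error into momentum so the
                                        # next step compensates it (assumed
                                        # injection rule; the paper's may differ)
    q_weight.copy_(q_next)              # only quantized weights are stored
```

Because the error is written into the existing momentum buffer rather than a separate residual tensor, the error-feedback loop adds no extra state, which is consistent with the abstract's "no additional memory" claim.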
Related papers
- SNIP: An Adaptive Mixed Precision Framework for Subbyte Large Language Model Training [5.341188930460575]
Current mixed-precision training approaches either apply uniform precision to all GEMM operations or rely on methods that fail to generalize during training. This paper introduces SNIP, a fine-grained adaptive mixed-precision training framework for LLM pretraining that supports subbyte precision. Experiments on 1B, 3B, 7B and 70B Llama-like models demonstrate that SNIP consistently outperforms existing baselines.
arXiv Detail & Related papers (2026-02-01T19:34:27Z) - A Convergence Analysis of Adaptive Optimizers under Floating-point Quantization [32.97211471008323]
We introduce the first theoretical framework for the convergence of adaptive optimizers, including Adam and Muon, under floating-point quantization of gradients, weights, and optimizer states. We show that both algorithms retain convergence rates close to their full-precision counterparts provided the mantissa length scales only logarithmically with the number of iterations. Our analysis further reveals that Adam is highly sensitive to weight and second-moment quantization due to its reliance on $\beta \to 1$, while Muon requires weaker error control and is thus potentially more robust.
arXiv Detail & Related papers (2025-10-24T10:16:23Z) - Beyond Outliers: A Study of Optimizers Under Quantization [82.75879062804955]
We study the impact of optimizer choice on model robustness under quantization. We evaluate how model performance degrades when models are trained with different optimizer baselines. We derive scaling laws for quantization-aware training under different optimizers.
arXiv Detail & Related papers (2025-09-27T21:15:22Z) - Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics [64.62231094774211]
Stateful optimizers (e.g., Adam) maintain auxiliary information as large as 2x the model size in order to achieve optimal convergence. SOLO enables Adam-style optimizers to maintain quantized states with precision as low as 3 bits, or even 2 bits. SOLO can thus be seamlessly applied to Adam-style optimizers, leading to substantial memory savings with minimal accuracy loss.
arXiv Detail & Related papers (2025-05-01T06:47:45Z) - QPruner: Probabilistic Decision Quantization for Structured Pruning in Large Language Models [3.093903491123962]
Large language models (LLMs) have significantly advanced various natural language processing (NLP) tasks. Structured pruning is an effective approach to reducing model size, but it often results in significant accuracy degradation. We introduce quantization into the structured pruning framework to reduce memory consumption during both fine-tuning and inference. We propose QPruner, a novel framework that employs structured pruning to reduce model size, followed by a layer-wise mixed-precision quantization scheme.
arXiv Detail & Related papers (2024-12-16T10:14:01Z) - Direct Quantized Training of Language Models with Stochastic Rounding [8.442358264368693]
Experimental results on LLaMA-structured models of various sizes indicate that training with low-precision weights is feasible even when constrained to ternary values. Our models remain robust to precision scaling and memory reduction, showing minimal performance degradation when moving from FP32 to lower-memory environments. (A minimal sketch of the underlying stochastic-rounding primitive follows the related-papers list below.)
arXiv Detail & Related papers (2024-12-06T05:41:11Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
arXiv Detail & Related papers (2023-01-31T02:46:57Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
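The entry for "Direct Quantized Training of Language Models with Stochastic Rounding" above names its core primitive without showing it. As a reference point, here is a minimal sketch of generic stochastic rounding to a uniform grid; the grid step, the use of PyTorch, and the function name are illustrative assumptions, and that paper's specific ternary scheme and scaling are not reproduced.

```python
import torch

def stochastic_round(x: torch.Tensor, step: float) -> torch.Tensor:
    # Round each entry of x to a multiple of `step`, choosing the upper grid
    # point with probability equal to the fractional distance, so that the
    # rounding is unbiased: E[stochastic_round(x)] == x.
    scaled = x / step
    lower = torch.floor(scaled)
    frac = scaled - lower                      # in [0, 1)
    round_up = torch.rand_like(frac) < frac    # Bernoulli(frac)
    return (lower + round_up.to(x.dtype)) * step
```

For example, `stochastic_round(w, step=2.0**-6)` snaps weights to a 1/64 grid while keeping low-precision gradient-descent updates unbiased in expectation, which is what makes direct training on quantized weights viable.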