Related papers: Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales

Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales

URL: http://arxiv.org/abs/2512.05620v1
Date: Fri, 05 Dec 2025 11:03:41 GMT
Title: Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales
Authors: Shikai Qiu, Zixi Chen, Hoang Phan, Qi Lei, Andrew Gordon Wilson,
Abstract summary: We study how the optimal learning rate and weight decay should scale with model width and depth for a wide range of languages.<n>We find that scaling the learning rate according to $$P improves transfer, but can still suffer from significant finite-width deviations.<n>For compute-optimal scaling, we find scaling independent weight decay as $1/mathrmwidth$ is nearly optimal across languages.
Score: 55.91454326946738
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Several recently introduced deep learning optimizers utilizing matrix-level preconditioning have shown promising speedups relative to the current dominant optimizer AdamW, particularly in relatively small-scale experiments. However, efforts to validate and replicate their successes have reported mixed results. To better understand the effectiveness of these optimizers at scale, in this work we investigate how to scale preconditioned optimizers via hyperparameter transfer, building on prior works such as $μ$P. We study how the optimal learning rate and weight decay should scale with model width and depth for a wide range of optimizers, including Shampoo, SOAP, and Muon, accounting for the impact of commonly used techniques such as blocking and grafting. We find that scaling the learning rate according to $μ$P improves transfer, but can still suffer from significant finite-width deviations that cause drifting optimal learning rates, which we show can be mitigated by blocking and explicit spectral normalization. For compute-optimal scaling, we find scaling independent weight decay as $1/\mathrm{width}$ is nearly optimal across optimizers. Applying these scaling rules, we show Muon and Shampoo consistently achieve $1.4\times$ and $1.3\times$ speedup over AdamW for training Llama-architecture language models of sizes ranging from $190$M to $1.4$B, whereas the speedup vanishes rapidly with scale under incorrect scaling. Based on these results and further ablations, we argue that studying optimal hyperparameter transfer is essential for reliably comparing optimizers at scale given a realistic tuning budget.

Related papers

Extending $μ$P: Spectral Conditions for Feature Learning Across Optimizers [3.5708391029226885]
We propose a novel framework to derive $$P for a broader class of derivations, including AdamW, AD, LAMB, Sophia, Shampoo and Muon.<n>We implement our $$Ps on multiple benchmark models and demonstrate zero-shot learning rate transfer across increasing model width.
arXiv Detail & Related papers (2026-02-24T14:17:51Z)
Towards Robust Scaling Laws for Optimizers [89.21160945066737]
Empirical scaling laws are widely used to predict loss as model size and training data grow.<n>We show that Chinchilla-style scaling laws emerge naturally as a result of loss decomposition into irreducible, approximation, and optimization errors.
arXiv Detail & Related papers (2026-02-07T21:40:33Z)
Robust Layerwise Scaling Rules by Proper Weight Decay Tuning [50.11170157029911]
In modern scale-invariant architectures, training quickly enters an degrading-governed steady state.<n>We introduce a weight-decay scaling rule for AdamW that preserves sublayer gain across widths.<n>Our results extend $mu$P beyond the near-init regime by explicitly controlling the steady-state scales set by parameters.
arXiv Detail & Related papers (2025-10-17T02:58:35Z)
Optimal Scaling Needs Optimal Norm [1.8180584498244492]
Joint optimal scaling across model and dataset sizes is governed by a single invariant.<n>Across models with up to 1.3B parameters trained on up to 138B tokens, the optimal learning rate/batch size pair $(etaast, Bast)$ consistently has the same operator norm value.
arXiv Detail & Related papers (2025-10-04T16:48:36Z)
Fantastic Pretraining Optimizers and Where to Find Them [59.56075036649332]
AdamW has long been the dominant gradients in language model pretraining.<n>Speedup of matrix-based matrices is inversely proportional to model scale.
arXiv Detail & Related papers (2025-09-02T07:43:22Z)
Predictable Scale: Part I, Step Law -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining [59.369484219304866]
We conduct an unprecedented empirical investigation training over 3,700 Large Language Models (LLMs) from scratch across 100 trillion tokens.<n>We establish a universal Scaling Law for hyperparameter optimization in LLM Pre-training, called Step Law.<n>Our estimated optima deviates from the global best performance found via exhaustive search by merely 0.094% on the test set.
arXiv Detail & Related papers (2025-03-06T18:58:29Z)
Time Transfer: On Optimal Learning Rate and Batch Size In The Infinite Data Limit [1.8337746049048673]
We show an intricate dependence of optimal $eta$ scaling on the pretraining token budget $T$, $B$ and its relation to the critical batch size $B_mathrmcrit$.<n>Surprisingly, our results demonstrate that the observed optimal $eta$ and $B$ dynamics are preserved with $mu$P model scaling.
arXiv Detail & Related papers (2024-10-08T09:06:34Z)
AutoScale: Scale-Aware Data Mixing for Pre-Training LLMs [59.12061830645018]
We show that data mixtures that perform well at smaller scales may not retain their advantage at larger scales.<n>We propose AutoScale, a two-stage, scale-aware data composition framework.
arXiv Detail & Related papers (2024-07-29T17:06:30Z)
Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling [27.058009599819012]
We study the connection between optimal learning rates and batch sizes for Adam styles. We prove that the optimal learning rate first rises and then falls as the batch size increases.
arXiv Detail & Related papers (2024-05-23T13:52:36Z)
AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models. AdaLomo results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.