Towards a Principled Muon under $μ\mathsf{P}$: Ensuring Spectral Conditions throughout Training
- URL: http://arxiv.org/abs/2601.01306v2
- Date: Tue, 06 Jan 2026 22:53:02 GMT
- Title: Towards a Principled Muon under $μ\mathsf{P}$: Ensuring Spectral Conditions throughout Training
- Authors: John Zhao,
- Abstract summary: We show how to reliably guarantee the spectral conditions required by $μ$P for large language model (LLM) training. We develop a variant of Muon, namely Muon++, that satisfies the spectral conditions throughout the training process.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The $μ$-parameterization ($μ$P) provides a principled foundation for large language model (LLM) training by prescribing width-independent learning dynamics, which in turn enables predictable scaling behavior and robust hyperparameter transfer across model sizes. A central requirement of $μ$P is the satisfaction of certain spectral conditions on weight matrices, which ensure consistent feature learning and optimization behavior as model width grows. While these conditions are well understood in theory, guaranteeing their validity in practical training for matrix-based optimizers such as Muon remains understudied. Existing works that study Muon under $μ$P exhibit important limitations: they either do not ensure that the spectral conditions hold throughout the entire training horizon, or require repeated spectral normalization (or Newton-Schulz iterations) applied to both weights and updates, leading to significant computational overhead and reduced practicality. In this work, we show how to reliably guarantee the spectral conditions required by $μ$P for Muon during the entire training process. Our key insight is that for moderately large models, maintaining spectral control at the level of optimizer updates alone is sufficient to preserve $μ$P-compatible scaling, eliminating the need for explicit spectral normalization of the weights. Based on this principle, we develop a variant of Muon, namely Muon++, that satisfies the spectral conditions throughout the training process. Our results bridge the gap between the theoretical promises of $μ$P and the practical deployment of matrix-based optimizers in long-horizon training. We also take the first step towards an adaptive spectral condition by incorporating data-dependent effects, making it better suited for long-horizon LLM training.
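The abstract's key idea, spectral control applied to optimizer updates only, can be illustrated with a minimal numerical sketch. This is not the paper's Muon++; the Newton-Schulz coefficients and the $\sqrt{\text{fan-out}/\text{fan-in}}$ update scale are standard Muon/$μ$P conventions assumed here for illustration.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G (push its singular values toward 1)
    with the quintic Newton-Schulz iteration commonly used by Muon.
    Coefficients are the widely circulated Muon reference values."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)  # Frobenius-normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                     # iterate on the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

def mup_muon_step(W, momentum, grad, lr=0.02, beta=0.95):
    """One hypothetical Muon-style step with a muP-compatible spectral
    scale applied to the *update only* -- the weights themselves are
    never re-normalized, matching the abstract's key insight."""
    momentum = beta * momentum + grad
    O = newton_schulz_orthogonalize(momentum)
    fan_out, fan_in = W.shape
    scale = np.sqrt(fan_out / fan_in)  # width-aware spectral-condition scaling
    return W - lr * scale * O, momentum
```

Because the orthogonalized update has singular values near 1, its spectral norm is controlled by construction, so the update obeys a $μ$P-style spectral bound without any Newton-Schulz pass over the weight matrices.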
Related papers
- Extending $μ$P: Spectral Conditions for Feature Learning Across Optimizers [3.5708391029226885]
We propose a novel framework to derive $μ$P for a broader class of optimizers, including AdamW, AD, LAMB, Sophia, Shampoo and Muon. We implement our $μ$Ps on multiple benchmark models and demonstrate zero-shot learning rate transfer across increasing model width.
arXiv Detail & Related papers (2026-02-24T14:17:51Z) - Muon with Spectral Guidance: Efficient Optimization for Scientific Machine Learning [10.647088281181222]
SpecMuon is a spectral-aware, multi-mode gradient flow for physics-informed learning. It regulates step sizes according to the global loss energy while preserving Muon's scale-balancing properties, achieving faster convergence and improved stability compared with Adam and AdamW.
arXiv Detail & Related papers (2026-02-18T03:56:20Z) - Elastic Spectral State Space Models for Budgeted Inference [6.579320299248572]
Foundation models are typically trained at a fixed computational capacity, while real-world applications require deployment across platforms with different resource constraints. We propose Elastic Spectral State Space Models (ES-SSM), which require only one-time training at full capacity, but can be directly truncated into arbitrary scales for budgeted, runtime inference without retraining. We demonstrate that a single ES-SSM model trained once can be truncated to provide competitive performance compared with modern Transformer and SSM baselines at similar parameter scales.
arXiv Detail & Related papers (2026-01-30T02:58:19Z) - Controlled LLM Training on Spectral Sphere [76.60985966206746]
We introduce the Spectral Sphere algorithm (SSO), which enforces strict module-wise spectral constraints on both weights and their updates. We observe significant practical stability benefits, including improved MoE router load balancing, suppressed outliers, and strictly bounded activations.
arXiv Detail & Related papers (2026-01-13T09:59:47Z) - How to Set the Learning Rate for Large-Scale Pre-training? [73.03133634525635]
We formalize this investigation into two distinct research paradigms: Fitting and Transfer. Within the Fitting Paradigm, we introduce a scaling law for the search factor, effectively reducing the search complexity from $O(n^3)$ to O(n*C_D*C_) via predictive modeling. We extend the principles of $μ$Transfer to the Mixture-of-Experts (MoE) architecture, broadening its applicability to encompass model depth, weight decay, and token horizons.
arXiv Detail & Related papers (2026-01-08T15:55:13Z) - Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales [55.91454326946738]
We study how the optimal learning rate and weight decay should scale with model width and depth for a wide range of language models. We find that scaling the learning rate according to $μ$P improves transfer, but can still suffer from significant finite-width deviations. For compute-optimal scaling, we find that scaling independent weight decay as $1/\mathrm{width}$ is nearly optimal across language models.
arXiv Detail & Related papers (2025-12-05T11:03:41Z) - Scaling Laws and In-Context Learning: A Unified Theoretical Framework [0.0]
In-context learning (ICL) enables large language models to adapt to new tasks from demonstrations without parameter updates. We present a unified theoretical framework connecting scaling laws to ICL emergence in transformers. We show that ICL performance follows power-law relationships with model depth $L$, width $d$, context length $k$, and training data $D$, with exponents determined by task structure.
arXiv Detail & Related papers (2025-11-09T05:19:14Z) - How Muon's Spectral Design Benefits Generalization: A Study on Imbalanced Data [38.54408542311739]
We show that spectrum-aware matrix generalizations such as Muon and Shampoo might outperform competitive algorithms. We empirically verify our theoretical findings on a variety of imbalanced datasets.
arXiv Detail & Related papers (2025-10-27T04:00:42Z) - POME: Post Optimization Model Edit via Muon-style Projection [74.73326657229347]
Post-Optimization Model Edit (POME) enhances the performance of fine-tuned large language models. It applies a Muon-style projection to $\Delta W$, the difference between the fine-tuned and pretrained weights. As a simple post-processing step, POME is completely decoupled from the training pipeline.
arXiv Detail & Related papers (2025-10-08T04:20:11Z) - Muon Optimizes Under Spectral Norm Constraints [12.29696026957078]
We show that Muon implicitly solves an optimization problem that enforces a constraint on the spectral norm of weight matrices. This perspective allows for the exploration of a broader class of implicitly regularized and constrained optimization algorithms.
arXiv Detail & Related papers (2025-06-18T01:32:39Z) - Muon is Scalable for LLM Training [50.68746986439438]
We introduce Moonlight, a Mixture-of-Experts (MoE) model trained with 5.7T tokens using Muon. Our model pushes the current frontier, achieving better performance with far fewer training FLOPs than prior models. We open-source our distributed Muon implementation that is memory-optimal and communication-efficient.
arXiv Detail & Related papers (2025-02-24T09:12:29Z) - Hyperspherical Normalization for Scalable Deep Reinforcement Learning [57.016639036237315]
SimbaV2 is a novel reinforcement learning architecture designed to stabilize optimization. It scales up effectively with larger models and greater compute, achieving state-of-the-art performance on 57 continuous control tasks.
arXiv Detail & Related papers (2025-02-21T08:17:24Z) - From Dense to Dynamic: Token-Difficulty Driven MoEfication of Pre-Trained LLMs [37.50902921493273]
Training large language models (LLMs) for different inference constraints is computationally expensive. DynaMoE adapts a pre-trained dense LLM to a token-difficulty-driven Mixture-of-Experts model with minimal fine-tuning cost. Our method achieves similar aggregated accuracy across downstream tasks, despite using only $\frac{1}{9}$th of their fine-tuning cost.
arXiv Detail & Related papers (2025-02-17T21:12:57Z) - Transfer Q Star: Principled Decoding for LLM Alignment [105.89114186982972]
Transfer $Q^*$ estimates the optimal value function for a target reward $r$ through a baseline model.
Our approach significantly reduces the sub-optimality gap observed in prior SoTA methods.
arXiv Detail & Related papers (2024-05-30T21:36:12Z) - Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning [50.9692060692705]
This paper introduces Language Models for Motion Control (LaMo), a general framework based on Decision Transformers for offline RL. Our framework highlights four crucial components, including (1) initializing Decision Transformers with sequentially pre-trained LMs and (2) employing the LoRA fine-tuning method. In particular, our method demonstrates superior performance in scenarios with limited data samples.
arXiv Detail & Related papers (2023-10-31T16:24:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.