Delving into Muon and Beyond: Deep Analysis and Extensions
- URL: http://arxiv.org/abs/2602.04669v1
- Date: Wed, 04 Feb 2026 15:40:47 GMT
- Title: Delving into Muon and Beyond: Deep Analysis and Extensions
- Authors: Xianbiao Qi, Marco Chen, Jiaquan Ye, Yelin He, Rong Xiao,
- Abstract summary: We study Muon as the p = 0 endpoint of a family of spectral transformations of the form $U \boldsymbol{\Sigma}^{p} V'$. We find that RMS-normalized updates yield more stable optimization than first-moment updates. Our results suggest that Muon is best understood as an effective form of spectral normalization, but not a universally superior optimization method.
- Score: 8.297062899157664
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Muon optimizer has recently attracted considerable attention for its strong empirical performance and use of orthogonalized updates on matrix-shaped parameters, yet its underlying mechanisms and relationship to adaptive optimizers such as Adam remain insufficiently understood. In this work, we aim to address these questions through a unified spectral perspective. Specifically, we view Muon as the p = 0 endpoint of a family of spectral transformations of the form $U \boldsymbol{\Sigma}^{p} V'$, and consider additional variants with p = 1/2, p = 1/4, and p = 1. These transformations are applied to both first-moment updates, as in momentum SGD, and to root-mean-square (RMS) normalized gradient updates as in Adam. To enable efficient computation, we develop a coupled Newton iteration that avoids explicit singular value decomposition. Across controlled experiments, we find that RMS-normalized updates yield more stable optimization than first-moment updates. Moreover, while spectral compression provides strong stabilization benefits under first-moment updates, the Muon update (p = 0) does not consistently outperform Adam. These results suggest that Muon is best understood as an effective form of spectral normalization, but not a universally superior optimization method. Our source code will be released at https://github.com/Ocram7/BeyondMuon.
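To make the transformation family concrete, here is a minimal NumPy sketch of the $U \boldsymbol{\Sigma}^{p} V'$ map computed via an explicit SVD. This is a reference implementation only, and the function name is illustrative; the paper's coupled Newton iteration, which avoids the SVD, is not reproduced here.

```python
import numpy as np

def spectral_transform(update: np.ndarray, p: float) -> np.ndarray:
    """Apply the spectral map U diag(sigma)^p V' to a matrix update.

    p = 1 returns the update unchanged (up to round-off),
    p = 0 is the Muon-style orthogonalized update U V' (every
    singular value mapped to 1), and p = 1/2, 1/4 interpolate by
    compressing the singular-value spectrum.
    """
    U, s, Vt = np.linalg.svd(update, full_matrices=False)
    # Broadcasting multiplies each column of U by s**p, i.e. U @ diag(s**p).
    return (U * s**p) @ Vt

# Illustrative usage on a random matrix-shaped gradient:
G = np.random.randn(256, 128)
muon_like = spectral_transform(G, p=0.0)   # orthogonalized direction
half_root = spectral_transform(G, p=0.5)   # intermediate spectral compression
identity  = spectral_transform(G, p=1.0)   # recovers G up to numerical error
```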
Related papers
- MuonRec: Shifting the Optimizer Paradigm Beyond Adam in Scalable Generative Recommendation [60.1890607252082]
MuonRec is the first framework that brings the Muon iteration to RecSys training. We develop an open-source training recipe for recommendation models and evaluate it across both traditional sequential recommenders and modern generative recommenders.
arXiv Detail & Related papers (2026-02-28T02:32:44Z)
- Muon with Spectral Guidance: Efficient Optimization for Scientific Machine Learning [10.647088281181222]
SpecMuon is a spectral-aware, multi-mode gradient flow for physics-informed learning. It regulates step sizes according to the global loss energy while preserving Muon's scale-balancing properties. It achieves faster convergence and improved stability compared with Adam and AdamW.
arXiv Detail & Related papers (2026-02-18T03:56:20Z)
- TrasMuon: Trust-Region Adaptive Scaling for Orthogonalized Momentum Optimizers [24.534939825452884]
TrasMuon preserves the near-isometric geometry of Muon while stabilizing magnitudes through (i) global RMS calibration and (ii) energy-based trust-region clipping. We demonstrate that while reintroducing adaptive scaling improves optimization efficiency, it typically exacerbates instability due to high-energy outliers. Empirical experiments on vision and language models demonstrate that TrasMuon converges faster than baselines.
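A hedged sketch of how the two stabilizers named in the abstract might look. The function name `calibrate_and_clip`, the thresholds, and the order of the two steps are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def calibrate_and_clip(update: np.ndarray,
                       target_rms: float = 1.0,
                       max_energy: float = 1.0) -> np.ndarray:
    """Hypothetical reading of TrasMuon's abstract: (i) rescale the
    orthogonalized update to a global target RMS, then (ii) clip its
    total energy (squared Frobenius norm) to a trust region, which
    guards against high-energy outliers."""
    # (i) global RMS calibration: bring the update to a common scale.
    rms = np.sqrt(np.mean(update**2)) + 1e-12
    update = update * (target_rms / rms)
    # (ii) energy-based trust-region clipping: cap ||update||_F^2.
    energy = np.sum(update**2)
    if energy > max_energy:
        update = update * np.sqrt(max_energy / energy)
    return update
```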
arXiv Detail & Related papers (2026-02-13T22:11:59Z)
- Preconditioning Benefits of Spectral Orthogonalization in Muon [50.62925024212989]
We study the effectiveness of a simplified variant of Muon in two case studies: matrix factorization and in-context learning of linear transformers. Our analysis reveals that the Muon dynamics decouple into a collection of independent scalar sequences in the spectral domain, each exhibiting similar convergence behavior.
arXiv Detail & Related papers (2026-01-20T00:08:31Z)
- Muon is Provably Faster with Momentum Variance Reduction [55.388203260208485]
Recent empirical research has demonstrated that deep learning optimizers based on the linear minimization oracle (LMO) over specifically chosen non-Euclidean norms outperform Adam-type training methods in the training of large language models.
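For context, the LMO over the spectral-norm ball has a standard closed form that recovers the Muon-style orthogonalized direction. The sketch below states that fact; the function name is illustrative and nothing here is specific to this paper's variance-reduction scheme.

```python
import numpy as np

def spectral_lmo(grad: np.ndarray, radius: float = 1.0) -> np.ndarray:
    """Linear minimization oracle over the spectral-norm ball:
    argmin over {X : ||X||_2 <= radius} of <grad, X> equals
    -radius * U V', where grad = U diag(s) V'. This closed form is
    the standard link between non-Euclidean LMO methods and Muon's
    orthogonalized update."""
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    return -radius * (U @ Vt)
```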
arXiv Detail & Related papers (2025-12-18T14:38:39Z)
- Beyond the Ideal: Analyzing the Inexact Muon Update [54.70108543057578]
We present the first analysis of the inexact update at Muon's core. We reveal a fundamental coupling between this inexactness and the optimal step size and momentum.
arXiv Detail & Related papers (2025-10-22T18:01:07Z)
- NorMuon: Making Muon more efficient and scalable [71.49702449498085]
We propose NorMuon (Neuron-wise Normalized Muon) as a successor to Adam. We show that NorMuon consistently outperforms both Adam and Muon, achieving 21.74% better training efficiency than Adam and an 11.31% improvement over Muon in a 1.1B-parameter pretraining setting.
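One plausible reading of "neuron-wise normalization" is a per-row rescaling applied after Muon's orthogonalization. Everything below, including where the step sits in the optimizer and the renormalization of the overall scale, is an assumption for illustration, not the authors' code.

```python
import numpy as np

def neuronwise_normalize(update: np.ndarray,
                         second_moment: np.ndarray,
                         beta2: float = 0.95,
                         eps: float = 1e-8) -> np.ndarray:
    """Hypothetical sketch: rescale each output neuron (row) of an
    orthogonalized update by a running estimate of its second moment,
    Adam-style but at row granularity. `second_moment` is a per-row
    state vector, e.g. initialized as np.zeros(W.shape[0])."""
    row_sq = np.mean(update**2, axis=1)            # per-neuron mean square
    second_moment[:] = beta2 * second_moment + (1 - beta2) * row_sq
    scale = 1.0 / (np.sqrt(second_moment) + eps)   # per-row adaptive scale
    scale /= np.mean(scale)                        # keep overall update size
    return update * scale[:, None]
```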
arXiv Detail & Related papers (2025-10-07T01:13:41Z)
- Error Feedback for Muon and Friends [80.90330715662961]
We introduce EF21-Muon, the first communication-efficient, non-Euclidean LMO-based optimizer with rigorous convergence guarantees. Our theory covers the non-Euclidean smooth and the more general $(L_0, L_1)$-smooth settings, matching best-known Euclidean rates and enabling faster convergence under suitable norm choices.
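The EF21 mechanism the method builds on is documented elsewhere; the sketch below shows only the classical EF21 state update with a top-k compressor, and makes no claim about how EF21-Muon couples it with the LMO step.

```python
import numpy as np

def topk(x: np.ndarray, k: int) -> np.ndarray:
    """Top-k sparsifier: keep the k largest-magnitude entries, a
    standard contractive compressor used with error feedback."""
    flat = x.ravel()
    out = np.zeros_like(flat)
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    out[idx] = flat[idx]
    return out.reshape(x.shape)

def ef21_step(g_state: np.ndarray, grad: np.ndarray, k: int) -> np.ndarray:
    """One EF21-style update: compress only the *change* in the
    gradient, so the maintained estimate g_state tracks grad while
    only compressed differences need to be communicated."""
    g_state += topk(grad - g_state, k)
    return g_state
```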
arXiv Detail & Related papers (2025-10-01T08:20:08Z)
- AuON: A Linear-time Alternative to Semi-Orthogonal Momentum Updates [0.0]
We study the semi-orthogonal properties of momentum-based updates and develop a method to bound momentum updates under a spectral-norm trust region. We propose AuON (Alternative Unit-norm momentum updates by Normalized nonlinear scaling), a linear-time alternative that achieves strong performance without constructing semi-orthogonal matrices. Our approach combines hyperbolic-cosine RMS scaling transformations with normalization, demonstrating both effectiveness and computational efficiency compared with Newton-Schulz methods.
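The abstract names only the ingredients (hyperbolic-cosine RMS scaling plus normalization), so the following is a speculative sketch of one way they could combine into a linear-time, SVD-free update; the exact transformation is the paper's contribution and may differ entirely.

```python
import numpy as np

def auon_like_update(momentum: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Speculative sketch: normalize the momentum to unit RMS, then
    apply an elementwise hyperbolic-cosine-shaped gain, renormalized
    so the mean gain is 1. Both O(mn) steps avoid any matrix
    factorization or Newton-Schulz iteration."""
    rms = np.sqrt(np.mean(momentum**2)) + eps
    u = momentum / rms                          # unit-RMS normalization
    gain = np.cosh(u) / np.mean(np.cosh(u))     # hypothetical cosh gain
    return u * gain
```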
arXiv Detail & Related papers (2025-09-29T06:03:53Z)
- Conda: Column-Normalized Adam for Training Large Language Models Faster [70.66067959375748]
Column-Normalized Adam (Conda) is a novel optimizer for training large language models (LLMs). Conda projects updates into a subspace and applies column-wise second-moment normalization based on the projected gradients. Experiments on the LLaMA and GPT-2 series show that Conda consistently outperforms AdamW, Muon, and other baselines in pre-training.
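A hedged sketch of the projection-plus-column-normalization recipe described above. The projection matrix, the state layout, and all names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def conda_like_update(grad: np.ndarray,
                      proj: np.ndarray,
                      v_col: np.ndarray,
                      beta2: float = 0.999,
                      eps: float = 1e-8) -> np.ndarray:
    """Hypothetical sketch: project the gradient (shape (m, n)) into a
    subspace via proj (shape (m, r), orthonormal columns assumed),
    apply Adam-style second-moment normalization column-wise on the
    projected gradient, then map back. `v_col` is a length-n state
    vector, e.g. initialized as np.zeros(n)."""
    g_proj = proj.T @ grad                        # (r, n) projected gradient
    col_sq = np.mean(g_proj**2, axis=0)           # per-column mean square
    v_col[:] = beta2 * v_col + (1 - beta2) * col_sq
    g_norm = g_proj / (np.sqrt(v_col)[None, :] + eps)
    return proj @ g_norm                          # back to parameter space
```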
arXiv Detail & Related papers (2025-09-29T02:58:19Z)
- AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates [5.049533819651459]
We propose a new adaptive update, AdaGO, which combines a norm-based orthogonal update with an AdaGrad-type stepsize. AdaGO preserves the orthogonality of the update, which can be interpreted as spectral descent, while adapting the stepsize to the optimization landscape by scaling the direction with accumulated past gradients.
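A minimal sketch of the stated combination, assuming the scalar AdaGrad-norm accumulation; the authors' exact stepsize rule may differ.

```python
import numpy as np

def adago_step(grad: np.ndarray, accum: float,
               eta: float = 0.02, eps: float = 1e-8):
    """Hedged sketch: take the orthogonal (spectral-descent) direction
    U V' from the gradient and scale it by an AdaGrad-norm stepsize
    built from accumulated squared gradient norms. `accum` is a scalar
    state, initialized to 0.0, and is returned updated."""
    accum += np.sum(grad**2)                    # AdaGrad-style accumulation
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    direction = U @ Vt                          # orthogonal update direction
    step = eta / (np.sqrt(accum) + eps)
    return -step * direction, accum
```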
arXiv Detail & Related papers (2025-09-03T03:42:22Z)