Delving into Muon and Beyond: Deep Analysis and Extensions
- URL: http://arxiv.org/abs/2602.04669v1
- Date: Wed, 04 Feb 2026 15:40:47 GMT
- Title: Delving into Muon and Beyond: Deep Analysis and Extensions
- Authors: Xianbiao Qi, Marco Chen, Jiaquan Ye, Yelin He, Rong Xiao,
- Abstract summary: We study Muon as the p = 0 endpoint of a family of spectral transformations of the form $U \boldsymbol{\Sigma}^{p} V'$. We find that RMS-normalized updates yield more stable optimization than first-moment updates. Our results suggest that Muon is best understood as an effective form of spectral normalization, but not a universally superior optimization method.
- Score: 8.297062899157664
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Muon optimizer has recently attracted considerable attention for its strong empirical performance and use of orthogonalized updates on matrix-shaped parameters, yet its underlying mechanisms and relationship to adaptive optimizers such as Adam remain insufficiently understood. In this work, we aim to address these questions through a unified spectral perspective. Specifically, we view Muon as the p = 0 endpoint of a family of spectral transformations of the form $U \boldsymbol{\Sigma}^{p} V'$, and consider additional variants with p = 1/2, p = 1/4, and p = 1. These transformations are applied to both first-moment updates, as in momentum SGD, and to root-mean-square (RMS) normalized gradient updates as in Adam. To enable efficient computation, we develop a coupled Newton iteration that avoids explicit singular value decomposition. Across controlled experiments, we find that RMS-normalized updates yield more stable optimization than first-moment updates. Moreover, while spectral compression provides strong stabilization benefits under first-moment updates, the Muon update (p = 0) does not consistently outperform Adam. These results suggest that Muon is best understood as an effective form of spectral normalization, but not a universally superior optimization method. Our source code will be released at https://github.com/Ocram7/BeyondMuon.
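To make the transformation family concrete, here is a minimal NumPy sketch of the $U \boldsymbol{\Sigma}^{p} V'$ map computed via an explicit SVD. This is a reference implementation only, and the function name is illustrative; the paper's coupled Newton iteration, which avoids the SVD, is not reproduced here.

```python
import numpy as np

def spectral_transform(update: np.ndarray, p: float) -> np.ndarray:
    """Apply the spectral map U diag(sigma)^p V' to a matrix update.

    p = 1 returns the update unchanged (up to round-off),
    p = 0 is the Muon-style orthogonalized update U V' (every
    singular value mapped to 1), and p = 1/2, 1/4 interpolate by
    compressing the singular-value spectrum.
    """
    U, s, Vt = np.linalg.svd(update, full_matrices=False)
    # Broadcasting multiplies each column of U by s**p, i.e. U @ diag(s**p).
    return (U * s**p) @ Vt

# Illustrative usage on a random matrix-shaped gradient:
G = np.random.randn(256, 128)
muon_like = spectral_transform(G, p=0.0)   # orthogonalized direction
half_root = spectral_transform(G, p=0.5)   # intermediate spectral compression
identity  = spectral_transform(G, p=1.0)   # recovers G up to numerical error
```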
Related papers
- MuonRec: Shifting the Optimizer Paradigm Beyond Adam in Scalable Generative Recommendation [60.1890607252082]
MuonRec is the first framework that brings the Muon iteration to RecSys training. We develop an open-source training recipe for recommendation models and evaluate it across both traditional sequential recommenders and modern generative recommenders.
arXiv Detail & Related papers (2026-02-28T02:32:44Z)
- Muon with Spectral Guidance: Efficient Optimization for Scientific Machine Learning [10.647088281181222]
SpecMuon is a spectral-aware, multi-mode gradient flow for physics-informed learning. It regulates step sizes according to the global loss energy while preserving Muon's scale-balancing properties. It achieves faster convergence and improved stability compared with Adam and AdamW.
arXiv Detail & Related papers (2026-02-18T03:56:20Z)
- TrasMuon: Trust-Region Adaptive Scaling for Orthogonalized Momentum Optimizers [24.534939825452884]
TrasMuon preserves the near-isometric geometry of Muon while stabilizing magnitudes through (i) global RMS calibration and (ii) energy-based trust-region clipping. We demonstrate that while reintroducing adaptive scaling improves optimization efficiency, it typically exacerbates instability due to high-energy outliers. Empirical experiments on vision and language models demonstrate that TrasMuon converges faster than baselines.
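A hedged sketch of how the two stabilizers named in the abstract might look. The function name `calibrate_and_clip`, the thresholds, and the order of the two steps are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def calibrate_and_clip(update: np.ndarray,
                       target_rms: float = 1.0,
                       max_energy: float = 1.0) -> np.ndarray:
    """Hypothetical reading of TrasMuon's abstract: (i) rescale the
    orthogonalized update to a global target RMS, then (ii) clip its
    total energy (squared Frobenius norm) to a trust region, which
    guards against high-energy outliers."""
    # (i) global RMS calibration: bring the update to a common scale.
    rms = np.sqrt(np.mean(update**2)) + 1e-12
    update = update * (target_rms / rms)
    # (ii) energy-based trust-region clipping: cap ||update||_F^2.
    energy = np.sum(update**2)
    if energy > max_energy:
        update = update * np.sqrt(max_energy / energy)
    return update
```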
arXiv Detail & Related papers (2026-02-13T22:11:59Z)
- Preconditioning Benefits of Spectral Orthogonalization in Muon [50.62925024212989]
We study the effectiveness of a simplified variant of Muon in two case studies: matrix factorization and in-context learning of linear transformers. Our analysis reveals that the Muon dynamics decouple into a collection of independent scalar sequences in the spectral domain, each exhibiting similar convergence behavior.
arXiv Detail & Related papers (2026-01-20T00:08:31Z)
- Muon is Provably Faster with Momentum Variance Reduction [55.388203260208485]
Recent empirical research has demonstrated that deep learning optimizers based on the linear minimization oracle (LMO) over specifically chosen non-Euclidean norms outperform Adam-type training methods in the training of large language models.
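For context, the LMO over the spectral-norm ball has a standard closed form that recovers the Muon-style orthogonalized direction. The sketch below states that fact; the function name is illustrative and nothing here is specific to this paper's variance-reduction scheme.

```python
import numpy as np

def spectral_lmo(grad: np.ndarray, radius: float = 1.0) -> np.ndarray:
    """Linear minimization oracle over the spectral-norm ball:
    argmin over {X : ||X||_2 <= radius} of <grad, X> equals
    -radius * U V', where grad = U diag(s) V'. This closed form is
    the standard link between non-Euclidean LMO methods and Muon's
    orthogonalized update."""
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    return -radius * (U @ Vt)
```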
arXiv Detail & Related papers (2025-12-18T14:38:39Z)
- Beyond the Ideal: Analyzing the Inexact Muon Update [54.70108543057578]
We present the first analysis of the inexact update at Muon's core. We reveal a fundamental coupling between this inexactness and the optimal step size and momentum.
arXiv Detail & Related papers (2025-10-22T18:01:07Z)
- NorMuon: Making Muon more efficient and scalable [71.49702449498085]
We propose NorMuon (Neuron-wise Normalized Muon) as a successor to Adam. We show that NorMuon consistently outperforms both Adam and Muon, achieving 21.74% better training efficiency than Adam and an 11.31% improvement over Muon in a 1.1B-parameter pretraining setting.
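One plausible reading of "neuron-wise normalization" is a per-row rescaling applied after Muon's orthogonalization. Everything below, including where the step sits in the optimizer and the renormalization of the overall scale, is an assumption for illustration, not the authors' code.

```python
import numpy as np

def neuronwise_normalize(update: np.ndarray,
                         second_moment: np.ndarray,
                         beta2: float = 0.95,
                         eps: float = 1e-8) -> np.ndarray:
    """Hypothetical sketch: rescale each output neuron (row) of an
    orthogonalized update by a running estimate of its second moment,
    Adam-style but at row granularity. `second_moment` is a per-row
    state vector, e.g. initialized as np.zeros(W.shape[0])."""
    row_sq = np.mean(update**2, axis=1)            # per-neuron mean square
    second_moment[:] = beta2 * second_moment + (1 - beta2) * row_sq
    scale = 1.0 / (np.sqrt(second_moment) + eps)   # per-row adaptive scale
    scale /= np.mean(scale)                        # keep overall update size
    return update * scale[:, None]
```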
arXiv Detail & Related papers (2025-10-07T01:13:41Z)
- Error Feedback for Muon and Friends [80.90330715662961]
We introduce EF21-Muon, the first communication-efficient, non-Euclidean LMO-based optimizer with rigorous convergence guarantees. Our theory covers the non-Euclidean smooth and the more general $(L_0, L_1)$-smooth settings, matching best-known Euclidean rates and enabling faster convergence under suitable norm choices.
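The EF21 mechanism the method builds on is documented elsewhere; the sketch below shows only the classical EF21 state update with a top-k compressor, and makes no claim about how EF21-Muon couples it with the LMO step.

```python
import numpy as np

def topk(x: np.ndarray, k: int) -> np.ndarray:
    """Top-k sparsifier: keep the k largest-magnitude entries, a
    standard contractive compressor used with error feedback."""
    flat = x.ravel()
    out = np.zeros_like(flat)
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    out[idx] = flat[idx]
    return out.reshape(x.shape)

def ef21_step(g_state: np.ndarray, grad: np.ndarray, k: int) -> np.ndarray:
    """One EF21-style update: compress only the *change* in the
    gradient, so the maintained estimate g_state tracks grad while
    only compressed differences need to be communicated."""
    g_state += topk(grad - g_state, k)
    return g_state
```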
arXiv Detail & Related papers (2025-10-01T08:20:08Z)
- AuON: A Linear-time Alternative to Semi-Orthogonal Momentum Updates [0.0]
We study the semi-orthogonal properties of momentum-based updates and develop a method to bound momentum updates under a spectral-norm trust region. We propose AuON (Alternative Unit-norm momentum updates by Normalized nonlinear scaling), a linear-time alternative that achieves strong performance without constructing semi-orthogonal matrices. Our approach combines hyperbolic-cosine RMS scaling transformations with normalization, demonstrating both effectiveness and computational efficiency compared with Newton-Schulz methods.
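The abstract names only the ingredients (hyperbolic-cosine RMS scaling plus normalization), so the following is a speculative sketch of one way they could combine into a linear-time, SVD-free update; the exact transformation is the paper's contribution and may differ entirely.

```python
import numpy as np

def auon_like_update(momentum: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Speculative sketch: normalize the momentum to unit RMS, then
    apply an elementwise hyperbolic-cosine-shaped gain, renormalized
    so the mean gain is 1. Both O(mn) steps avoid any matrix
    factorization or Newton-Schulz iteration."""
    rms = np.sqrt(np.mean(momentum**2)) + eps
    u = momentum / rms                          # unit-RMS normalization
    gain = np.cosh(u) / np.mean(np.cosh(u))     # hypothetical cosh gain
    return u * gain
```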
arXiv Detail & Related papers (2025-09-29T06:03:53Z)
- Conda: Column-Normalized Adam for Training Large Language Models Faster [70.66067959375748]
Column-Normalized Adam (Conda) is a novel optimizer for training large language models (LLMs). Conda projects updates into a subspace and applies column-wise second-moment normalization based on the projected gradients. Experiments on the LLaMA and GPT-2 series show that Conda consistently outperforms AdamW, Muon, and other baselines in pre-training.
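A hedged sketch of the projection-plus-column-normalization recipe described above. The projection matrix, the state layout, and all names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def conda_like_update(grad: np.ndarray,
                      proj: np.ndarray,
                      v_col: np.ndarray,
                      beta2: float = 0.999,
                      eps: float = 1e-8) -> np.ndarray:
    """Hypothetical sketch: project the gradient (shape (m, n)) into a
    subspace via proj (shape (m, r), orthonormal columns assumed),
    apply Adam-style second-moment normalization column-wise on the
    projected gradient, then map back. `v_col` is a length-n state
    vector, e.g. initialized as np.zeros(n)."""
    g_proj = proj.T @ grad                        # (r, n) projected gradient
    col_sq = np.mean(g_proj**2, axis=0)           # per-column mean square
    v_col[:] = beta2 * v_col + (1 - beta2) * col_sq
    g_norm = g_proj / (np.sqrt(v_col)[None, :] + eps)
    return proj @ g_norm                          # back to parameter space
```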
arXiv Detail & Related papers (2025-09-29T02:58:19Z)
- AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates [5.049533819651459]
We propose a new adaptive update, AdaGO, which combines a norm-based orthogonal update with an AdaGrad-type stepsize. AdaGO preserves the orthogonality of the update, which can be interpreted as spectral descent, while adapting the stepsize to the optimization landscape by scaling the direction with accumulated past gradients.
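A minimal sketch of the stated combination, assuming the scalar AdaGrad-norm accumulation; the authors' exact stepsize rule may differ.

```python
import numpy as np

def adago_step(grad: np.ndarray, accum: float,
               eta: float = 0.02, eps: float = 1e-8):
    """Hedged sketch: take the orthogonal (spectral-descent) direction
    U V' from the gradient and scale it by an AdaGrad-norm stepsize
    built from accumulated squared gradient norms. `accum` is a scalar
    state, initialized to 0.0, and is returned updated."""
    accum += np.sum(grad**2)                    # AdaGrad-style accumulation
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    direction = U @ Vt                          # orthogonal update direction
    step = eta / (np.sqrt(accum) + eps)
    return -step * direction, accum
```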
arXiv Detail & Related papers (2025-09-03T03:42:22Z)