Muon+: Towards Better Muon via One Additional Normalization Step
- URL: http://arxiv.org/abs/2602.21545v2
- Date: Thu, 26 Feb 2026 17:01:08 GMT
- Title: Muon+: Towards Better Muon via One Additional Normalization Step
- Authors: Ruijie Zhang, Yequan Zhao, Ziyue Liu, Zhengyang Wang, Zheng Zhang
- Abstract summary: We propose a simple yet effective enhancement to Muon, namely Muon+. We demonstrate the effectiveness of Muon+ through extensive pre-training experiments across a wide range of model scales and architectures.
- Score: 18.816463168231618
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Muon optimizer has demonstrated promising performance in pre-training large language models through gradient (or momentum) orthogonalization. In this work, we propose a simple yet effective enhancement to Muon, namely Muon+, which introduces an additional normalization step after orthogonalization. We demonstrate the effectiveness of Muon+ through extensive pre-training experiments across a wide range of model scales and architectures. Our evaluation includes GPT-style models ranging from 130M to 774M parameters and LLaMA-style models ranging from 60M to 1B parameters. We comprehensively evaluate the effectiveness of Muon+ in the compute-optimal training regime and further extend the token-to-parameter (T2P) ratio to an industrial level of $\approx 200$. Experimental results show that Muon+ provides a consistent boost on training and validation perplexity over Muon. We provide our code here: https://github.com/K1seki221/MuonPlus.
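For intuition, the sketch below shows what a Muon-style update with one additional post-orthogonalization normalization step could look like in PyTorch. The Newton-Schulz orthogonalization mirrors the widely used public Muon reference code; the specific normalization applied afterwards (unit-RMS rescaling here) and the hyperparameter values are illustrative assumptions only, since the abstract does not specify them. See the linked repository for the authors' actual implementation.

```python
# Minimal, hypothetical sketch of "orthogonalize, then normalize" for a 2D weight.
# The Newton-Schulz iteration mirrors the public Muon reference implementation;
# the final unit-RMS rescaling is an ASSUMED stand-in for Muon+'s normalization step.
import torch


def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D matrix (the core Muon ingredient)."""
    a, b, c = 3.4445, -4.7750, 2.0315        # quintic iteration coefficients
    X = G / (G.norm() + 1e-7)                # scale so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X


@torch.no_grad()
def muon_plus_like_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    """One hypothetical step: momentum -> orthogonalize -> extra normalization."""
    momentum_buf.mul_(beta).add_(grad)                        # momentum accumulation
    update = newton_schulz_orthogonalize(momentum_buf)        # standard Muon update direction
    update = update / (update.square().mean().sqrt() + 1e-7)  # assumed extra normalization
    param.add_(update, alpha=-lr)                             # apply the update
```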
Related papers
- NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training [50.27276603708547]
We show that despite imposing full-rank updates, Muon-trained models exhibit pronounced low-rank structure in their weight matrices and are readily compressible under standard pipelines. We propose NuMuon, which augments Muon with a nuclear-norm constraint on the update direction, further constraining the learned weights toward low-rank structure.
arXiv Detail & Related papers (2026-03-04T00:10:14Z) - MuonRec: Shifting the Optimizer Paradigm Beyond Adam in Scalable Generative Recommendation [60.1890607252082]
MuonRec is the first framework that brings the Muon iteration to RecSys training. We develop an open-source training recipe for recommendation models and evaluate it across both traditional sequential recommenders and modern generative recommenders.
arXiv Detail & Related papers (2026-02-28T02:32:44Z) - Variance-Adaptive Muon: Accelerating LLM Pretraining with NSR-Modulated and Variance-Scaled Momentum [19.385264518362472]
Large Language Models (LLMs) achieve competitive performance across diverse natural language processing (NLP) tasks. We propose two variants of Muon, Muon-NSR and Muon-VS, which apply variance-adaptive normalization to momentum. Experiments on GPT-2 and LLaMA pretraining demonstrate that our proposed methods accelerate convergence and consistently achieve lower validation loss than both competitive, well-tuned AdamW and Muon baselines.
arXiv Detail & Related papers (2026-01-21T02:41:56Z) - Muon is Provably Faster with Momentum Variance Reduction [55.388203260208485]
Recent empirical research has demonstrated that optimizers based on the linear minimization oracle (LMO) over specifically chosen non-Euclidean norms outperform Adam-type training methods on large language models.
arXiv Detail & Related papers (2025-12-18T14:38:39Z) - MuonAll: Muon Variant for Efficient Finetuning of Large Language Models [0.0]
We introduce MuonAll, which incorporates all parameters into Muon by transforming them into 2D matrices. We conduct extensive finetuning experiments across publicly available language models with sizes up to half a billion parameters.
arXiv Detail & Related papers (2025-11-08T17:45:20Z) - NorMuon: Making Muon more efficient and scalable [71.49702449498085]
We propose NorMuon (Neuron-wise Normalized Muon) as a successor to Adam. We show NorMuon consistently outperforms both Adam and Muon, achieving 21.74% better training efficiency than Adam and an 11.31% improvement over Muon in a 1.1B-parameter pretraining setting.
arXiv Detail & Related papers (2025-10-07T01:13:41Z) - Muon Outperforms Adam in Tail-End Associative Memory Learning [118.98991042050532]
We show that Muon consistently achieves balanced learning across classes regardless of feature embeddings. Our empirical observations and theoretical analyses reveal Muon's core advantage: its update rule aligns with the outer-product structure of linear associative memories.
arXiv Detail & Related papers (2025-09-30T10:04:08Z) - Muon: Training and Trade-offs with Latent Attention and MoE [4.500362688166346]
We present a comprehensive theoretical and empirical study of the Muon optimizer for training small- to medium-scale decoder-only transformers (30M - 200M parameters). We provide rigorous theoretical analysis, including: (i) the convergence rate under standard assumptions, (ii) spectral regularization properties that prevent gradient explosion, (iii) a connection to natural gradient descent on the Stiefel manifold, and (iv) equivalence to steepest descent under the spectral norm.
arXiv Detail & Related papers (2025-09-29T07:51:06Z) - AdaMuon: Adaptive Muon Optimizer [11.281916426508216]
AdaMuon combines element-wise adaptivity with orthogonal updates for large-scale neural network training. AdaMuon maintains stability while surpassing Adam by more than 40% in training efficiency in large-scale scenarios.
arXiv Detail & Related papers (2025-07-15T05:49:37Z) - Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs [111.69640966866059]
Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of most capable language models. In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs. The key goals are better usage of the computing resources under the dynamic sparse model structures and materializing the expected performance gain on the actual hardware.
arXiv Detail & Related papers (2025-05-07T15:46:36Z) - Muon is Scalable for LLM Training [50.68746986439438]
We introduce Moonlight, a Mixture-of-Experts (MoE) model trained with 5.7T tokens using Muon. Our model improves the current frontier, achieving better performance with far fewer training FLOPs than prior models. We open-source our distributed Muon implementation, which is memory optimal and communication efficient.
arXiv Detail & Related papers (2025-02-24T09:12:29Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, and to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.