MuonAll: Muon Variant for Efficient Finetuning of Large Language Models
- URL: http://arxiv.org/abs/2511.06086v1
- Date: Sat, 08 Nov 2025 17:45:20 GMT
- Title: MuonAll: Muon Variant for Efficient Finetuning of Large Language Models
- Authors: Saurabh Page, Advait Joshi, S. S. Sonawane
- Abstract summary: We introduce MuonAll, which incorporates all the parameters inside Muon by transforming them into 2D matrices. We conduct extensive finetuning experiments across publicly available language models with model sizes up to half a billion parameters.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Muon optimizer has demonstrated robust results in pretraining of language models, but its performance when finetuning existing publicly available pretrained models has not yet been explored. Currently, Muon is used alongside AdamW, which leaves room for improvement by handling all parameters inside Muon. We introduce MuonAll, which incorporates all the parameters inside Muon by transforming them into 2D matrices. We conduct extensive finetuning experiments across publicly available language models with model sizes up to half a billion parameters. Muon and MuonAll perform on par with AdamW across major benchmarks, highlighting their effectiveness as alternative optimizers. We open-source the distributed implementations of Muon and MuonAll, available at https://github.com/Saurabh750/optimizer
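Below is a minimal sketch of the idea described in the abstract: routing every parameter, including 1D tensors such as biases and norm scales, through a Muon-style orthogonalized update by viewing it as a 2D matrix. This is not the authors' implementation (see the repository linked above); the reshaping rule, the Newton-Schulz coefficients, and the hyperparameters are assumptions based on public Muon reference code.

```python
import torch

def newton_schulz_orthogonalize(grad: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2D matrix with the quintic Newton-Schulz
    iteration used in public Muon implementations. The coefficients below are
    taken from that reference code and are an assumption here, not from the paper."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = grad / (grad.norm() + eps)          # normalize so the iteration is stable
    transposed = x.size(0) > x.size(1)
    if transposed:                           # iterate on the wide orientation
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

@torch.no_grad()
def muonall_step(params, momenta, lr=0.02, beta=0.95):
    """One hypothetical MuonAll-style step: every parameter is reshaped to a 2D
    matrix (1D tensors become 1 x n) before the orthogonalized update. The exact
    transformation used by MuonAll may differ; this is only an illustration."""
    for p, m in zip(params, momenta):
        if p.grad is None:
            continue
        m.mul_(beta).add_(p.grad)                                   # momentum buffer
        g = m.view(1, -1) if m.ndim < 2 else m.view(m.size(0), -1)  # assumed 2D reshaping
        update = newton_schulz_orthogonalize(g).reshape(p.shape)
        p.add_(update, alpha=-lr)
```

A usage sketch: build `params = [p for p in model.parameters() if p.requires_grad]` and `momenta = [torch.zeros_like(p) for p in params]`, then call `muonall_step(params, momenta)` after each backward pass. Reference Muon additionally applies Nesterov-style momentum and a shape-dependent scale factor, which are omitted here for brevity.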
Related papers
- NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training [50.27276603708547]
We show that despite imposing full-rank updates, Muon-trained models exhibit pronounced low-rank structure in their weight matrices and are readily compressible under standard pipelines. We propose NuMuon, which augments Muon with a nuclear-norm constraint on the update direction, further constraining the learned weights toward low-rank structure.
arXiv Detail & Related papers (2026-03-04T00:10:14Z) - MuonRec: Shifting the Optimizer Paradigm Beyond Adam in Scalable Generative Recommendation [60.1890607252082]
MuonRec is the first framework that brings the proposed Muon iteration to RecSys training. We develop an open-source training recipe for recommendation models and evaluate it across both traditional sequential recommenders and modern generative recommenders.
arXiv Detail & Related papers (2026-02-28T02:32:44Z) - Muon+: Towards Better Muon via One Additional Normalization Step [18.816463168231618]
We propose a simple yet effective enhancement to Muon, namely Muon+. We demonstrate the effectiveness of Muon+ through extensive pre-training experiments across a wide range of model scales and architectures.
arXiv Detail & Related papers (2026-02-25T04:04:00Z) - NorMuon: Making Muon more efficient and scalable [71.49702449498085]
We propose NorMuon (Neuron-wise Normalized Muon) as a successor to Adam. We show that NorMuon consistently outperforms both Adam and Muon, achieving 21.74% better training efficiency than Adam and an 11.31% improvement over Muon in the 1.1B pretraining setting.
arXiv Detail & Related papers (2025-10-07T01:13:41Z) - Muon Outperforms Adam in Tail-End Associative Memory Learning [118.98991042050532]
We show that Muon consistently achieves balanced learning across classes regardless of feature embeddings. Our empirical observations and theoretical analyses reveal Muon's core advantage: its update rule aligns with the outer-product structure of linear associative memories.
arXiv Detail & Related papers (2025-09-30T10:04:08Z) - LiMuon: Light and Fast Muon Optimizer for Large Models [45.11415579822849]
We propose LiMuon, a light and fast Muon optimizer for training large models. LiMuon has a lower memory footprint than the current Muon and its variants. We prove that LiMuon has a sample complexity of $O(\epsilon^{-3})$ under the generalized smooth condition.
arXiv Detail & Related papers (2025-09-18T02:49:27Z) - AdaMuon: Adaptive Muon Optimizer [11.281916426508216]
AdaMuon combines element-wise adaptivity with orthogonal updates for large-scale neural network training. AdaMuon maintains stability while surpassing Adam by more than 40% in training efficiency in large-scale scenarios.
arXiv Detail & Related papers (2025-07-15T05:49:37Z) - Practical Efficiency of Muon for Pretraining [13.914926836677648]
We find that Muon is more effective than AdamW in retaining data efficiency at large batch sizes. We present a simple algorithm that accounts for all sources of error in muP while introducing only a modest overhead in resources.
arXiv Detail & Related papers (2025-05-04T19:14:43Z) - Muon is Scalable for LLM Training [50.68746986439438]
We introduce Moonlight, a Mixture-of-Experts (MoE) model trained with 5.7T tokens using Muon. Our model improves the current frontier, achieving better performance with far fewer training FLOPs than prior models. We open-source our distributed Muon implementation, which is memory-optimal and communication-efficient.
arXiv Detail & Related papers (2025-02-24T09:12:29Z)