Related papers: Muon is Scalable for LLM Training

Muon is Scalable for LLM Training

URL: http://arxiv.org/abs/2502.16982v1
Date: Mon, 24 Feb 2025 09:12:29 GMT
Title: Muon is Scalable for LLM Training
Authors: Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, Zhilin Yang,
Abstract summary: We introduce Moonlight, a Mixture-of-Expert (MoE) model trained with 5.7T tokens using Muon.<n>Our model improves the current frontier, achieving better performance with much fewer training FLOPs compared to prior models.<n>We open-source our distributed Muon implementation that is memory optimal and communication efficient.
Score: 50.68746986439438
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Recently, the Muon optimizer based on matrix orthogonalization has demonstrated strong results in training small-scale language models, but the scalability to larger models has not been proven. We identify two crucial techniques for scaling up Muon: (1) adding weight decay and (2) carefully adjusting the per-parameter update scale. These techniques allow Muon to work out-of-the-box on large-scale training without the need of hyper-parameter tuning. Scaling law experiments indicate that Muon achieves $\sim\!2\times$ computational efficiency compared to AdamW with compute optimal training. Based on these improvements, we introduce Moonlight, a 3B/16B-parameter Mixture-of-Expert (MoE) model trained with 5.7T tokens using Muon. Our model improves the current Pareto frontier, achieving better performance with much fewer training FLOPs compared to prior models. We open-source our distributed Muon implementation that is memory optimal and communication efficient. We also release the pretrained, instruction-tuned, and intermediate checkpoints to support future research.

Related papers

AdaMuon: Adaptive Muon Optimizer [11.281916426508216]
We propose AdaMuon, an adaptive learning-rate framework built upon the recently validated Muon, which has demonstrated substantial efficiency gains over AdamW in large-scale model training.<n>Our method introduces no additional tuning burden and can be seamlessly integrated into existing Muon training pipelines.
arXiv Detail & Related papers (2025-07-15T05:49:37Z)
MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining [60.02032710118597]
We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages.<n>MiMo-7B-Base is pre-trained on 25 trillion tokens, with additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed.<n>The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code and general reasoning tasks, surpassing the performance of OpenAI o1-mini.
arXiv Detail & Related papers (2025-05-12T14:30:11Z)
Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs [111.69640966866059]
Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of most capable language models.<n>In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs.<n>The key goals are better usage of the computing resources under the dynamic sparse model structures and materializing the expected performance gain on the actual hardware.
arXiv Detail & Related papers (2025-05-07T15:46:36Z)
Practical Efficiency of Muon for Pretraining [13.914926836677648]
We find that Muon is more effective than AdamW in retaining data efficiency at large batch sizes.<n>We present a simple algorithm that accounts for all sources of error in muP while introducing only a modest overhead in resources.
arXiv Detail & Related papers (2025-05-04T19:14:43Z)
Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [52.34735382627312]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks.<n>Existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling.<n>We present T1 to scale reinforcement learning by encouraging exploration and understand inference scaling.
arXiv Detail & Related papers (2025-01-20T18:33:33Z)
2 OLMo 2 Furious [126.72656187302502]
OLMo 2 includes dense autoregressive models with improved architecture and training recipe.<n>Our updated pretraining data mixture introduces a new, specialized data mix called Dolmino Mix 1124.<n>Our fully open OLMo 2-Instruct models are competitive with or surpassing open-weight only models of comparable size.
arXiv Detail & Related papers (2024-12-31T21:55:10Z)
Llama 3 Meets MoE: Efficient Upcycling [1.8337958765930928]
We present an efficient training recipe leveraging pre-trained dense checkpoints, training an 8-Expert Top-2 MoE model from Llama 3-8B with less than $1%$ of typical pre-training compute. Our approach enhances downstream performance on academic benchmarks, achieving a $textbf2%$ improvement in 0-shot accuracy on MMLU. We also integrate online upcycling in NeMo for seamless use of pre-trained weights, enabling cost-effective development of high-capacity MoE models.
arXiv Detail & Related papers (2024-12-13T08:22:19Z)
Revisiting MoE and Dense Speed-Accuracy Comparisons for LLM Training [45.97480866595295]
Mixture-of-Experts (MoE) enjoys performance gain by increasing model capacity while keeping computation cost constant. We adopt a 3D sharding method that keeps the dense-to-MoE step time increase within a healthy range.
arXiv Detail & Related papers (2024-05-23T21:00:53Z)
Toward Inference-optimal Mixture-of-Expert Large Language Models [55.96674056805708]
We study the scaling law of MoE-based large language models (LLMs) We find that MoEs with a few (4/8) experts are the most serving efficient solution under the same performance, but costs 2.5-3.5x more in training. We propose to amend the scaling law of MoE by introducing inference efficiency as another metric besides the validation loss.
arXiv Detail & Related papers (2024-04-03T16:33:42Z)
eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception. Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency. We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
CPM-2: Large-scale Cost-effective Pre-trained Language Models [71.59893315671997]
We present a suite of cost-effective techniques for the use of PLMs to deal with the efficiency issues of pre-training, fine-tuning, and inference. We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch. We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources.
arXiv Detail & Related papers (2021-06-20T15:43:54Z)
Large Product Key Memory for Pretrained Language Models [12.932177565788974]
Product key memory (PKM) enables to improve prediction accuracy by increasing model capacity efficiently with insignificant computational overhead. Motivated by the recent success of pretrained language models (PLMs), we investigate how to incorporate large PKM into PLMs that can be fine for a wide variety of downstream NLP tasks.
arXiv Detail & Related papers (2020-10-08T10:19:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.