LiMuon: Light and Fast Muon Optimizer for Large Models
- URL: http://arxiv.org/abs/2509.14562v2
- Date: Fri, 19 Sep 2025 07:40:32 GMT
- Title: LiMuon: Light and Fast Muon Optimizer for Large Models
- Authors: Feihu Huang, Yuning Luo, Songcan Chen
- Abstract summary: We propose a light and fast Muon (LiMuon) optimizer for training large models. Our LiMuon has a lower memory footprint than the current Muon and its variants. We prove that our LiMuon has a sample complexity of $O(\epsilon^{-3})$ under the generalized smooth condition.
- Score: 45.11415579822849
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large models have recently been widely applied in artificial intelligence, so efficient training of large models has received widespread attention. More recently, the Muon optimizer was specifically designed for the matrix-structured parameters of large models. Although some works have begun to study the Muon optimizer, the existing Muon and its variants still suffer from high sample complexity or high memory cost for large models. To fill this gap, we propose a light and fast Muon (LiMuon) optimizer for training large models, which builds on the momentum-based variance reduction technique and randomized Singular Value Decomposition (SVD). Our LiMuon optimizer has lower memory usage than the current Muon and its variants. Moreover, we prove that our LiMuon has a lower sample complexity of $O(\epsilon^{-3})$ for finding an $\epsilon$-stationary solution of non-convex stochastic optimization under the smooth condition. The existing convergence analysis of the Muon optimizer mainly relies on the strict Lipschitz smoothness assumption, while some artificial intelligence tasks such as training large language models (LLMs) do not satisfy this condition. We also prove that our LiMuon optimizer has a sample complexity of $O(\epsilon^{-3})$ under the generalized smooth condition. Numerical experimental results on training DistilGPT2 and ViT models verify the efficiency of our LiMuon optimizer.
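The abstract describes LiMuon as combining a momentum-based variance-reduced gradient estimator with a randomized SVD applied to the matrix-shaped update. The sketch below is a minimal, hedged illustration of that recipe, not the authors' reference implementation: it assumes a STORM-style variance-reduced momentum and a Halko-style randomized SVD, and names such as `lr`, `beta`, and `rank` are illustrative choices rather than values from the paper.

```python
import numpy as np

def randomized_svd(M, rank, n_oversample=5):
    """Approximate rank-`rank` SVD of M via random projection (Halko-style sketch)."""
    m, n = M.shape
    k = min(rank + n_oversample, min(m, n))
    Omega = np.random.randn(n, k)        # random test matrix
    Q, _ = np.linalg.qr(M @ Omega)       # orthonormal basis for the range of M @ Omega
    B = Q.T @ M                          # small k x n projection of M
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :rank], s[:rank], Vt[:rank, :]

def limuon_like_step(W, grad, prev_grad, momentum, lr=0.02, beta=0.9, rank=32):
    """One illustrative update for a matrix parameter W (names and defaults are assumptions).

    grad:      stochastic gradient at the current iterate on the current mini-batch.
    prev_grad: stochastic gradient at the previous iterate on the *same* mini-batch,
               needed by the STORM-style correction term.
    momentum:  running variance-reduced gradient estimate d_{t-1}, same shape as W.
    """
    # Momentum-based variance reduction (STORM-style estimator):
    #   d_t = g_t + (1 - beta) * (d_{t-1} - g_{t-1}), with g_{t-1} taken on the same batch.
    momentum = grad + (1.0 - beta) * (momentum - prev_grad)

    # Muon-style orthogonalized direction, approximated with a low-rank randomized SVD
    # instead of an exact SVD / Newton-Schulz iteration.
    U, _, Vt = randomized_svd(momentum, rank)
    update = U @ Vt                      # approximate U V^T of the momentum matrix

    return W - lr * update, momentum
```

The low-rank randomized SVD here stands in for the exact SVD or Newton-Schulz orthogonalization used by Muon, which is plausibly where the memory and compute savings come from; keeping the momentum in a low-rank factored form, as the paper's memory claim suggests, would require additional bookkeeping not shown in this sketch.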
Related papers
- Extending $μ$P: Spectral Conditions for Feature Learning Across Optimizers [3.5708391029226885]
We propose a novel framework to derive $μ$P for a broader class of optimizers, including AdamW, AD, LAMB, Sophia, Shampoo and Muon. We implement our $μ$Ps on multiple benchmark models and demonstrate zero-shot learning rate transfer across increasing model width.
arXiv Detail & Related papers (2026-02-24T14:17:51Z) - Muon is Provably Faster with Momentum Variance Reduction [55.388203260208485]
Recent empirical research has demonstrated that deep learning optimizers based on the linear minimization oracle (LMO) over specifically chosen non-Euclidean norms can outperform Adam-type methods in training large language models.
arXiv Detail & Related papers (2025-12-18T14:38:39Z) - Beyond the Ideal: Analyzing the Inexact Muon Update [54.70108543057578]
We present the first analysis of the inexact update at Muon's core. We reveal a fundamental coupling between this inexactness and the optimal step size and momentum.
arXiv Detail & Related papers (2025-10-22T18:01:07Z) - NorMuon: Making Muon more efficient and scalable [71.49702449498085]
We propose NorMuon (Neuron-wise Normalized Muon) as a successor to Adam. We show that NorMuon consistently outperforms both Adam and Muon, achieving 21.74% better training efficiency than Adam and an 11.31% improvement over Muon in a 1.1B-parameter pretraining setting.
arXiv Detail & Related papers (2025-10-07T01:13:41Z) - On the Convergence of Muon and Beyond [31.900178928104648]
We provide the first proof that variance reduction enables Muon-MVR2 to attain the optimal complexity. Overall, this work offers the first proof of optimality for a Muon-style optimizer.
arXiv Detail & Related papers (2025-09-19T09:43:37Z) - AdaMuon: Adaptive Muon Optimizer [11.281916426508216]
AdaMuon combines element-wise adaptivity with orthogonal updates for large-scale neural network training. AdaMuon maintains training stability while surpassing Adam by more than 40% in training efficiency in large-scale scenarios.
arXiv Detail & Related papers (2025-07-15T05:49:37Z) - Reparameterized LLM Training via Orthogonal Equivalence Transformation [54.80172809738605]
We present POET, a novel training algorithm that uses Orthogonal Equivalence Transformation to optimize neurons. POET can stably optimize the objective function with improved generalization. We develop efficient approximations that make POET flexible and scalable for training large-scale neural networks.
arXiv Detail & Related papers (2025-06-09T17:59:34Z) - Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order [38.99428012275441]
Fine-tuning Large Language Models (LLMs) is essential for adapting pre-trained models to downstream tasks. Traditional first-order algorithms incur prohibitive memory and computational costs that scale poorly with model size. We propose zero-order (ZO) optimization methods as a memory- and compute-efficient alternative.
arXiv Detail & Related papers (2025-06-04T20:27:17Z) - Muon is Scalable for LLM Training [50.68746986439438]
We introduce Moonlight, a Mixture-of-Experts (MoE) model trained with 5.7T tokens using Muon. Our model improves the current frontier, achieving better performance with far fewer training FLOPs than prior models. We open-source our distributed Muon implementation that is memory-optimal and communication-efficient.
arXiv Detail & Related papers (2025-02-24T09:12:29Z) - AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z) - Mnemosyne: Learning to Train Transformers with Transformers [18.36543176998175]
We show that Mnemosyne can successfully train Transformers while using simple meta-training strategies that require minimal computational resources.
Mnemosyne provides space complexity comparable to that of its hand-designed first-order counterparts, which allows it to scale to training larger sets of parameters.
arXiv Detail & Related papers (2023-02-02T14:40:28Z)