Fugu-MT Paper Translation (Abstract): MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models

Paper overview: MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models

arxiv url: http://arxiv.org/abs/2605.19619v1
Date: Tue, 19 May 2026 09:56:27 GMT
Status: Translation complete
Last updated in system: 2026-05-20 15:03:09.265725
Title: MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models
Title (reference translation): MiMuon: A Mixed Muon Optimizer with Improved Generalization for Large Models
Authors: Feihu Huang, Yuning Luo, Songcan Chen
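Abstract summary: We study the generalization error of Muon based on algorithmic stability and mathematical induction. We then propose an effective mixed Muon (MiMuon) optimizer, a hybrid of Muon and momentum-based SGD that applies gradient orthogonalization cautiously. Our MiMuon algorithm has the same convergence rate of $O\big(\frac{1}{T^{1/4}}\big)$ as the Muon algorithm.
Reference score (independently computed attention score): 45.11415579822849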
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Matrix-structured parameters appear frequently in artificial intelligence models such as large language models. Recently, the efficient Muon optimizer was designed for the matrix parameters of large-scale models, and it shows markedly faster convergence than vector-wise algorithms. Although some works have begun to study the convergence properties (i.e., optimization error) of the Muon optimizer, its generalization properties (i.e., generalization error) have not yet been established. In this paper, we therefore study the generalization error of the Muon optimizer based on algorithmic stability and mathematical induction, and prove that Muon has a generalization error of $O\big(\frac{1}{N\kappa^{T}}\big)$, where $N$ is the training sample size, $T$ denotes the number of iterations, and $\kappa>0$ denotes the minimum difference between singular values of the gradient estimate. To enhance the generalization of Muon, we propose an effective mixed Muon (MiMuon) optimizer that uses orthogonalization of the gradient cautiously and is a hybrid of the Muon and momentum-based SGD optimizers. We then prove that our MiMuon optimizer has a generalization error of $O\big(\frac{1}{N}\big)$, lower than the $O\big(\frac{1}{N\kappa^{T}}\big)$ of the Muon optimizer, since $\kappa$ is generally very small. We also study the convergence properties of our MiMuon algorithm and prove that it has the same convergence rate of $O\big(\frac{1}{T^{1/4}}\big)$ as the Muon algorithm. Numerical experiments on training large models, including Qwen3-0.6B and YOLO26m, demonstrate the efficiency of the MiMuon optimizer.
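As a rough illustration of why the $\kappa^{T}$ factor matters, the ratio of the two generalization bounds can be worked out for sample values. The values $\kappa = 0.5$ and $T = 20$ below are illustrative assumptions, not figures from the paper, and the constants hidden by the $O(\cdot)$ notation are ignored:

$$
\frac{1/(N\kappa^{T})}{1/N} = \kappa^{-T}, \qquad \kappa = 0.5,\ T = 20 \;\Rightarrow\; \kappa^{-T} = 2^{20} = 1048576 .
$$

Under these assumed values the Muon bound is about $10^{6}$ times looser than the $O\big(\frac{1}{N}\big)$ bound, and the gap grows geometrically with $T$ whenever $\kappa < 1$.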
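Abstract (reference translation): Matrix-structured parameters often appear in many artificial intelligence models, such as large language models. More recently, an efficient Muon optimizer has been designed for the matrix parameters of large-scale models, and it converges much faster than vector-wise algorithms. Some studies have begun to examine the convergence properties (i.e., optimization error) of the Muon optimizer, but its generalization properties (i.e., generalization error) have not yet been established. In this work, we study the generalization error of the Muon optimizer based on algorithmic stability and mathematical induction, and prove that it is $O\big(\frac{1}{N\kappa^{T}}\big)$, where $N$ is the training sample size, $T$ is the number of iterations, and $\kappa>0$ is the minimum difference between singular values of the gradient estimate. To improve the generalization of Muon, we propose an effective mixed Muon (MiMuon) optimizer, a hybrid of the Muon and momentum-based SGD optimizers that uses orthogonalization of the gradient cautiously. We then show that our MiMuon optimizer has a generalization error of $O\big(\frac{1}{N}\big)$, which is lower than the $O\big(\frac{1}{N\kappa^{T}}\big)$ of the Muon optimizer because $\kappa$ is generally very small. We also examine the convergence properties of our MiMuon algorithm and prove that it has the same convergence rate of $O\big(\frac{1}{T^{1/4}}\big)$ as the Muon algorithm. Experiments on training large models such as Qwen3-0.6B and YOLO26m demonstrate the efficiency of the MiMuon optimizer.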
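The abstract describes MiMuon only as a hybrid of Muon and momentum-based SGD that uses gradient orthogonalization cautiously; it does not give the update rule. The following is a minimal NumPy sketch of that general idea, not the authors' algorithm. The gating rule (fall back to a plain momentum step when the smallest gap between singular values is below a threshold), the function names, and all hyperparameter values are hypothetical. An exact SVD is used for the orthogonalization for clarity, whereas practical Muon implementations typically approximate it with a Newton-Schulz iteration.

```python
import numpy as np


def orthogonalize(m):
    """Return U @ Vt from the thin SVD of m (its orthogonal polar factor)."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def min_singular_gap(m):
    """Smallest absolute difference between consecutive singular values of m."""
    s = np.linalg.svd(m, compute_uv=False)
    if s.size < 2:
        return float("inf")
    return float(np.min(np.abs(np.diff(s))))


def mixed_step(w, grad, buf, lr=0.02, beta=0.9, gap_threshold=1e-3):
    """One hypothetical mixed update on a matrix parameter w.

    Assumption (not from the paper): the orthogonalized, Muon-style
    direction is used only when the singular values of the momentum
    buffer are separated by at least gap_threshold; otherwise the step
    is an ordinary momentum-SGD step.
    """
    buf = beta * buf + grad  # heavy-ball momentum buffer
    if min_singular_gap(buf) >= gap_threshold:
        update = orthogonalize(buf)  # Muon-style direction
        used_orth = True
    else:
        update = buf  # momentum-SGD direction
        used_orth = False
    return w - lr * update, buf, used_orth


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 3))
    buf = np.zeros_like(w)
    for step in range(3):
        grad = rng.standard_normal(w.shape)  # stand-in for a real gradient
        w, buf, used_orth = mixed_step(w, grad, buf)
        print(step, "orthogonalized" if used_orth else "momentum SGD")
```

The threshold plays the role of a guard on the singular-value gap that the abstract calls $\kappa$; how the paper actually decides between the two kinds of step would need to be taken from the full text.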


Related papers