Fugu-MT Paper Translation (Summary): Why Muon Outperforms Adam: A Curvature Perspective

Paper summary: Why Muon Outperforms Adam: A Curvature Perspective

arxiv url: http://arxiv.org/abs/2606.04662v1
Date: Wed, 03 Jun 2026 09:40:30 GMT
Status: Translation complete
Last updated in system: 2026-06-04 20:44:18.660515
Title: Why Muon Outperforms Adam: A Curvature Perspective
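Title (reference translation): Why Muon Outperforms Adam
Authors: Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Dirk Bergemann, Zhuoran Yang
Abstract summary: Muon improves training efficiency over Adam by about a factor of two in large language-model training. Our work takes a first step toward demystifying Muon's superiority over Adam from a curvature perspective.
Reference score (independently computed attention score): 49.85900116602357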
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Muon improves training efficiency over Adam in large language-model training by about two times, but the local geometric source of this advantage remains unclear. Our work takes a first step toward demystifying Muon's superiority over Adam from a curvature perspective. First, we apply a second-order Taylor approximation to the training landscape and show that Muon achieves a larger one-step loss decrease than Adam at matched validation loss. The two optimizers have comparable first-order gains, but Muon consistently incurs a smaller second-order curvature penalty. Second, we decompose this curvature penalty into the squared update norm and Normalized Directional Sharpness (NDS). We find that Muon and Adam have comparable update norms, so Muon's smaller curvature penalty is driven by lower NDS, not update scale. Third, we study how training data and model structure shape Muon's NDS advantage. Using Zipf-Probabilistic Context-Free Grammar (PCFG) data with controlled imbalance, we show that data imbalance amplifies Muon's NDS advantage over Adam. A within-/cross-layer decomposition further shows that, in the middle and late stages of training, Muon's lower NDS is mainly sustained by smaller within-layer curvature. Beyond empirical evidence, we analyze stylized quadratic problems with heterogeneous curvature and gradient alignment toward high-curvature modes. We prove that Muon attains a smaller average NDS than GD by balancing update energy across curvature groups; when curvature heterogeneity is sufficiently strong, this also yields lower local quadratic loss after the same number of steps.
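Abstract (reference translation): Muon improves training efficiency over Adam by about a factor of two in large language-model training, but the local geometric source of this advantage remains unclear. Our work takes a first step toward demystifying Muon's superiority over Adam from a curvature perspective. First, we apply a second-order Taylor approximation to the training landscape and show that, at matched validation loss, Muon achieves a larger one-step loss decrease than Adam. The two optimizers have comparable first-order gains, but Muon consistently incurs a smaller second-order curvature penalty. Second, we decompose this curvature penalty into the squared update norm and Normalized Directional Sharpness (NDS). Because Muon and Adam have comparable update norms, Muon's smaller curvature penalty is driven by lower NDS rather than by update scale. Third, we examine how training data and model structure shape Muon's NDS advantage. Using Zipf-Probabilistic Context-Free Grammar (PCFG) data with controlled imbalance, we show that data imbalance amplifies Muon's NDS advantage over Adam. A within-/cross-layer decomposition further shows that, in the middle and late stages of training, Muon's lower NDS is mainly sustained by smaller within-layer curvature. Beyond empirical evidence, we analyze stylized quadratic problems with heterogeneous curvature and gradient alignment toward high-curvature modes. We prove that Muon attains a smaller average NDS than GD by balancing update energy across curvature groups; when curvature heterogeneity is sufficiently strong, this also yields lower local quadratic loss after the same number of steps.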
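The decomposition described in the abstract can be illustrated on a toy quadratic. The following is a minimal sketch, not the paper's code. It assumes NDS is defined as d^T H d / ||d||^2 with the Euclidean norm (the abstract does not state the formula), and it uses a norm-matched sign update purely as a stand-in for an update that spreads energy evenly across curvature directions; it does not implement Muon or Adam. The names `taylor_terms`, `H`, `gd_update`, and `balanced_update` are hypothetical.

```python
import numpy as np

def taylor_terms(grad, hessian, update):
    """Second-order Taylor terms for one step w -> w + update.

    Returns (first_order_gain, curvature_penalty, sq_norm, nds), where
    predicted loss decrease = first_order_gain - curvature_penalty and
    curvature_penalty = 0.5 * sq_norm * nds (assumed NDS definition).
    """
    first_order_gain = -float(grad @ update)      # -g^T d
    quad = float(update @ hessian @ update)       # d^T H d
    sq_norm = float(update @ update)              # ||d||^2
    nds = quad / sq_norm                          # assumed: d^T H d / ||d||^2
    curvature_penalty = 0.5 * quad
    return first_order_gain, curvature_penalty, sq_norm, nds

def loss(w, hessian):
    """Toy quadratic loss L(w) = 0.5 * w^T H w."""
    return 0.5 * float(w @ hessian @ w)

# Heterogeneous curvature: one sharp direction, three flat ones.
H = np.diag([100.0, 1.0, 1.0, 1.0])
w = np.ones(4)
g = H @ w                                         # gradient of the quadratic

# Plain gradient-descent update.
gd_update = -0.01 * g

# Stand-in "balanced" update: sign of the gradient, rescaled to the
# same Euclidean norm as the GD update so only the direction differs.
balanced_update = -np.sign(g)
balanced_update *= np.linalg.norm(gd_update) / np.linalg.norm(balanced_update)

gd_gain, gd_penalty, gd_sq_norm, gd_nds = taylor_terms(g, H, gd_update)
bal_gain, bal_penalty, bal_sq_norm, bal_nds = taylor_terms(g, H, balanced_update)

print(f"GD:       sq_norm={gd_sq_norm:.4f}  NDS={gd_nds:.4f}")
print(f"balanced: sq_norm={bal_sq_norm:.4f}  NDS={bal_nds:.4f}")
```

With matched update norms, any difference in the curvature penalty between the two updates comes entirely from NDS, which mirrors the abstract's argument. In this toy setup the GD update concentrates on the sharp direction, so its NDS is higher than that of the balanced update. Because the loss is exactly quadratic here, the Taylor prediction matches the true loss change; for a real network it is only a local approximation.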


Related papers