Fugu-MT Paper Translation (Abstract): To Use or not to Use Muon: How Simplicity Bias in Optimizers Matters

Paper overview: To Use or not to Use Muon: How Simplicity Bias in Optimizers Matters

arxiv url: http://arxiv.org/abs/2603.00742v1
Date: Sat, 28 Feb 2026 17:37:15 GMT
Status: Translation complete
Last updated in system: 2026-03-03 19:50:56.346585
Title: To Use or not to Use Muon: How Simplicity Bias in Optimizers Matters
Title (reference translation): To Use Muon or Not: How Much the Simplicity Bias in Optimization Matters
Authors: Sara Dragutinović, Rajesh Ranganath
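Abstract summary: Among recently introduced optimizers, Muon is perhaps the most popular because of its superior training speed. This paper examines potential downsides arising from the mechanism that drives this speedup. Muon may struggle to uncover common underlying structure across tasks and may be more prone to fitting spurious features.
Reference score (independently computed attention score): 16.624341041698013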
License: http://creativecommons.org/licenses/by/4.0/
Abstract: For a long period of time, Adam has served as the ubiquitous default choice for training deep neural networks. Recently, many new optimizers have been introduced, out of which Muon has perhaps gained the highest popularity due to its superior training speed. While many papers set out to validate the benefits of Muon, our paper investigates the potential downsides stemming from the mechanism driving this speedup. We explore the biases induced when optimizing with Muon, providing theoretical analysis and its consequences to the learning trajectories and solutions learned. While the theory does provide justification for the benefits Muon brings, it also guides our intuition when coming up with a couple of examples where Muon-optimized models have disadvantages. The core problem we emphasize is that Muon optimization removes a simplicity bias that is naturally preserved by older, more thoroughly studied methods like Stochastic Gradient Descent (SGD). We take first steps toward understanding consequences this may have: Muon might struggle to uncover common underlying structure across tasks, and be more prone to fitting spurious features. More broadly, this paper should serve as a reminder: when developing new optimizers, it is essential to consider the biases they introduce, as these biases can fundamentally change a model's behavior -- for better or for worse.
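Abstract (reference translation): For a long time, Adam has served as the ubiquitous default choice for training deep neural networks. Recently, many new optimizers have been introduced, and among them Muon has perhaps become the most popular because of its superior training speed. While many papers have set out to validate Muon's benefits, this paper examines the potential downsides arising from the mechanism that drives this speedup. We explore the biases induced when optimizing with Muon, and provide a theoretical analysis together with its consequences for the learning trajectories and the solutions learned. The theory justifies the benefits Muon brings, but it also guides the intuition behind a couple of examples in which Muon-optimized models are at a disadvantage. The core problem we emphasize is that Muon optimization removes a simplicity bias that is naturally preserved by older, more thoroughly studied methods such as Stochastic Gradient Descent (SGD). Muon may struggle to uncover common underlying structure across tasks and may be more prone to fitting spurious features. More broadly, this paper serves as a reminder: when developing new optimizers, it is essential to consider the biases they introduce, because these biases can fundamentally change a model's behavior, for better or for worse.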

Related papers