Fugu-MT Paper Translation (Abstract): Muon Outperforms Adam in Tail-End Associative Memory Learning

Paper summary: Muon Outperforms Adam in Tail-End Associative Memory Learning

arxiv url: http://arxiv.org/abs/2509.26030v1
Date: Tue, 30 Sep 2025 10:04:08 GMT
Status: Translation complete
System update date: 2025-10-01 17:09:04.501411
Title: Muon Outperforms Adam in Tail-End Associative Memory Learning
Title (reference translation): Muon Outperforms Adam in Tail-End Associative Memory Learning
Authors: Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Cunxiao Du, Chao Du, Tianyu Pang, Zhuoran Yang, Mingyi Hong, Vincent Y. F. Tan
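Abstract summary: Muon consistently achieves balanced learning across classes regardless of feature embeddings. Our empirical observations and theoretical analyses reveal Muon's core advantage: its update rule aligns with the outer-product structure of linear associative memories.
Reference score (independently computed attention measure): 118.98991042050532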
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The Muon optimizer is consistently faster than Adam in training Large Language Models (LLMs), yet the mechanism underlying its success remains unclear. This paper demystifies this mechanism through the lens of associative memory. By ablating the transformer components optimized by Muon, we reveal that the associative memory parameters of LLMs, namely the Value and Output (VO) attention weights and Feed-Forward Networks (FFNs), are the primary contributors to Muon's superiority. Motivated by this associative memory view, we then explain Muon's superiority on real-world corpora, which are intrinsically heavy-tailed: a few classes (tail classes) appear far less frequently than others. The superiority is explained through two key properties: (i) its update rule consistently yields a more isotropic singular spectrum than Adam; and as a result, (ii) on heavy-tailed data, it optimizes tail classes more effectively than Adam. Beyond empirical evidence, we theoretically confirm these findings by analyzing a one-layer associative memory model under class-imbalanced data. We prove that Muon consistently achieves balanced learning across classes regardless of feature embeddings, whereas Adam can induce large disparities in learning errors depending on embedding properties. In summary, our empirical observations and theoretical analyses reveal Muon's core advantage: its update rule aligns with the outer-product structure of linear associative memories, enabling more balanced and effective learning of tail classes in heavy-tailed distributions than Adam.
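Abstract (reference translation): The Muon optimizer is consistently faster than Adam in training Large Language Models (LLMs). This paper demystifies the mechanism through the lens of associative memory. By ablating the transformer components optimized by Muon, we reveal that the associative memory parameters of LLMs, namely the Value and Output (VO) attention weights and Feed-Forward Networks (FFNs), are the primary contributors to Muon's superiority. Motivated by this associative memory view, we explain Muon's superiority on real-world corpora, which are intrinsically heavy-tailed: a few classes (tail classes) appear far less frequently than others. The superiority is explained through two key properties: (i) its update rule consistently yields a more isotropic singular spectrum than Adam; and, as a result, (ii) on heavy-tailed data, it optimizes tail classes more effectively than Adam. Beyond empirical evidence, we theoretically confirm these findings by analyzing a one-layer associative memory model under class-imbalanced data. We prove that Muon consistently achieves balanced learning across classes regardless of feature embeddings. Its update rule aligns with the outer-product structure of linear associative memories, enabling more balanced and effective learning of tail classes in heavy-tailed distributions than Adam.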
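
Illustrative sketch (not from the paper): the abstract's claim that an orthogonalized update "aligns with the outer-product structure of linear associative memories" can be pictured with a toy example. The setup below is an assumption made for illustration only: orthonormal key and value embeddings, class frequencies proportional to 1/k, and a gradient-like matrix built as a frequency-weighted sum of outer products. An exact SVD stands in for the Newton-Schulz iteration that Muon implementations typically use, and the comparison is against the raw matrix, not against Adam. All names (`orthogonalize`, `class_progress`, `E`, `U`, `p`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 16, 8  # embedding dimension and number of classes (illustrative values)

# Assumption: orthonormal key embeddings e_k and value embeddings u_k (columns).
E = np.linalg.qr(rng.standard_normal((d, K)))[0]
U = np.linalg.qr(rng.standard_normal((d, K)))[0]

# Assumption: heavy-tailed class frequencies, p_k proportional to 1/k.
p = 1.0 / np.arange(1, K + 1)
p /= p.sum()

# Gradient-like matrix: frequency-weighted sum of outer products u_k e_k^T.
G = (U * p) @ E.T


def orthogonalize(M, tol=1e-10):
    """Replace every nonzero singular value of M by 1 (exact SVD stand-in
    for the approximate Newton-Schulz orthogonalization)."""
    A, s, Bt = np.linalg.svd(M, full_matrices=False)
    keep = s > tol
    return A[:, keep] @ Bt[keep, :]


def class_progress(update):
    """Per-class progress u_k^T (update) e_k for k = 1..K."""
    return np.einsum("dk,de,ek->k", U, update, E)


raw = class_progress(G)                      # equals p_k under these assumptions
balanced = class_progress(orthogonalize(G))  # equals 1 for every class

print("head/tail progress ratio, raw matrix:    ", raw[0] / raw[-1])
print("head/tail progress ratio, orthogonalized:", balanced[0] / balanced[-1])
```

Under these assumptions the raw matrix advances class k in proportion to its frequency p_k, while the orthogonalized update has a flat singular spectrum and advances every class by the same amount. The paper's analysis covers more general embeddings and the actual Adam update, which this sketch does not attempt to reproduce.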

Related papers