Fugu-MT Paper Translation (Abstract): How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

Paper overview: How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

arxiv url: http://arxiv.org/abs/2605.14200v1
Date: Wed, 13 May 2026 23:32:00 GMT
Status: Translation complete
Last updated in system: 2026-05-15 21:45:34.539827
Title: How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization
Title (reference translation): From muP to the Maximally Scale-Stable Parameterization
Authors: Leena Chennuru Vankadara, Moritz Haas, Luke Hayward, Sebastian Bordt, Alessandro Breccia,
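Abstract summary: We take a principled step toward resolving this gap by analyzing three different scaling regimes. For each regime, we develop a novel Dynamical Mean Field Theory (DMFT) description of the limiting training dynamics of MoEs. We show that the resulting $μ$P prescription does not reliably induce monotonic improvement with scale or robust learning-rate transfer.
Reference score (independently computed attention metric): 45.69980208532521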
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent frontier large language models predominantly rely on Mixture-of-Experts (MoE) architectures. Despite empirical progress, there is still no principled understanding of how hyperparameters should scale with network width $N$, expert width $N_e$, number of experts $M$, sparsity $K$, and depth $L$ to ensure both stability and optimal performance at scale. We take a principled step toward resolving this gap by analyzing three different scaling regimes: (I) co-scaling $N\asymp N_e$, (II) co-scaling $N\asymp M\asymp K$, and (III) full proportional scaling of $N, N_e, M$, and $K$. For each regime, we develop a novel Dynamical Mean Field Theory (DMFT) description of the limiting training dynamics of MoEs that provides a formal foundation for our analysis. Within this framework, we derive the unique parameterization for SGD and Adam satisfying all maximal-update ($μ$) desiderata. We then show that the resulting $μ$P prescription does not reliably induce monotonic improvement with scale or robust learning-rate transfer. We trace these pathologies to scale-dependent observables in the aggregation dynamics, which motivates a refined set of desiderata that we term maximal scale stability. Guided by this principle, we derive a Maximally Scale-Stable Parameterization (MSSP) for both SGD and Adam in all three scaling regimes, and characterize the corresponding limiting dynamics - qualitatively distinct from the $μ$P limit - through a separate DMFT analysis. Experiments verify that MSSP robustly recovers learning rate transfer and monotonic improvement with scale across regimes. Combined with existing depth-scaling theory, these results provide a complete scaling prescription for MoE architectures as a function of width, depth, expert width, and number of experts.
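Abstract (reference translation): Recent frontier large language models rely predominantly on Mixture-of-Experts (MoE) architectures. Despite empirical progress, there is still no principled understanding of how hyperparameters should scale with network width $N$, expert width $N_e$, number of experts $M$, sparsity $K$, and depth $L$. We take a principled step toward resolving this gap by analyzing three scaling regimes: (I) co-scaling $N\asymp N_e$, (II) co-scaling $N\asymp M\asymp K$, and (III) full proportional scaling of $N, N_e, M$, and $K$. For each regime, we formulate a new Dynamical Mean Field Theory (DMFT) of the limiting training dynamics of MoEs, which serves as the formal basis for our analysis. Within this framework, we derive the unique parameterization for SGD and Adam that satisfies all maximal-update ($μ$) desiderata. We then show that the resulting $μ$P prescription does not reliably produce monotonic improvement with scale or robust learning-rate transfer. These pathologies trace back to scale-dependent observables in the aggregation dynamics, which motivates a refined set of desiderata that we call maximal scale stability. Guided by this principle, we derive a Maximally Scale-Stable Parameterization (MSSP) for SGD and Adam in all three scaling regimes, and characterize the corresponding limiting dynamics (qualitatively distinct from the $μ$P limit) through a separate DMFT analysis. Experiments confirm that MSSP robustly recovers learning-rate transfer and monotonic improvement with scale across regimes. Combined with existing depth-scaling theory, these results provide a complete scaling prescription for MoE architectures as a function of width, depth, expert width, and number of experts.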

Related papers