Fugu-MT Paper Translation (Abstract): Rethinking Language Model Scaling under Transferable Hypersphere Optimization

Paper summary: Rethinking Language Model Scaling under Transferable Hypersphere Optimization

arxiv url: http://arxiv.org/abs/2603.28743v2
Date: Sun, 05 Apr 2026 02:15:47 GMT
Status: Translation complete
Last updated in system: 2026-04-07 12:54:54.544297
Title: Rethinking Language Model Scaling under Transferable Hypersphere Optimization
Title (reference translation): Rethinking Language Model Scaling under Transferable Hypersphere Optimization
Authors: Liliang Ren, Yang Liu, Yelong Shen, Weizhu Chen
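Abstract summary: We introduce HyperP, the first framework for transferring optimal learning rates across model width, depth, training tokens, and Mixture-of-Experts (MoE) granularity. A single base learning rate tuned at the smallest scale transfers across all compute budgets under HyperP, yielding $1.58\times$ compute efficiency over a strong Muon baseline at $6\times10^{21}$ FLOPs. We also propose SqrtGate, an MoE gating mechanism derived from the hypersphere constraint that preserves output RMS across MoE granularities.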
Reference score (independently calculated attention score): 67.38433364607897
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Scaling laws for large language models depend critically on the optimizer and parameterization. Existing hyperparameter transfer laws are mainly developed for first-order optimizers, and they do not structurally prevent training instability at scale. Recent hypersphere optimization methods constrain weight matrices to a fixed-norm hypersphere, offering a promising alternative for more stable scaling. We introduce HyperP (Hypersphere Parameterization), the first framework for transferring optimal learning rates across model width, depth, training tokens, and Mixture-of-Experts (MoE) granularity under the Frobenius-sphere constraint with the Muon optimizer. We prove that weight decay is a first-order no-op on the Frobenius sphere, show that Depth-$μ$P remains necessary, and find that the optimal learning rate follows the same data-scaling power law with the "magic exponent" 0.32 previously observed for AdamW. A single base learning rate tuned at the smallest scale transfers across all compute budgets under HyperP, yielding $1.58\times$ compute efficiency over a strong Muon baseline at $6\times10^{21}$ FLOPs. Moreover, HyperP delivers transferable stability: all monitored instability indicators, including $Z$-values, output RMS, and activation outliers, remain bounded and non-increasing under training FLOPs scaling. We also propose SqrtGate, an MoE gating mechanism derived from the hypersphere constraint that preserves output RMS across MoE granularities for improved granularity scaling, and show that hypersphere optimization enables substantially larger auxiliary load-balancing weights, yielding both strong performance and good expert balance. We release our training codebase at https://github.com/microsoft/ArchScale.
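Abstract (reference translation): Scaling laws for large language models depend heavily on the optimizer and the parameterization. Existing hyperparameter transfer laws were developed mainly for first-order optimizers and do not structurally prevent training instability at scale. Recent hypersphere optimization methods constrain weight matrices to a fixed-norm hypersphere, which offers a promising alternative for more stable scaling. We introduce HyperP (Hypersphere Parameterization), the first framework for transferring optimal learning rates across model width, depth, training tokens, and Mixture-of-Experts (MoE) granularity under the Frobenius-sphere constraint with the Muon optimizer. We prove that weight decay is a first-order no-op on the Frobenius sphere, show that Depth-$μ$P remains necessary, and find that the optimal learning rate follows the same data-scaling power law, with the "magic exponent" 0.32, that was previously observed for AdamW. A single base learning rate tuned at the smallest scale transfers across all compute budgets under HyperP, giving $1.58\times$ compute efficiency over a strong Muon baseline at $6\times10^{21}$ FLOPs. In addition, HyperP provides transferable stability: all monitored instability indicators, including $Z$-values, output RMS, and activation outliers, remain bounded and non-increasing as training FLOPs scale. We also propose SqrtGate, an MoE gating mechanism derived from the hypersphere constraint that preserves output RMS across MoE granularities for improved granularity scaling, and we show that hypersphere optimization allows substantially larger auxiliary load-balancing weights, yielding both strong performance and good expert balance. The training codebase is released at https://github.com/microsoft/ArchScale.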
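
The abstract's claim that weight decay is a first-order no-op on the Frobenius sphere can be checked numerically. The following is a minimal sketch, not the paper's proof and not the ArchScale implementation. It assumes decoupled weight decay, re-projection by simple rescaling to a fixed Frobenius norm, and a random vector standing in for the Muon update; the function names (`project`, `sphere_step`, `decay_gap`) are hypothetical. Under those assumptions the decay term only rescales the weights before re-projection, so it is equivalent to using the slightly larger learning rate `lr / (1 - lr * weight_decay)`, a change of order `lr**2`.

```python
import math
import random


def frob_norm(w):
    """Frobenius norm of a matrix stored as a flat list of entries."""
    return math.sqrt(sum(x * x for x in w))


def project(w, radius):
    """Rescale w back onto the Frobenius sphere of the given radius (assumed projection)."""
    n = frob_norm(w)
    return [radius * x / n for x in w]


def sphere_step(w, update, lr, radius, weight_decay=0.0):
    """One step with optional decoupled weight decay, followed by re-projection."""
    stepped = [(1.0 - lr * weight_decay) * x - lr * u for x, u in zip(w, update)]
    return project(stepped, radius)


def decay_gap(lr, weight_decay=0.1, size=64, radius=1.0, seed=0):
    """Distance between the post-step weights with and without weight decay."""
    rng = random.Random(seed)
    w = project([rng.gauss(0, 1) for _ in range(size)], radius)
    u = [rng.gauss(0, 1) for _ in range(size)]  # stand-in for an optimizer update
    with_decay = sphere_step(w, u, lr, radius, weight_decay)
    without_decay = sphere_step(w, u, lr, radius, 0.0)
    return frob_norm([a - b for a, b in zip(with_decay, without_decay)])


if __name__ == "__main__":
    for lr in (1e-1, 1e-2, 1e-3):
        print(f"lr={lr:g}  gap={decay_gap(lr):.3e}")
```

In this sketch, shrinking `lr` by a factor of 10 should shrink the gap by roughly a factor of 100, which is the second-order behaviour the example is meant to illustrate.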


Related papers