Fugu-MT Paper Translation (Abstract): Spectral Scaling Laws of Muon

Paper overview: Spectral Scaling Laws of Muon

arxiv url: http://arxiv.org/abs/2606.04058v1
Date: Tue, 02 Jun 2026 11:31:21 GMT
Status: Translation complete
Last updated in system: 2026-06-04 20:44:18.275991
Title: Spectral Scaling Laws of Muon
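Title (reference translation): Spectral Scaling Laws of Muon
Authors: Gagik Magakyan, Pablo Parrilo, Asuman Ozdaglar
Abstract summary: We examine how the singular value spectrum of the momentum matrices behaves during training. We track singular value quantiles of the momentum buffer in models ranging from 77M to 2.8B parameters. Our laws give practitioners a principled, layer-aware recipe for choosing the minimum NS configuration that still orthonormalizes the directions that matter.
Reference score (independently computed attention score): 0.0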
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Orthonormalized update rules have rapidly become a leading choice of optimizer for training large language models, with recent open-source state-of-the-art models adopting Muon. To keep these updates tractable, Muon performs the orthonormalization with the Newton--Schulz (NS) iteration. Since NS is only approximate, directions with small singular values fail to be orthonormalized. In Muon, NS is applied to the momentum matrix at every step, yet little is known about how the singular value spectrum of these momentum matrices behaves during training, or how that behavior changes with model size. We present the first systematic study of this question. Tracking singular value quantiles of the momentum buffer across layers in models ranging from 77M to 2.8B parameters, we observe a consistent picture: after a short burn-in, the quantiles stabilize at a value determined by the layer type and model size. These stabilization values follow remarkably clean power laws in model size, with layer-dependent exponents. Layers up to mid-late depth scale very mildly with model size $M$ (around $M^{-0.25}$), so the standard 5-step NS configuration used at academic scale will continue to orthonormalize them at much larger scales. Some of the late layers, however, scale much more aggressively (up to $M^{-0.96}$) and will fall into the NS failure regime at frontier scale unless one uses more NS iterations or better-tuned coefficients. NS iterations are computationally expensive at scale; our laws give practitioners a principled, layer-aware recipe for choosing the minimum NS configuration that still orthonormalizes the directions that matter -- avoiding unnecessary computation without sacrificing update quality.
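The abstract's claim that directions with small singular values fail to be orthonormalized can be illustrated with a minimal sketch. It assumes the quintic Newton-Schulz polynomial and the coefficients (3.4445, -4.7750, 2.0315) found in widely used public Muon implementations; the abstract does not say which coefficients the paper uses. For an input normalized so that its singular values lie in [0, 1], each NS step acts on every singular value s independently as s -> a*s + b*s^3 + c*s^5, so the behavior can be studied on scalars. The function name `ns_singular_value` is hypothetical.

```python
# Assumed coefficients, taken from common public Muon implementations.
# They are NOT confirmed by the abstract above.
A, B, C = 3.4445, -4.7750, 2.0315


def ns_singular_value(s, steps=5):
    """Apply `steps` quintic Newton-Schulz steps to a single singular value.

    For X = U diag(s) V^T, the matrix update
        X <- A*X + B*(X X^T) X + C*(X X^T)^2 X
    keeps U and V fixed and maps each singular value through the
    scalar polynomial used below.
    """
    for _ in range(steps):
        s = A * s + B * s**3 + C * s**5
    return s


if __name__ == "__main__":
    # Compare how far different starting singular values get in 5 and 10 steps.
    for s0 in (1e-4, 1e-3, 1e-2, 1e-1, 0.5):
        print(f"s0={s0:g}  after 5 steps: {ns_singular_value(s0, 5):.4f}  "
              f"after 10 steps: {ns_singular_value(s0, 10):.4f}")
```

Near zero the polynomial is approximately linear with slope A, so five steps amplify a tiny singular value by at most about A^5 (roughly 485). A singular value well below 1/485 of the largest one therefore stays far from 1 after the standard 5 steps, which is the failure regime the abstract describes; more iterations or different coefficients move that threshold.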
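Abstract (reference translation): Orthonormalized update rules have become a leading choice of optimizer for training large language models, with recent open-source state-of-the-art models adopting Muon. To keep these updates tractable, Muon performs the orthonormalization with the Newton--Schulz (NS) iteration. Because NS is only approximate, directions with small singular values are not orthonormalized. In Muon, NS is applied to the momentum matrix at every step, yet little is known about how the singular value spectrum of these momentum matrices behaves during training, or how that behavior changes with model size. We present the first systematic study of this question. Tracking singular value quantiles of the momentum buffer in models from 77M to 2.8B parameters, we find that after a short burn-in the quantiles stabilize at a value determined by the layer type and model size. These stabilization values follow remarkably clean power laws in model size, with layer-dependent exponents. Layers up to mid-late depth scale very mildly with model size $M$ (around $M^{-0.25}$), so the standard 5-step NS configuration used at academic scale will continue to orthonormalize them at much larger scales. Some of the late layers, however, scale more aggressively (up to $M^{-0.96}$) and will fall into the NS failure regime at frontier scale unless more NS iterations or better-tuned coefficients are used. NS iterations are computationally expensive at scale; our laws give practitioners a principled, layer-aware recipe for choosing the minimum NS configuration that still orthonormalizes the directions that matter, avoiding unnecessary computation without sacrificing update quality.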
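The power-law statement in the abstract (stabilization values scaling like $M^{-0.25}$ for most layers and up to $M^{-0.96}$ for some late layers) suggests a simple workflow: fit an exponent on small models, then extrapolate. The sketch below shows one way this could be done with a log-log least-squares fit. The helper names `fit_power_law` and `predict` are hypothetical, and the data are synthetic values generated from an assumed law, not measurements from the paper.

```python
import math


def fit_power_law(sizes, values):
    """Fit value = c * size**alpha by least squares in log-log space.

    Returns (c, alpha).
    """
    xs = [math.log(m) for m in sizes]
    ys = [math.log(v) for v in values]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    c = math.exp(y_mean - alpha * x_mean)
    return c, alpha


def predict(c, alpha, size):
    """Evaluate the fitted power law at a new model size."""
    return c * size**alpha


# Synthetic, illustrative data only (NOT results from the paper):
# quantile values generated from an assumed law 0.5 * M**-0.25.
sizes = [77e6, 300e6, 1e9, 2.8e9]
values = [0.5 * m**-0.25 for m in sizes]

c, alpha = fit_power_law(sizes, values)

if __name__ == "__main__":
    print(f"fitted exponent: {alpha:.4f}")
    print(f"extrapolated value at 100B parameters: {predict(c, alpha, 1e11):.6f}")
```

Comparing the extrapolated quantile for a layer against the smallest singular value that a given NS configuration can still push close to 1 would indicate whether that layer needs more iterations at the target scale. The abstract does not give the paper's exact procedure, so this is only one plausible reading.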

Related papers