Fugu-MT Paper Translation (Abstract): Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales

Paper overview: Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales

arxiv url: http://arxiv.org/abs/2512.05620v1
Date: Fri, 05 Dec 2025 11:03:41 GMT
Status: Translation complete
Last updated in system: 2025-12-13 22:40:57.000315
Title: Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales
Title (reference translation): Hyperparameter Transfer That Enables Consistent Gains of Matrix-Preconditioned Optimizers
Authors: Shikai Qiu, Zixi Chen, Hoang Phan, Qi Lei, Andrew Gordon Wilson
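Abstract summary: The paper studies how the optimal learning rate and weight decay should scale with model width and depth for a wide range of optimizers. Scaling the learning rate according to $μ$P improves transfer, but can still suffer from significant finite-width deviations. For compute-optimal scaling, scaling independent weight decay as $1/\mathrm{width}$ is nearly optimal across optimizers.
Reference score (independently computed attention score): 55.91454326946738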
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Several recently introduced deep learning optimizers utilizing matrix-level preconditioning have shown promising speedups relative to the current dominant optimizer AdamW, particularly in relatively small-scale experiments. However, efforts to validate and replicate their successes have reported mixed results. To better understand the effectiveness of these optimizers at scale, in this work we investigate how to scale preconditioned optimizers via hyperparameter transfer, building on prior works such as $μ$P. We study how the optimal learning rate and weight decay should scale with model width and depth for a wide range of optimizers, including Shampoo, SOAP, and Muon, accounting for the impact of commonly used techniques such as blocking and grafting. We find that scaling the learning rate according to $μ$P improves transfer, but can still suffer from significant finite-width deviations that cause drifting optimal learning rates, which we show can be mitigated by blocking and explicit spectral normalization. For compute-optimal scaling, we find scaling independent weight decay as $1/\mathrm{width}$ is nearly optimal across optimizers. Applying these scaling rules, we show Muon and Shampoo consistently achieve $1.4\times$ and $1.3\times$ speedup over AdamW for training Llama-architecture language models of sizes ranging from $190$M to $1.4$B, whereas the speedup vanishes rapidly with scale under incorrect scaling. Based on these results and further ablations, we argue that studying optimal hyperparameter transfer is essential for reliably comparing optimizers at scale given a realistic tuning budget.
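Abstract (reference translation): Several recently introduced deep learning optimizers use matrix-level preconditioning and have shown promising speedups over AdamW, the currently dominant optimizer, especially in relatively small-scale experiments. However, attempts to validate and replicate these successes have reported mixed results. To better understand how effective these optimizers are at scale, this work investigates how to scale preconditioned optimizers via hyperparameter transfer, building on prior work such as $μ$P. Taking into account the impact of commonly used techniques such as blocking and grafting, the authors study how the optimal learning rate and weight decay should scale with model width and depth for a wide range of optimizers, including Shampoo, SOAP, and Muon. Scaling the learning rate according to $μ$P improves transfer, but it can still suffer from significant finite-width deviations that cause the optimal learning rate to drift; the authors show that blocking and explicit spectral normalization mitigate this. For compute-optimal scaling, scaling independent weight decay as $1/\mathrm{width}$ is nearly optimal across optimizers. Applying these scaling rules, Muon and Shampoo consistently achieve $1.4\times$ and $1.3\times$ speedups over AdamW when training Llama-architecture language models ranging from $190$M to $1.4$B parameters, whereas the speedup vanishes rapidly with scale under incorrect scaling. Based on these results and further ablations, the authors argue that studying optimal hyperparameter transfer is essential for reliably comparing optimizers at scale given a realistic tuning budget.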
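
The $1/\mathrm{width}$ weight decay rule mentioned in the abstract can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the authors' code: the function name `scaled_weight_decay`, the base width of 256, and the base decay of 1e-2 are all hypothetical, and the sketch assumes the rule means that the product of independent weight decay and width stays constant when a value tuned on a small proxy model is carried over to a wider one.

```python
def scaled_weight_decay(base_wd: float, base_width: int, width: int) -> float:
    """Carry an independent weight decay tuned at base_width over to a new width.

    Assumption: the 1/width rule means wd * width is held constant,
    so doubling the width halves the weight decay.
    """
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    return base_wd * base_width / width


# Hypothetical values: a decay of 1e-2 tuned on a width-256 proxy model.
base_wd, base_width = 1e-2, 256
for width in (256, 512, 1024, 2048):
    wd = scaled_weight_decay(base_wd, base_width, width)
    print(f"width={width:5d}  independent weight decay={wd:.6f}")
```

The sketch covers only the weight decay rule. The learning rate scaling under $μ$P depends on the optimizer and layer type, and the abstract does not spell it out, so it is left out here.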

Related papers