Fugu-MT Paper Translation (Abstract): Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate

Paper overview: Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate

arxiv url: http://arxiv.org/abs/2605.21486v1
Date: Wed, 20 May 2026 17:59:40 GMT
Status: Translation complete
System update date: 2026-05-21 19:19:56.837663
Title: Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate
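Title (reference translation): Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate
Authors: Dayal Singh Kalra, Maissam Barkeshli
Abstract summary: We show that Maximal Update ($μ$P) provides higher-quality learning rate transfer than standard parameterization. The overwhelming benefit of $μ$P over standard parameterization comes from maximizing the learning rate of the embedding layer.
Reference score (independently computed attention score): 10.599439539657787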
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Hyperparameter transfer allows extrapolating optimal optimization hyperparameters from small to large scales, making it critical for training large language models (LLMs). This is done either by fitting a scaling law to the hyperparameters or by a judicious choice of parameterization, such as Maximal Update ($μ$P), that renders optimal hyperparameters approximately scale invariant. In this paper, we first develop a framework to quantify hyperparameter transfer through three metrics: (1) the quality of the scaling law fit, (2) the robustness to extrapolation errors, and (3) the asymptotic loss penalty due to choice of parameterization. Next, we investigate through a comprehensive series of ablations why $μ$P appears to offer high-quality learning rate transfer relative to standard parameterization (SP), as existing theory is inadequate. We find that the overwhelming benefit of $μ$P relative to SP when training with AdamW arises simply from maximizing the learning rate of the embedding layer. In SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities; increasing it by a factor of width to match $μ$P dramatically smooths out training while improving hyperparameter transfer. We also find that weight decay improves the scaling law fits, while, in the fixed token-per-parameter setting, it hurts the robustness of the extrapolation.
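Abstract (reference translation): Hyperparameter transfer makes it possible to extrapolate optimal optimization hyperparameters from small scales to large ones, which makes it essential for training large language models (LLMs). It is achieved either by fitting a scaling law to the hyperparameters or by a judicious choice of parameterization, such as Maximal Update ($μ$P), under which the optimal hyperparameters are approximately scale invariant. This paper first develops a framework that quantifies hyperparameter transfer through three metrics: (1) the quality of the scaling law fit, (2) the robustness to extrapolation errors, and (3) the asymptotic loss penalty due to the choice of parameterization. Next, because existing theory is inadequate, a comprehensive series of ablations investigates why $μ$P appears to offer high-quality learning rate transfer relative to standard parameterization (SP). When training with AdamW, the overwhelming benefit of $μ$P over SP is found to arise simply from maximizing the learning rate of the embedding layer. In SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities; increasing it by a factor of width to match $μ$P dramatically smooths training while improving hyperparameter transfer. Weight decay improves the scaling law fits but, in the fixed token-per-parameter setting, hurts the robustness of the extrapolation.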
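
The first route mentioned in the abstract, fitting a scaling law to the hyperparameters, can be illustrated with a small sketch. This is not the authors' procedure. It assumes, purely for illustration, that the optimal learning rate follows a power law in model width, $η^*(n) = a \, n^{b}$, and fits it by ordinary least squares in log-log space. The function names (`fit_power_law`, `extrapolate_lr`) and the sweep values are hypothetical, and $R^2$ is used only as a stand-in for "quality of the scaling law fit", which the abstract does not define.

```python
import math


def fit_power_law(widths, optimal_lrs):
    """Fit log(lr) = log(a) + b * log(width) by ordinary least squares.

    Returns (a, b, r_squared) for the assumed power law lr = a * width**b.
    """
    xs = [math.log(w) for w in widths]
    ys = [math.log(lr) for lr in optimal_lrs]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n

    # Closed-form slope and intercept of a one-variable linear regression.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx
    log_a = y_mean - b * x_mean

    # R^2 in log space, used here as a simple (assumed) fit-quality measure.
    ss_res = sum((y - (log_a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return math.exp(log_a), b, r_squared


def extrapolate_lr(a, b, width):
    """Predict the optimal learning rate at an unseen width."""
    return a * width ** b


# Made-up sweep results for illustration only; these are NOT from the paper.
widths = [256, 512, 1024, 2048]
optimal_lrs = [8e-3, 4e-3, 2e-3, 1e-3]

a, b, r2 = fit_power_law(widths, optimal_lrs)
predicted = extrapolate_lr(a, b, 8192)
print(f"a={a:.4g}, b={b:.4g}, R^2={r2:.4g}, predicted lr at width 8192: {predicted:.4g}")
```

Under a parameterization that makes the optimum approximately scale invariant, the fitted exponent `b` would be close to zero, so the small-scale learning rate could be reused directly. The made-up data above instead has the optimum shrinking with width, which is the case where a fitted law is needed to extrapolate.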
