Fugu-MT paper translation (abstract): One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs

Paper overview: One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs

arxiv url: http://arxiv.org/abs/2605.22297v1
Date: Thu, 21 May 2026 10:46:23 GMT
Status: Translation complete
System update date: 2026-05-22 16:35:42.21517
Title: One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs
Title (reference translation): One LR does not fit all: heavy-tail guided layerwise learning rates for LLMs
Authors: Di He, Songjun Tu, Keyu Wang, Lu Yin, Shiwei Liu
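Abstract summary: Layerwise Learning Rate (LLR) is an adaptive scheme that assigns distinct learning rates to individual Transformer layers. LLR promotes balanced training across layers, leading to faster convergence and improved generalization.
Reference score (independently computed attention score): 19.49856488618013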
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Learning rate configuration is a fundamental aspect of modern deep learning. The prevailing practice of applying a uniform learning rate across all layers overlooks the structural heterogeneity of Transformers, potentially limiting their effectiveness as the backbone of Large Language Models (LLMs). In this paper, we introduce Layerwise Learning Rate (LLR), an adaptive scheme that assigns distinct learning rates to individual Transformer layers. Our method is grounded in Heavy-Tailed Self-Regularization (HT-SR) theory, which characterizes the empirical spectral density (ESD) of weight correlation matrices to quantify heavy-tailedness. Layers with weaker heavy-tailedness are assigned larger learning rates to accelerate their training, while layers with stronger heavy-tailedness receive smaller learning rates. By tailoring learning rates in this manner, LLR promotes balanced training across layers, leading to faster convergence and improved generalization. Extensive experiments across architectures (from LLaMA to GPT-nano), optimizers (AdamW and Muon), and parameter scales (60M-1B) demonstrate that LLR achieves up to 1.5x training speedup and outperforms baselines, notably raising average zero-shot accuracy from 47.09% to 49.02%. A key advantage of LLR is its low tuning overhead: it transfers nearly optimal LR settings directly from the uniform baseline. Code is available at https://github.com/hed-ucas/Layer-wise-Learning-Rate.
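Abstract (reference translation): Learning rate configuration is a fundamental aspect of modern deep learning. The common practice of applying a uniform learning rate to all layers overlooks the structural heterogeneity of Transformers and may limit their effectiveness as the backbone of Large Language Models (LLMs). This paper introduces Layerwise Learning Rate (LLR), an adaptive scheme that assigns distinct learning rates to individual Transformer layers. The method is grounded in Heavy-Tailed Self-Regularization (HT-SR) theory, which characterizes the empirical spectral density (ESD) of weight correlation matrices to quantify heavy-tailedness. Layers with weaker heavy-tailedness are assigned larger learning rates to accelerate their training, while layers with stronger heavy-tailedness receive smaller learning rates. Tailoring learning rates in this way promotes balanced training across layers, leading to faster convergence and improved generalization. Extensive experiments across architectures (from LLaMA to GPT-nano), optimizers (AdamW and Muon), and parameter scales (60M-1B) show that LLR achieves up to 1.5x training speedup and outperforms baselines, raising average zero-shot accuracy from 47.09% to 49.02%. A key advantage of LLR is its low tuning overhead: it transfers nearly optimal LR settings directly from the uniform baseline. Code is available at https://github.com/hed-ucas/Layer-wise-Learning-Rate.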
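The rule stated in the abstract (weaker heavy-tailedness gets a larger learning rate) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that heavy-tailedness is measured by a power-law exponent alpha fitted to the tail of the ESD (smaller alpha meaning a heavier tail, as is conventional in HT-SR work), that alpha is estimated with a Hill estimator, and that alpha is mapped linearly onto a learning-rate multiplier. The function names `hill_alpha` and `layerwise_lrs` and the parameters `k_frac`, `lo`, and `hi` are hypothetical.

```python
import math


def hill_alpha(eigenvalues, k_frac=0.5):
    """Hill estimate of the power-law exponent of an eigenvalue tail.

    `eigenvalues` are assumed to be the eigenvalues of a layer's weight
    correlation matrix W^T W (the squared singular values of W).
    `k_frac` (assumed) is the fraction of eigenvalues treated as the tail.
    """
    eigs = sorted(e for e in eigenvalues if e > 0)
    n = len(eigs)
    if n < 2:
        raise ValueError("need at least two positive eigenvalues")
    k = min(max(1, int(n * k_frac)), n - 1)
    threshold = eigs[n - k - 1]  # largest eigenvalue below the tail
    log_sum = sum(math.log(e / threshold) for e in eigs[n - k:])
    if log_sum == 0:
        raise ValueError("tail eigenvalues are all equal to the threshold")
    return 1.0 + k / log_sum


def layerwise_lrs(alphas, base_lr, lo=0.5, hi=1.5):
    """Map per-layer alpha values to per-layer learning rates.

    Larger alpha (weaker heavy-tailedness) gets a larger learning rate.
    The linear mapping and the multiplier range [lo, hi] are assumptions.
    """
    a_min, a_max = min(alphas), max(alphas)
    if a_max == a_min:
        return [base_lr for _ in alphas]
    span = a_max - a_min
    return [base_lr * (lo + (hi - lo) * (a - a_min) / span) for a in alphas]
```

In practice the resulting per-layer values would be passed to an optimizer as one parameter group per layer and recomputed periodically during training. How often to recompute, and the exact mapping, are choices the abstract does not specify.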

Related papers