Fugu-MT Paper Translation (Abstract): One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs

Paper overview: One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs

arxiv url: http://arxiv.org/abs/2605.22297v2
Date: Tue, 26 May 2026 06:04:34 GMT
Status: Translation complete
Last updated in system: 2026-05-27 17:51:40.885517
Title: One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs
Title (reference translation): One LR Does Not Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs
Authors: Di He, Songjun Tu, Keyu Wang, Lu Yin, Shiwei Liu
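Abstract summary: Layerwise Learning Rate (LLR) is an adaptive scheme that assigns distinct learning rates to individual Transformer layers. LLR achieves up to 1.5x training speedup and consistently outperforms uniform-learning-rate baselines.
Reference score (independently computed attention measure): 19.49856488618013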
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Learning rate configuration is a fundamental aspect of modern deep learning. The prevailing practice of applying a uniform learning rate across all layers overlooks the structural heterogeneity of Transformers, potentially limiting their effectiveness as the backbone of Large Language Models (LLMs). In this paper, we introduce Layerwise Learning Rate (LLR), an adaptive scheme that assigns distinct learning rates to individual Transformer layers. Our method is grounded in Heavy-Tailed Self-Regularization (HT-SR) theory, which characterizes the empirical spectral density (ESD) of weight correlation matrices to quantify heavy-tailedness. Layers with weaker heavy-tailedness are assigned larger learning rates to accelerate training, while layers with stronger heavy-tailedness receive smaller learning rates. By tailoring learning rates in this manner, LLR promotes more balanced training across layers, leading to faster convergence and improved generalization. Extensive experiments across architectures ranging from LLaMA to GPT-nano, optimizers including AdamW and Muon, and model scales from 60M to 3B parameters with up to 100B training tokens demonstrate the effectiveness of LLR. LLR achieves up to 1.5x training speedup and consistently outperforms uniform-learning-rate baselines. In particular, it improves the average zero-shot accuracy of 1B models from 47.09% to 49.02%, and that of 3B models from 48.58% to 50.61%. A key advantage of LLR is its low tuning overhead: it can transfer nearly optimal learning-rate settings directly from the uniform baseline. Code is available at https://github.com/hed-ucas/Layer-wise-Learning-Rate.
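Abstract (reference translation): Configuring the learning rate is a fundamental aspect of modern deep learning. The common practice of applying one uniform learning rate to all layers overlooks the structural heterogeneity of Transformers and may limit their effectiveness as the backbone of large language models (LLMs). This paper introduces Layerwise Learning Rate (LLR), an adaptive scheme that assigns a distinct learning rate to each Transformer layer. The method is based on Heavy-Tailed Self-Regularization (HT-SR) theory, which characterizes the empirical spectral density (ESD) of weight correlation matrices in order to quantify heavy-tailedness. Layers with weaker heavy-tailedness are assigned larger learning rates to accelerate training, while layers with stronger heavy-tailedness receive smaller learning rates. Adjusting learning rates in this way promotes more balanced training across layers, which leads to faster convergence and better generalization. Extensive experiments covering architectures from LLaMA to GPT-nano, optimizers including AdamW and Muon, and model scales from 60M to 3B parameters with up to 100B training tokens demonstrate the effectiveness of LLR. LLR achieves up to 1.5x training speedup and consistently outperforms uniform-learning-rate baselines. In particular, it improves the average zero-shot accuracy of 1B models from 47.09% to 49.02% and that of 3B models from 48.58% to 50.61%. A key advantage of LLR is its low tuning overhead: nearly optimal learning-rate settings can be transferred directly from the uniform baseline. Code is available at https://github.com/hed-ucas/Layer-wise-Learning-Rate.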
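
The rule described in the abstract (larger learning rates for layers with weaker heavy-tailedness, smaller ones for layers with stronger heavy-tailedness) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a Hill-type estimator for the power-law exponent of the ESD, where a smaller exponent means a heavier tail, and a linear mapping from exponent to learning rate. The names `hill_alpha` and `layerwise_lrs` and the parameters `k_frac`, `s_min`, and `s_max` are hypothetical and do not come from the paper.

```python
import numpy as np


def hill_alpha(weight, k_frac=0.5):
    """Hill-type estimate of the power-law exponent of the ESD of W^T W.

    Assumption: a smaller alpha indicates a heavier-tailed spectrum.
    Assumes the matrix has full rank, so all eigenvalues are positive.
    """
    # Eigenvalues of W^T W are the squared singular values of W.
    eigs = np.sort(np.linalg.svd(weight, compute_uv=False) ** 2)
    n = len(eigs)
    # Number of largest eigenvalues treated as the tail (hypothetical choice).
    k = max(1, min(n - 1, int(n * k_frac)))
    tail = eigs[n - k:]
    threshold = eigs[n - k - 1]
    return 1.0 + k / np.sum(np.log(tail / threshold))


def layerwise_lrs(alphas, base_lr, s_min=0.5, s_max=1.5):
    """Map per-layer alphas linearly to learning rates.

    The layer with the largest alpha (weakest heavy tail) gets
    base_lr * s_max; the layer with the smallest alpha gets base_lr * s_min.
    """
    alphas = np.asarray(alphas, dtype=float)
    span = alphas.max() - alphas.min()
    if span == 0:
        # All layers look alike: fall back to the uniform learning rate.
        return np.full_like(alphas, base_lr)
    t = (alphas - alphas.min()) / span
    return base_lr * (s_min + t * (s_max - s_min))


# Toy usage with two synthetic "layers": Gaussian entries versus
# heavy-tailed Cauchy entries.
rng = np.random.default_rng(0)
layers = {
    "gaussian": rng.standard_normal((256, 128)),
    "cauchy": rng.standard_cauchy((256, 128)),
}
alphas = [hill_alpha(w) for w in layers.values()]
for name, alpha, lr in zip(layers, alphas, layerwise_lrs(alphas, base_lr=1e-3)):
    print(f"{name}: alpha={alpha:.2f}, lr={lr:.2e}")
```

In a real training loop the resulting values would be passed to the optimizer as per-layer learning rates and recomputed periodically; how often, and with which estimator and scaling range, is left open here because the abstract does not specify it.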

Related papers