Fugu-MT Paper Translation (Abstract): Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective

Paper overview: Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective

arxiv url: http://arxiv.org/abs/2410.05192v1
Date: Tue, 29 Oct 2024 06:26:00 GMT
Status: Translation complete
Last updated in system: 2024-11-01 23:49:12.236287
Title: Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective
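Title (reference translation): Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective
Authors: Kaiyue Wen, Zhiyuan Li, Jason Wang, David Hall, Percy Liang, Tengyu Ma
Abstract summary: The Warmup-Stable-Decay (WSD) schedule uses a constant learning rate to produce a main branch of iterates that can continue indefinitely without a pre-specified compute budget. The authors conjecture that the pretraining loss exhibits a river valley landscape, which resembles a deep valley with a river at its bottom. Inspired by this theory, WSD-S is a variant of WSD that reuses previous checkpoints' decay phases and keeps only one main branch.
Reference score (independently computed attention score): 66.80315289020487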
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Training language models currently requires pre-determining a fixed compute budget because the typical cosine learning rate schedule depends on the total number of steps. In contrast, the Warmup-Stable-Decay (WSD) schedule uses a constant learning rate to produce a main branch of iterates that can in principle continue indefinitely without a pre-specified compute budget. Then, given any compute budget, one can branch out from the main branch at a proper time with a rapidly decaying learning rate to produce a strong model. Empirically, WSD generates a non-traditional loss curve: the loss remains elevated during the stable phase but sharply declines during the decay phase. Towards explaining this phenomenon, we conjecture that pretraining loss exhibits a river valley landscape, which resembles a deep valley with a river at its bottom. Under this assumption, we show that during the stable phase, the iterate undergoes large oscillations due to the high learning rate, yet it progresses swiftly along the river. During the decay phase, the rapidly dropping learning rate minimizes the iterate's oscillations, moving it closer to the river and revealing true optimization progress. Therefore, the sustained high learning rate phase and fast decaying phase are responsible for progress in the river and the mountain directions respectively, and are both critical. Our analysis predicts phenomena consistent with empirical observations and shows that this landscape can emerge from pretraining on a simple bi-gram dataset. Inspired by the theory, we introduce WSD-S, a variant of WSD that reuses previous checkpoints' decay phases and keeps only one main branch, where we resume from a decayed checkpoint. WSD-S empirically outperforms WSD and Cyclic-Cosine in obtaining multiple language model checkpoints across various compute budgets in a single run for parameters scaling from 0.1B to 1.2B.
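The schedule described in the abstract can be sketched as a small function. This is an illustrative sketch only, not the authors' implementation: the names `wsd_lr` and `branch_schedule`, the linear warmup, the linear decay shape, and all default values are assumptions made here for the example. It shows the property the abstract emphasizes, namely that the warmup and stable phases do not depend on the total step count, so schedules for different compute budgets share a common prefix (the main branch).

```python
def wsd_lr(step, peak_lr, warmup_steps, decay_start, decay_steps, min_lr=0.0):
    """Learning rate at `step` for a warmup-stable-decay schedule (hypothetical helper)."""
    if step < warmup_steps:
        # Warmup: ramp linearly up to peak_lr (linear shape is an assumption).
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable: constant learning rate, independent of the total budget.
        return peak_lr
    # Decay: fall linearly from peak_lr to min_lr (the decay shape is an assumption).
    progress = min(1.0, (step - decay_start + 1) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


def branch_schedule(budget, peak_lr=1e-3, warmup_steps=50, decay_steps=100):
    """Full per-step schedule when branching off so that training ends at `budget` steps."""
    decay_start = budget - decay_steps
    return [
        wsd_lr(s, peak_lr, warmup_steps, decay_start, decay_steps)
        for s in range(budget)
    ]


short_run = branch_schedule(1000)
long_run = branch_schedule(2000)

# Both budgets follow the same main branch until the shorter run starts decaying.
shared_prefix = short_run[:900] == long_run[:900]
print("shared main branch:", shared_prefix)
```

A cosine schedule, by contrast, would give different learning rates at every step for the two budgets, which is why it requires fixing the budget in advance.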
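Abstract (reference translation): Training language models currently requires fixing the compute budget in advance, because the typical cosine learning rate schedule depends on the total number of steps. In contrast, the Warmup-Stable-Decay (WSD) schedule uses a constant learning rate to produce a main branch of iterates that can, in principle, continue indefinitely without a pre-specified compute budget. Given any compute budget, one can then branch off from the main branch at a suitable time with a rapidly decaying learning rate to produce a strong model. Empirically, WSD generates a non-traditional loss curve: the loss remains elevated during the stable phase but drops sharply during the decay phase. To explain this phenomenon, the authors conjecture that the pretraining loss exhibits a river valley landscape, resembling a deep valley with a river at its bottom. Under this assumption, they show that during the stable phase the iterate oscillates strongly because of the high learning rate, yet progresses swiftly along the river. During the decay phase, the rapidly dropping learning rate minimizes the iterate's oscillations, moving it closer to the river and revealing the true optimization progress. The sustained high learning rate phase and the fast decay phase are therefore responsible for progress in the river direction and the mountain direction respectively, and both are critical. The analysis predicts phenomena consistent with empirical observations and shows that this landscape can emerge from pretraining on a simple bi-gram dataset. Inspired by the theory, WSD-S is a variant of WSD that reuses previous checkpoints' decay phases and keeps only one main branch, resuming from a decayed checkpoint. WSD-S empirically outperforms WSD and Cyclic-Cosine in obtaining multiple language model checkpoints across various compute budgets in a single run, for parameter counts from 0.1B to 1.2B.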
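The river valley intuition can be illustrated with a deliberately simple toy, which is an assumption made here and not the model analyzed in the paper. Take a loss `-g * x + 0.5 * a * y**2`, where `x` runs along a gently sloping river and `y` is a sharp hillside direction, and assume SGD adds zero-mean gradient noise of variance `sigma**2` in `y`. Instead of sampling noise, the sketch tracks the expected squared oscillation `E[y^2]` exactly, so it is deterministic. All names and constants are hypothetical.

```python
# Toy "river valley" (illustrative assumption, not the paper's exact model):
#   loss(x, y) = -g * x + 0.5 * a * y**2
# With gradient noise of variance sigma**2 in y, one SGD step gives
#   E[y^2] <- (1 - lr * a)**2 * E[y^2] + (lr * sigma)**2


def simulate(lrs, a=15.0, g=0.1, sigma=2.0):
    """Return per-step (river_term, hill_term) of the expected loss."""
    x, v = 0.0, 0.0  # position along the river, expected squared hillside offset
    history = []
    for lr in lrs:
        x += lr * g  # gradient step along the river
        v = (1.0 - lr * a) ** 2 * v + (lr * sigma) ** 2  # expected oscillation size
        history.append((-g * x, 0.5 * a * v))
    return history


PEAK_LR, STABLE_STEPS, DECAY_STEPS = 0.1, 500, 100
stable = [PEAK_LR] * STABLE_STEPS
decay = [PEAK_LR * (1.0 - (i + 1) / DECAY_STEPS) for i in range(DECAY_STEPS)]
history = simulate(stable + decay)

river_stable, hill_stable = history[STABLE_STEPS - 1]
river_final, hill_final = history[-1]
loss_stable = river_stable + hill_stable
loss_final = river_final + hill_final

# Expected-loss drop per step at the end of the stable phase vs. averaged over decay.
prev_river, prev_hill = history[STABLE_STEPS - 2]
stable_drop_per_step = (prev_river + prev_hill) - loss_stable
decay_drop_per_step = (loss_stable - loss_final) / DECAY_STEPS

print(f"hill term: stable={hill_stable:.4f}, after decay={hill_final:.4f}")
print(f"drop per step: stable={stable_drop_per_step:.5f}, decay={decay_drop_per_step:.5f}")
```

In this toy, the hillside term stays at a fixed elevated level while the learning rate is constant, even though the river term keeps improving, and it shrinks once the learning rate decays. That mirrors, in a much simplified form, the abstract's description of a loss that stays elevated in the stable phase and falls sharply in the decay phase.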

Related papers