Fugu-MT Paper Translation (Abstract): Seesaw: Accelerating Training by Balancing Learning Rate and Batch Size Scheduling

Paper summary: Seesaw: Accelerating Training by Balancing Learning Rate and Batch Size Scheduling

arxiv url: http://arxiv.org/abs/2510.14717v1
Date: Thu, 16 Oct 2025 14:17:38 GMT
Status: Translation complete
Last updated in system: 2025-10-17 21:15:14.896482
Title: Seesaw: Accelerating Training by Balancing Learning Rate and Batch Size Scheduling
Authors: Alexandru Meterez, Depen Morwani, Jingfeng Wu, Costin-Andrei Oncescu, Cengiz Pehlevan, Sham Kakade
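Abstract summary: Increasing the batch size during training is a promising strategy for accelerating large language model pretraining. This work develops a principled framework for batch-size scheduling. Whenever a standard scheduler would halve the learning rate, Seesaw instead multiplies it by $1/\sqrt{2}$ and doubles the batch size.
Reference score (independently computed attention score): 75.36692892951018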
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Increasing the batch size during training -- a "batch ramp" -- is a promising strategy to accelerate large language model pretraining. While for SGD, doubling the batch size can be equivalent to halving the learning rate, the optimal strategy for adaptive optimizers like Adam is less clear. As a result, any batch-ramp scheduling, if used at all, is typically tuned heuristically. This work develops a principled framework for batch-size scheduling and introduces Seesaw: whenever a standard scheduler would halve the learning rate, Seesaw instead multiplies it by $1/\sqrt{2}$ and doubles the batch size, preserving loss dynamics while reducing serial steps. Theoretically, we provide, to our knowledge, the first finite-sample proof of equivalence between learning-rate decay and batch-size ramp-up for SGD on noisy linear regression, and we extend this equivalence to normalized SGD, a tractable proxy for Adam, under a variance-dominated regime observed in practice. Empirically, on 150M/300M/600M-parameter models trained at Chinchilla scale using a constant (critical) batch size, Seesaw matches cosine decay at equal FLOPs while reducing wall-clock time by $\approx 36\%$, approaching the theoretical limit implied by our analysis.
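The rule stated in the abstract (replace each learning-rate halving with a $1/\sqrt{2}$ learning-rate factor plus a batch-size doubling) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `seesaw_stage` and `serial_steps`, the stage-wise view of the schedule, and the sample counts in the usage note are all hypothetical.

```python
import math


def seesaw_stage(num_halvings, base_lr, base_batch):
    """Return (lr, batch) after the baseline scheduler would have halved
    its learning rate `num_halvings` times.

    Assumption: the baseline is viewed as a sequence of discrete halvings.
    Each halving is replaced by lr *= 1/sqrt(2) and batch *= 2.
    """
    lr = base_lr * (1.0 / math.sqrt(2.0)) ** num_halvings
    batch = base_batch * 2 ** num_halvings
    return lr, batch


def serial_steps(samples_per_stage, base_batch, use_seesaw):
    """Count optimizer steps when stage k consumes samples_per_stage[k] samples.

    With a constant batch size every stage uses base_batch; with the
    Seesaw-style ramp, stage k uses base_batch * 2**k, so it needs
    fewer serial steps for the same number of samples.
    """
    total = 0
    for k, samples in enumerate(samples_per_stage):
        batch = base_batch * 2 ** k if use_seesaw else base_batch
        total += math.ceil(samples / batch)
    return total
```

As a usage example with made-up numbers, three stages of 4096 samples each at a base batch size of 64 take 192 steps with a constant batch size and 112 steps with the ramp, while both process the same total number of samples. These figures only illustrate the mechanism and are not meant to reproduce the wall-clock reduction reported in the abstract.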


Related papers