Fugu-MT Paper Translation (Abstract): Fast Catch-Up, Late Switching: Optimal Batch Size Scheduling via Functional Scaling Laws

Paper abstract: Fast Catch-Up, Late Switching: Optimal Batch Size Scheduling via Functional Scaling Laws

arxiv url: http://arxiv.org/abs/2602.14208v1
Date: Sun, 15 Feb 2026 16:06:45 GMT
Status: Translation complete
System update date: 2026-02-17 16:22:49.737362
Title: Fast Catch-Up, Late Switching: Optimal Batch Size Scheduling via Functional Scaling Laws
Authors: Jinbo Wang, Binghui Li, Zhanpeng Zhou, Mingze Wang, Yuxuan Sun, Jiaqi Zhang, Xunliang Cai, Lei Wu
Abstract summary: Batch size scheduling (BSS) plays a critical role in large-scale deep learning training. We show that the functional scaling law framework introduced in Li et al. (2025a) provides a principled lens for analyzing BSS.
Reference score (independently computed attention score): 37.651943549758634
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Batch size scheduling (BSS) plays a critical role in large-scale deep learning training, influencing both optimization dynamics and computational efficiency. Yet, its theoretical foundations remain poorly understood. In this work, we show that the functional scaling law (FSL) framework introduced in Li et al. (2025a) provides a principled lens for analyzing BSS. Specifically, we characterize the optimal BSS under a fixed data budget and show that its structure depends sharply on task difficulty. For easy tasks, optimal schedules keep increasing batch size throughout. In contrast, for hard tasks, the optimal schedule maintains small batch sizes for most of training and switches to large batches only in a late stage. To explain the emergence of late switching, we uncover a dynamical mechanism -- the fast catch-up effect -- which also manifests in large language model (LLM) pretraining. After switching from small to large batches, the loss rapidly aligns with the constant large-batch trajectory. Using FSL, we show that this effect stems from rapid forgetting of accumulated gradient noise, with the catch-up speed determined by task difficulty. Crucially, this effect implies that large batches can be safely deferred to late training without sacrificing performance, while substantially reducing data consumption. Finally, extensive LLM pretraining experiments -- covering both Dense and MoE architectures with up to 1.1B parameters and 1T tokens -- validate our theoretical predictions. Across all settings, late-switch schedules consistently outperform constant-batch and early-switch baselines.
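The late-switch idea in the abstract can be made concrete with a little bookkeeping. The sketch below is only an illustration under assumed values, not the paper's derived optimal schedule: the function name `two_stage_schedule`, the parameter `switch_fraction`, and all numbers are hypothetical. It builds a two-stage schedule under a fixed token budget and shows how the switch point changes the number of optimizer steps. It says nothing about the resulting loss.

```python
def two_stage_schedule(token_budget, small_batch, large_batch, switch_fraction):
    """Return per-step batch sizes (in tokens) for a small-then-large schedule.

    Hypothetical helper for illustration only: the first `switch_fraction`
    of the token budget is consumed at `small_batch`, the rest at
    `large_batch`. Tokens that do not fill a whole step are dropped.
    """
    if not 0.0 <= switch_fraction <= 1.0:
        raise ValueError("switch_fraction must lie in [0, 1]")
    if small_batch <= 0 or large_batch <= 0:
        raise ValueError("batch sizes must be positive")

    # Tokens assigned to the small-batch stage, rounded to whole steps.
    small_tokens = round(token_budget * switch_fraction)
    small_steps = small_tokens // small_batch

    # Everything left over is spent in the large-batch stage.
    remaining = token_budget - small_steps * small_batch
    large_steps = remaining // large_batch

    return [small_batch] * small_steps + [large_batch] * large_steps


if __name__ == "__main__":
    # Assumed toy numbers: 1M-token budget, batches of 1k and 8k tokens.
    budget, small, large = 1_000_000, 1_000, 8_000
    for name, frac in [("early switch", 0.2), ("late switch", 0.8)]:
        schedule = two_stage_schedule(budget, small, large, frac)
        print(f"{name}: {len(schedule)} steps, {sum(schedule)} tokens")
```

With these toy numbers, both schedules consume the same 1,000,000 tokens, but the late switch takes 825 optimizer steps against 300 for the early switch, because most of the budget is spent at the smaller batch size.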

Related papers