Fugu-MT Paper Translation (Abstract): Critical Batch Size Revisited: A Simple Empirical Approach to Large-Batch Language Model Training

Paper overview: Critical Batch Size Revisited: A Simple Empirical Approach to Large-Batch Language Model Training

arxiv url: http://arxiv.org/abs/2505.23971v1
Date: Thu, 29 May 2025 19:53:39 GMT
Status: translation complete
System update date: 2025-06-02 19:47:52.64704
Title: Critical Batch Size Revisited: A Simple Empirical Approach to Large-Batch Language Model Training
Authors: William Merrill, Shane Arora, Dirk Groeneveld, Hannaneh Hajishirzi
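Abstract summary: This paper introduces a simple empirical approach that measures the critical batch size (CBS) directly, rather than estimating it from the gradient noise scale during training. The way the CBS changes over training motivates batch size warmup, and the same trend at 1B and 7B suggests that CBS measured on small training runs can inform larger-scale training runs.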
Reference score (attention score computed independently by this site): 47.40413739584515
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The right batch size is important when training language models at scale: a large batch size is necessary for fast training, but a batch size that is too large will harm token efficiency. To navigate this tradeoff, McCandlish et al. (2018) suggest that a critical batch size (CBS), below which training will not substantially degrade loss, can be estimated based on the gradient noise scale during training. While their method has been adopted in practice, e.g., when training GPT-3, strong assumptions are required to justify gradient noise as a proxy for the CBS, which makes it unclear whether their approach should be trusted in practice, limiting its applicability. In this paper, we introduce a simple, empirical approach to directly measure the CBS and show how the CBS evolves over training. Applying our approach to the OLMo models, we find that CBS is near 0 at initialization, increases rapidly at first, and then plateaus as training progresses. Furthermore, we find that this trend holds across different model sizes (1B and 7B), suggesting CBS from small training runs can inform larger-scale training runs. Our findings about how the CBS changes over training motivate batch size warmup as a natural way to reliably train language models at large batch size: start the batch size small and increase it as the CBS grows. To validate this claim, we use batch size warmup to train OLMo 1B to slightly better loss than the original training run with 43% fewer gradient steps. This shows how our framework can be applied to reliably train language models at larger batch sizes, increasing data parallelism without compromising performance.
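The batch size warmup idea in the abstract (start the batch size small and increase it as the CBS grows) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the saturating `assumed_cbs` curve and its `plateau` and `tau` parameters, the doubling rule, and the defaults `seq_len`, `start_batch`, and `max_batch`. The curve only mimics the qualitative shape the abstract describes (near 0 at initialization, rapid early growth, then a plateau).

```python
import math


def assumed_cbs(tokens_seen, plateau=5000.0, tau=2.0e9):
    """Hypothetical CBS curve, in sequences per batch.

    Near 0 at initialization, rises quickly, then flattens toward `plateau`.
    The functional form and constants are placeholders, not measured values.
    """
    return plateau * (1.0 - math.exp(-tokens_seen / tau))


def warmup_schedule(total_tokens, seq_len=2048, start_batch=256, max_batch=4096):
    """Simulate batch size warmup over a fixed token budget.

    The batch size starts at `start_batch` (a floor chosen for illustration)
    and is doubled whenever the assumed CBS has grown to at least twice the
    current batch size, up to `max_batch`.
    Returns (schedule, steps), where schedule is a list of (step, batch_size)
    pairs marking each change and steps is the number of gradient steps.
    """
    batch = start_batch
    tokens = 0
    steps = 0
    schedule = [(0, batch)]
    while tokens < total_tokens:
        # Grow the batch only while it stays at or below the assumed CBS.
        while batch * 2 <= max_batch and assumed_cbs(tokens) >= batch * 2:
            batch *= 2
            schedule.append((steps, batch))
        tokens += batch * seq_len
        steps += 1
    return schedule, steps


if __name__ == "__main__":
    budget = 2e10  # hypothetical token budget
    schedule, steps = warmup_schedule(budget)
    fixed_steps = math.ceil(budget / (256 * 2048))
    for step, batch in schedule:
        print(f"step {step}: batch size {batch}")
    print(f"gradient steps with warmup: {steps}, with fixed batch 256: {fixed_steps}")
```

Because the token budget is fixed, every doubling halves the number of remaining gradient steps, which is where the step savings reported in the abstract would come from. Whether loss is preserved depends on the real CBS, which this sketch does not measure.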
