Critical Batch Size Revisited: A Simple Empirical Approach to Large-Batch Language Model Training
- URL: http://arxiv.org/abs/2505.23971v3
- Date: Wed, 05 Nov 2025 22:50:55 GMT
- Title: Critical Batch Size Revisited: A Simple Empirical Approach to Large-Batch Language Model Training
- Authors: William Merrill, Shane Arora, Dirk Groeneveld, Hannaneh Hajishirzi
- Abstract summary: We show how a critical batch size (CBS) can be estimated based on the gradient noise scale during training. Our findings about how the CBS changes over training motivate batch size warmup, suggesting CBS from small training runs can inform larger-scale training runs.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The right batch size is important when training language models at scale: a large batch size is necessary for fast training, but a batch size that is too large will harm token efficiency. To navigate this tradeoff, McCandlish et al. (2018) suggest that a critical batch size (CBS), below which training will not substantially degrade loss, can be estimated based on the gradient noise scale during training. While their method has been adopted in practice, e.g., when training GPT-3, strong assumptions are required to justify gradient noise as a proxy for the CBS, which makes it unclear whether their approach should be trusted in practice, limiting its applicability. In this paper, we introduce a simple, empirical approach to directly measure the CBS and show how the CBS evolves over training. Applying our approach to the OLMo models, we find that CBS is near 0 at initialization, increases rapidly at first, and then plateaus as training progresses. Furthermore, we find that this trend holds across different model sizes (1B and 7B), suggesting CBS from small training runs can inform larger-scale training runs. Our findings about how the CBS changes over training motivate batch size warmup as a natural way to reliably train language models at large batch size: start the batch size small and increase it as the CBS grows. To validate this claim, we use batch size warmup to train OLMo 1B to slightly better loss than the original training run with 43% fewer gradient steps. This shows how our framework can be applied to reliably train language models at larger batch sizes, increasing data parallelism without compromising performance.
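The abstract contrasts two ways of getting at the CBS: the gradient noise scale proxy of McCandlish et al. (2018) and direct measurement paired with batch size warmup. As a minimal sketch (not the authors' implementation), the simple noise scale can be estimated from per-example gradients, and a warmup schedule can cap the batch size with a growing CBS estimate. The linear ramp and all constants below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def simple_noise_scale(per_example_grads: np.ndarray) -> float:
    """Simple noise scale B_noise = tr(Sigma) / |G|^2 from
    McCandlish et al. (2018), estimated from a (batch, params)
    array of per-example gradients."""
    mean_grad = per_example_grads.mean(axis=0)
    # tr(Sigma): per-example gradient variance summed over parameters
    tr_sigma = ((per_example_grads - mean_grad) ** 2).sum(axis=1).mean()
    return float(tr_sigma / (mean_grad ** 2).sum())

def warmup_batch_size(step: int, base: int = 32, target: int = 2048,
                      ramp_steps: int = 10_000) -> int:
    """Batch size warmup: follow a CBS estimate that starts near 0
    and grows to a plateau, never dropping below the base batch size.
    The linear ramp is a placeholder for a measured CBS curve."""
    cbs = int(target * min(step / ramp_steps, 1.0))
    return max(base, min(target, cbs))
```

In practice the ramp would be replaced by CBS values measured during a small pilot run, per the paper's finding that the CBS trend transfers across model sizes.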
Related papers
- Mapping Post-Training Forgetting in Language Models at Scale [21.32247361921916]
Scaled post-training now drives many of the largest capability gains in language models. We propose a sample-wise paradigm to measure what is forgotten and when backward transfer occurs. Our framework offers a practical yardstick for mapping how post-training alters pretrained knowledge at scale.
arXiv Detail & Related papers (2025-10-20T17:35:47Z)
- From Acceleration to Saturation: Scaling Behavior of Bootstrapped Language Model Pretraining [2.569647910019739]
We study the scaling behavior of bootstrapped pretraining and find that its scaling efficiency diminishes in a predictable manner. Our findings provide practical insights for efficient language model training and raise important considerations for the reuse of overtrained models.
arXiv Detail & Related papers (2025-10-08T00:59:33Z)
- Predictive Scaling Laws for Efficient GRPO Training of Large Reasoning Models [0.41942958779358663]
We propose a predictive framework that models training dynamics and helps optimize resource usage. We derive an empirical scaling law based on model size, initial performance, and training progress. We find that training beyond a certain number of epochs offers little gain, suggesting that earlier stopping can significantly reduce compute without sacrificing performance.
arXiv Detail & Related papers (2025-07-24T01:09:25Z)
- Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful [71.96579951744897]
Conventional wisdom dictates that small batch sizes make language model pretraining and fine-tuning unstable, motivating gradient accumulation. In this work, we revisit small batch sizes all the way down to batch size one, and we propose a rule for scaling Adam hyperparameters to small batch sizes.
arXiv Detail & Related papers (2025-07-09T17:57:36Z)
- Overtrained Language Models Are Harder to Fine-Tune [64.44743256512237]
Large language models are pre-trained on ever-growing token budgets. We show that extended pre-training can make models harder to fine-tune, leading to degraded final performance.
arXiv Detail & Related papers (2025-03-24T23:11:56Z)
- Data movement limits to frontier model training [0.7234862895932991]
We present a theoretical model of distributed training, and use it to analyze how far dense and sparse training runs can be scaled.
A training run exceeding about $10^{31}$ FLOP is infeasible even at low utilization.
arXiv Detail & Related papers (2024-11-02T04:48:41Z)
- Beyond Next Token Prediction: Patch-Level Training for Large Language Models [69.67438563485887]
We introduce patch-level training for Large Language Models (LLMs). During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch. We show that patch-level training can reduce the overall training costs to 0.5x without compromising the model performance.
arXiv Detail & Related papers (2024-07-17T15:48:39Z)
- Toward Inference-optimal Mixture-of-Expert Large Language Models [55.96674056805708]
We study the scaling law of MoE-based large language models (LLMs)
We find that MoEs with a few (4/8) experts are the most serving-efficient solution at the same performance level, but cost 2.5-3.5x more to train.
We propose to amend the scaling law of MoE by introducing inference efficiency as another metric besides the validation loss.
arXiv Detail & Related papers (2024-04-03T16:33:42Z)
- Early Weight Averaging meets High Learning Rates for LLM Pre-training [20.671831210738937]
We show that models trained with high learning rates observe higher gains due to checkpoint averaging.
Our training recipe outperforms conventional training and popular checkpoint averaging baselines.
arXiv Detail & Related papers (2023-06-05T20:51:44Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE by reusing models of about half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
- EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets [106.79387235014379]
EarlyBERT is a general computationally-efficient training algorithm applicable to both pre-training and fine-tuning of large-scale language models.
We are the first to identify structured winning tickets in the early stage of BERT training, and use them for efficient training.
EarlyBERT easily achieves comparable performance to standard BERT with 35-45% less training time.
arXiv Detail & Related papers (2020-12-31T20:38:20Z)