Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful
- URL: http://arxiv.org/abs/2507.07101v1
- Date: Wed, 09 Jul 2025 17:57:36 GMT
- Title: Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful
- Authors: Martin Marek, Sanae Lotfi, Aditya Somasundaram, Andrew Gordon Wilson, Micah Goldblum
- Abstract summary: Conventional wisdom dictates that small batch sizes make language model pretraining and fine-tuning unstable, motivating gradient accumulation. In this work, we revisit small batch sizes all the way down to batch size one, and we propose a rule for scaling Adam hyperparameters to small batch sizes.
- Score: 71.96579951744897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conventional wisdom dictates that small batch sizes make language model pretraining and fine-tuning unstable, motivating gradient accumulation, which trades off the number of optimizer steps for a proportional increase in batch size. While it is common to decrease the learning rate for smaller batch sizes, other hyperparameters are often held fixed. In this work, we revisit small batch sizes all the way down to batch size one, and we propose a rule for scaling Adam hyperparameters to small batch sizes. We find that small batch sizes (1) train stably, (2) are consistently more robust to hyperparameter choices, (3) achieve equal or better per-FLOP performance than larger batch sizes, and (4) notably enable stable language model training with vanilla SGD, even without momentum, despite storing no optimizer state. Building on these results, we provide practical recommendations for selecting a batch size and setting optimizer hyperparameters. We further recommend against gradient accumulation unless training on multiple devices with multiple model replicas, bottlenecked by inter-device bandwidth.
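The abstract states only that Adam's hyperparameters should be rescaled when the batch size shrinks, without spelling out the rule. As a minimal sketch, assuming the rule amounts to holding the half-life of Adam's exponential moving averages constant when measured in tokens (samples) rather than in optimizer steps, the snippet below shows how the beta coefficients would move toward 1 as the batch size drops. The function name `scale_beta` and the reference batch size of 512 are illustrative choices, not taken from the paper.

```python
def scale_beta(beta_ref: float, batch_ref: int, batch_new: int) -> float:
    """Rescale an Adam EMA coefficient (beta1 or beta2) for a new batch size.

    Assumption (not given in the abstract above): the half-life of the
    exponential moving average, measured in samples/tokens rather than in
    optimizer steps, is held constant. With batch size B, one step consumes
    B samples, so the half-life in samples is B * log(0.5) / log(beta);
    equating it across batch sizes gives
    beta_new = beta_ref ** (batch_new / batch_ref).
    """
    return beta_ref ** (batch_new / batch_ref)


# Illustrative only: betas tuned at a reference batch size of 512, reused at
# progressively smaller batch sizes down to 1.
beta1_ref, beta2_ref, batch_ref = 0.9, 0.999, 512
for batch_new in (512, 64, 8, 1):
    b1 = scale_beta(beta1_ref, batch_ref, batch_new)
    b2 = scale_beta(beta2_ref, batch_ref, batch_new)
    print(f"batch={batch_new:>3}  beta1={b1:.6f}  beta2={b2:.8f}")
```

Under this reading, gradient accumulation at a small per-device batch size simply reproduces the large-batch regime while spending memory bandwidth on accumulation buffers, which is consistent with the authors' recommendation to avoid it unless multiple model replicas are bottlenecked by inter-device bandwidth.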
Related papers
- APOLLO: SGD-like Memory, AdamW-level Performance [61.53444035835778]
Large language models (LLMs) are notoriously memory-intensive during training. Various memory-efficient optimizers have been proposed to reduce memory usage. However, they face critical challenges: (i) costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial memory overhead to maintain competitive performance.
arXiv Detail & Related papers (2024-12-06T18:55:34Z) - Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler [34.416299887009195]
We study the correlation between optimal learning rate, batch size, and number of training tokens for the recently proposed WSD scheduler.
We propose a new learning rate scheduler, Power scheduler, that is agnostic about the number of training tokens and batch size.
Our 3B dense and MoE models trained with the Power scheduler achieve performance comparable to state-of-the-art small language models.
arXiv Detail & Related papers (2024-08-23T20:22:20Z) - Mini-batch Coresets for Memory-efficient Language Model Training on Data Mixtures [41.59038171479036]
Training with larger mini-batches becomes prohibitive for Large Language Models (LLMs). We propose Coresets for Training LLMs (CoLM). CoLM reduces the memory requirement of fine-tuning by 2x and even outperforms training with 4x larger mini-batches.
arXiv Detail & Related papers (2024-07-28T20:39:16Z) - Scaling Sparse Fine-Tuning to Large Language Models [67.59697720719672]
Large Language Models (LLMs) are difficult to fully fine-tune due to their sheer number of parameters.
We propose SpIEL, a novel sparse fine-tuning method which maintains an array of parameter indices and the deltas of these parameters relative to their pretrained values.
We show that SpIEL is superior to popular parameter-efficient fine-tuning methods like LoRA in terms of performance and comparable in terms of run time.
arXiv Detail & Related papers (2024-01-29T18:43:49Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Enabling Large Batch Size Training for DNN Models Beyond the Memory Limit While Maintaining Performance [0.22499166814992438]
Recent deep learning models are difficult to train using a large batch size.
Machines may not have enough memory to accommodate both the model and a large data batch size.
This paper proposes a method called Micro-Batch Processing (MBP) to address this problem.
arXiv Detail & Related papers (2021-10-24T16:38:05Z) - CPM-2: Large-scale Cost-effective Pre-trained Language Models [71.59893315671997]
We present a suite of cost-effective techniques for using PLMs that address the efficiency issues of pre-training, fine-tuning, and inference.
We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch.
We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources.
arXiv Detail & Related papers (2021-06-20T15:43:54Z) - Concurrent Adversarial Learning for Large-Batch Training [83.55868483681748]
Adversarial learning is a natural choice for smoothing the decision surface and biasing towards a flat region.
We propose a novel Concurrent Adversarial Learning (ConAdv) method that decouples the sequential gradient computations in adversarial learning by utilizing stale parameters.
This is the first work to successfully scale the ResNet-50 training batch size to 96K.
arXiv Detail & Related papers (2021-06-01T04:26:02Z) - AdaScale SGD: A User-Friendly Algorithm for Distributed Training [29.430153773234363]
We propose AdaScale SGD, an algorithm that reliably adapts learning rates to large-batch training.
By continually adapting to the gradient's variance, AdaScale achieves speed-ups for a wide range of batch sizes.
This includes large-batch training with no model degradation for machine translation, image classification, object detection, and speech recognition tasks.
arXiv Detail & Related papers (2020-07-09T23:26:13Z) - Stagewise Enlargement of Batch Size for SGD-based Learning [20.212176652894495]
Existing research shows that the batch size can seriously affect the performance of stochastic gradient descent (SGD) based learning.
We propose a novel method, called stagewise enlargement of batch size (SEBS), to set a proper batch size for SGD.
arXiv Detail & Related papers (2020-02-26T16:40:31Z)