Seesaw: Accelerating Training by Balancing Learning Rate and Batch Size Scheduling
- URL: http://arxiv.org/abs/2510.14717v1
- Date: Thu, 16 Oct 2025 14:17:38 GMT
- Title: Seesaw: Accelerating Training by Balancing Learning Rate and Batch Size Scheduling
- Authors: Alexandru Meterez, Depen Morwani, Jingfeng Wu, Costin-Andrei Oncescu, Cengiz Pehlevan, Sham Kakade
- Abstract summary: Increasing the batch size during training is a promising strategy to accelerate large language model pretraining. This work develops a principled framework for batch-size scheduling and introduces Seesaw: whenever a standard scheduler would halve the learning rate, Seesaw instead multiplies it by $1/\sqrt{2}$ and doubles the batch size.
- Score: 75.36692892951018
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Increasing the batch size during training -- a "batch ramp" -- is a promising strategy to accelerate large language model pretraining. While for SGD, doubling the batch size can be equivalent to halving the learning rate, the optimal strategy for adaptive optimizers like Adam is less clear. As a result, any batch-ramp scheduling, if used at all, is typically tuned heuristically. This work develops a principled framework for batch-size scheduling and introduces Seesaw: whenever a standard scheduler would halve the learning rate, Seesaw instead multiplies it by $1/\sqrt{2}$ and doubles the batch size, preserving loss dynamics while reducing serial steps. Theoretically, we provide, to our knowledge, the first finite-sample proof of equivalence between learning-rate decay and batch-size ramp-up for SGD on noisy linear regression, and we extend this equivalence to normalized SGD, a tractable proxy for Adam, under a variance-dominated regime observed in practice. Empirically, on 150M/300M/600M-parameter models trained at Chinchilla scale using a constant (critical) batch size, Seesaw matches cosine decay at equal FLOPs while reducing wall-clock time by $\approx 36\%$, approaching the theoretical limit implied by our analysis.
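The decay rule is mechanical enough to sketch directly. Below is a minimal illustration in Python; the `seesaw_schedule` helper and its starting values are hypothetical, not taken from the paper's code:

```python
import math

def seesaw_schedule(base_lr, base_batch_size, num_halvings):
    """Sketch of the Seesaw rule: at each point where a standard scheduler
    would halve the learning rate, multiply it by 1/sqrt(2) and double the
    batch size instead."""
    lr, batch_size = base_lr, base_batch_size
    points = [(lr, batch_size)]
    for _ in range(num_halvings):
        lr /= math.sqrt(2)   # instead of lr /= 2
        batch_size *= 2      # compensate with a larger batch
        points.append((lr, batch_size))
    return points

# Three decay events starting from lr=3e-4, batch=256:
# [(3.0e-4, 256), (2.1e-4, 512), (1.5e-4, 1024), (1.1e-4, 2048)]
print(seesaw_schedule(3e-4, 256, 3))
```

Because each event doubles the tokens processed per step, the same token budget takes fewer serial steps after every decay event, which is where the reported wall-clock savings come from.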
Related papers
- Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averaging [70.05077723711618]
We show that the loss of language models trained at 1-32x Chinchilla scale decays with time, with the decay rate determined by the source and capacity conditions of the problem. Our results suggest that weight averaging combined with simple, horizon-free step sizes offers a practical and effective anytime alternative to cosine learning rate schedules for large language model pretraining.
arXiv Detail & Related papers (2026-02-03T16:24:05Z)
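As a rough illustration of the ingredients named above, the sketch below maintains a uniform running average of the weights alongside a horizon-free step size; both the averaging scheme and the 1/sqrt(t) form are assumptions for illustration, not necessarily the paper's exact choices:

```python
import torch

@torch.no_grad()
def update_weight_average(avg_params, params, num_updates):
    """Uniform running average of model weights (one common averaging scheme).
    `num_updates` is how many parameter snapshots the average already covers."""
    for avg, p in zip(avg_params, params):
        avg.mul_(num_updates / (num_updates + 1)).add_(p, alpha=1.0 / (num_updates + 1))

def horizon_free_lr(base_lr, step):
    """A step size that never references the total training horizon."""
    return base_lr / (step + 1) ** 0.5
```

The averaged weights, not the live ones, would be used for evaluation, which is what makes the schedule "anytime".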
Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful [69.57125049281993]
This work revisits small batch sizes all the way down to batch size one. We find that small batch sizes train stably and achieve equal or better per-FLOP performance than larger batch sizes.
arXiv Detail & Related papers (2025-07-09T17:57:36Z)
Training Long-Context LLMs Efficiently via Chunk-wise Optimization [60.05884946552877]
We present *Sequential Chunk-wise Optimization* (SeCO), a memory-efficient training paradigm that partitions lengthy inputs into manageable chunks. We also introduce *Sparse Chunk-wise Optimization* (SpaCO), which reduces computational overhead by selectively propagating gradients to specific chunks. SpaCO decouples the computational cost of backpropagation from the context length, enabling training time to gradually converge to inference time as sequences become longer.
arXiv Detail & Related papers (2025-05-22T14:11:34Z)
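A minimal sketch of the chunk-wise idea described above, assuming `model(chunk)` returns a scalar loss; the real SeCO also threads cached attention states between chunks, which is omitted here:

```python
import torch

def chunkwise_step(model, tokens, chunk_len, optimizer):
    """Process one long sequence chunk by chunk, backpropagating per chunk so
    the full sequence's activations are never held in memory at once."""
    optimizer.zero_grad()
    chunks = tokens.split(chunk_len, dim=1)
    for chunk in chunks:
        loss = model(chunk) / len(chunks)  # average the per-chunk losses
        loss.backward()                    # frees this chunk's activation memory
    optimizer.step()
```

SpaCO, per the summary above, would additionally run the backward pass only for a selected subset of chunks.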
The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws [51.608402959163925]
We present the first systematic exploration of optimal sparse pre-training configurations for large language models. We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss. We propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training.
arXiv Detail & Related papers (2025-01-21T20:23:22Z)
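To make the "average parameter count" notion concrete, here is a sketch under an assumed schedule: dense until 25% of training, linearly pruned to the target size until 75%, then constant (the paper's exact pruning trajectory may differ):

```python
def average_param_count(n_dense, n_sparse, steps, start=0.25, end=0.75):
    """Time-averaged parameter count over pre-training for a linear
    prune-in-the-middle schedule."""
    total = 0.0
    for t in range(steps):
        frac = t / steps
        if frac < start:
            n = n_dense
        elif frac < end:
            n = n_dense + (frac - start) / (end - start) * (n_sparse - n_dense)
        else:
            n = n_sparse
        total += n
    return total / steps

# Per the summary, this average would replace N in a Chinchilla-style law:
# L(N_avg, D) = E + A / N_avg**alpha + B / D**beta
```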
Iteration and Stochastic First-order Oracle Complexities of Stochastic Gradient Descent using Constant and Decaying Learning Rates [0.8158530638728501]
We show that the performance of stochastic gradient descent (SGD) depends on not only the learning rate but also the batch size.
We show that measured critical batch sizes are close to the sizes estimated from our theoretical results.
arXiv Detail & Related papers (2024-02-23T14:24:45Z)
Optimal Linear Decay Learning Rate Schedules and Further Refinements [46.79573408189601]
Learning rate schedules used in practice bear little resemblance to those recommended by theory.
We close much of this theory/practice gap, and as a consequence are able to derive new problem-adaptive learning rate schedules.
arXiv Detail & Related papers (2023-10-11T19:16:35Z)
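The baseline schedule family at issue is simple to state; a sketch is below, with the paper's refined, problem-adaptive schedules going beyond it:

```python
def linear_decay_lr(base_lr, step, total_steps):
    """Linear decay to zero over the training horizon."""
    return base_lr * (1.0 - step / total_steps)
```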
Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size [58.762959061522736]
We show that scaling mini-batch sizes with appropriate learning rate adjustments can speed up the training process by orders of magnitude.
We show that scaling the mini-batch size and naively adjusting the learning rate allows for (1) a reduced size of the Q-ensemble, (2) stronger penalization of out-of-distribution actions, and (3) improved convergence time.
arXiv Detail & Related papers (2022-11-20T21:48:25Z)
Existence and Estimation of Critical Batch Size for Training Generative Adversarial Networks with Two Time-Scale Update Rule [0.2741266294612775]
Previous results have shown that a two time-scale update rule (TTUR) using different learning rates is useful for training generative adversarial networks (GANs) in theory and in practice.
This paper studies the relationship between batch size and the number of steps needed for training GANs with TTURs based on constant learning rates.
arXiv Detail & Related papers (2022-01-28T08:52:01Z)
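A TTUR setup in its simplest form is just two optimizers with different constant learning rates, as sketched below; the 1:4 ratio and the use of Adam are illustrative choices, not values from this paper:

```python
import torch

def make_ttur_optimizers(generator, discriminator):
    """Two time-scale update rule: a slower learning rate for the generator
    and a faster one for the discriminator."""
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=4e-4)
    return opt_g, opt_d
```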
Automated Learning Rate Scheduler for Large-batch Training [24.20872850681828]
Large-batch training has been essential in leveraging large-scale datasets and models in deep learning.
It often requires a specially designed learning rate (LR) schedule to achieve performance comparable to smaller-batch training.
We propose an automated LR scheduling algorithm that is effective for neural network training with a large batch size under a given epoch budget.
arXiv Detail & Related papers (2021-07-13T05:23:13Z)
AdaScale SGD: A User-Friendly Algorithm for Distributed Training [29.430153773234363]
We propose AdaScale SGD, an algorithm that reliably adapts learning rates to large-batch training.
By continually adapting to the gradient's variance, AdaScale achieves speed-ups for a wide range of batch sizes.
This includes large-batch training with no model degradation for machine translation, image classification, object detection, and speech recognition tasks.
arXiv Detail & Related papers (2020-07-09T23:26:13Z)
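The variance adaptation mentioned above can be sketched via AdaScale's "gain" ratio; the estimator below is a simplified assumption (the paper maintains running averages computed across workers):

```python
def adascale_gain(sq_norm_mean, mean_sq_norm, scale, eps=1e-12):
    """Gain r in [1, scale]: how much more progress a scale-way averaged batch
    makes per step than a single batch, given the gradient noise level.

    sq_norm_mean: estimate of E[||g_i||^2] for one worker's gradient
    mean_sq_norm: estimate of ||E[g_i]||^2 (the true gradient's squared norm)
    """
    variance = max(sq_norm_mean - mean_sq_norm, 0.0)
    return (variance + mean_sq_norm) / (variance / scale + mean_sq_norm + eps)

# The scaled run then uses r * base_lr at each step and advances the
# underlying single-batch schedule by r steps rather than 1.
```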