Fast Catch-Up, Late Switching: Optimal Batch Size Scheduling via Functional Scaling Laws
- URL: http://arxiv.org/abs/2602.14208v2
- Date: Mon, 23 Feb 2026 05:35:42 GMT
- Title: Fast Catch-Up, Late Switching: Optimal Batch Size Scheduling via Functional Scaling Laws
- Authors: Jinbo Wang, Binghui Li, Zhanpeng Zhou, Mingze Wang, Yuxuan Sun, Jiaqi Zhang, Xunliang Cai, Lei Wu
- Abstract summary: Batch size scheduling (BSS) plays a critical role in large-scale deep learning training. We show that the functional scaling law framework introduced in Li et al. (2025a) provides a principled lens for analyzing BSS.
- Score: 37.651943549758634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Batch size scheduling (BSS) plays a critical role in large-scale deep learning training, influencing both optimization dynamics and computational efficiency. Yet, its theoretical foundations remain poorly understood. In this work, we show that the functional scaling law (FSL) framework introduced in Li et al. (2025a) provides a principled lens for analyzing BSS. Specifically, we characterize the optimal BSS under a fixed data budget and show that its structure depends sharply on task difficulty. For easy tasks, optimal schedules keep increasing batch size throughout. In contrast, for hard tasks, the optimal schedule maintains small batch sizes for most of training and switches to large batches only in a late stage. To explain the emergence of late switching, we uncover a dynamical mechanism -- the fast catch-up effect -- which also manifests in large language model (LLM) pretraining. After switching from small to large batches, the loss rapidly aligns with the constant large-batch trajectory. Using FSL, we show that this effect stems from rapid forgetting of accumulated gradient noise, with the catch-up speed determined by task difficulty. Crucially, this effect implies that large batches can be safely deferred to late training without sacrificing performance, while substantially reducing data consumption. Finally, extensive LLM pretraining experiments -- covering both Dense and MoE architectures with up to 1.1B parameters and 1T tokens -- validate our theoretical predictions. Across all settings, late-switch schedules consistently outperform constant-batch and early-switch baselines.
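The late-switch schedule described in the abstract admits a simple illustration: keep the batch size small for most of a fixed token budget and switch to a large batch only near the end. The sketch below is a minimal toy version, not the paper's implementation; the function name, batch sizes, and switch fraction are illustrative assumptions.

```python
# Illustrative late-switch batch size schedule (not the paper's implementation).
# Assumptions: a fixed data budget in tokens, one small and one large batch size,
# and a switch point expressed as a fraction of that budget.

def late_switch_batch_size(tokens_seen: int,
                           total_tokens: int,
                           small_bs: int = 256,
                           large_bs: int = 2048,
                           switch_frac: float = 0.8) -> int:
    """Return the batch size to use at the current point in training.

    Keeps the batch small for the first `switch_frac` of the data budget,
    then switches to the large batch for the remainder (late switching).
    """
    if tokens_seen < switch_frac * total_tokens:
        return small_bs
    return large_bs


if __name__ == "__main__":
    total_tokens = 1_000_000
    tokens_seen = 0
    while tokens_seen < total_tokens:
        bs = late_switch_batch_size(tokens_seen, total_tokens)
        # ... one optimization step on a batch of size `bs` would go here ...
        tokens_seen += bs
```

Per the abstract, the appeal of such a schedule is that the fast catch-up effect lets the large-batch phase be deferred without sacrificing final loss, while the long small-batch phase reduces total data consumption.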
Related papers
- How to Set the Batch Size for Large-Scale Pre-training? [46.58311647781476]
This paper proposes a revised E(S) relationship tailored for the Warmup-Stable-Decay (WSD) learning rate scheduler. Our theoretical analysis reveals two fundamental properties of WSD-based pre-training: 1) B_min, the minimum batch size threshold required to achieve a target loss, and 2) B_opt, the optimal batch size that maximizes data efficiency by minimizing total tokens.
arXiv Detail & Related papers (2026-01-08T15:43:31Z) - Dynamic Learning Rate Scheduling based on Loss Changes Leads to Faster Convergence [2.1665689529884697]
GreedyLR is a novel scheduler that adaptively adjusts the learning rate during training based on the current loss. Our approach outperforms several state-of-the-art schedulers in terms of accuracy, speed, and convergence.
arXiv Detail & Related papers (2025-12-16T16:03:52Z) - Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales [55.91454326946738]
We study how the optimal learning rate and weight decay should scale with model width and depth for a wide range of languages. We find that scaling the learning rate according to $\mu$P improves transfer, but can still suffer from significant finite-width deviations. For compute-optimal scaling, we find that scaling independent weight decay as $1/\mathrm{width}$ is nearly optimal across languages.
arXiv Detail & Related papers (2025-12-05T11:03:41Z) - Seesaw: Accelerating Training by Balancing Learning Rate and Batch Size Scheduling [75.36692892951018]
Increasing the batch size during training is a promising strategy to accelerate large language model pretraining. This work develops a principled framework for batch-size scheduling. It introduces Seesaw: whenever a standard scheduler would halve the learning rate, Seesaw instead multiplies it by $1/\sqrt{2}$ and doubles the batch size (see the sketch after this list).
arXiv Detail & Related papers (2025-10-16T14:17:38Z) - PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
arXiv Detail & Related papers (2025-09-27T03:01:48Z) - Optimal Growth Schedules for Batch Size and Learning Rate in SGD that Reduce SFO Complexity [0.6906005491572401]
Poorly chosen batch-size and learning-rate schedules in stochastic gradient methods can degrade efficiency and compromise convergence. We theoretically derive optimal growth schedules for the batch size and learning rate that reduce stochastic first-order oracle (SFO) complexity. Our results offer both theoretical insights and practical guidelines for scalable and efficient large-batch training in deep learning.
arXiv Detail & Related papers (2025-08-07T11:52:25Z) - RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE). RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z) - Iteration and Stochastic First-order Oracle Complexities of Stochastic Gradient Descent using Constant and Decaying Learning Rates [0.8158530638728501]
We show that the performance of stochastic gradient descent (SGD) depends not only on the learning rate but also on the batch size.
We show that measured critical batch sizes are close to the sizes estimated from our theoretical results.
arXiv Detail & Related papers (2024-02-23T14:24:45Z) - Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT).
We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
arXiv Detail & Related papers (2023-12-19T06:06:30Z) - Automated Learning Rate Scheduler for Large-batch Training [24.20872850681828]
Large-batch training has been essential in leveraging large-scale datasets and models in deep learning.
It often requires a specially designed learning rate (LR) schedule to achieve performance comparable to small-batch training.
We propose an automated LR scheduling algorithm which is effective for neural network training with a large batch size under the given epoch budget.
arXiv Detail & Related papers (2021-07-13T05:23:13Z)
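As noted in the Seesaw entry above, the rule is simple enough to sketch: wherever a baseline scheduler would halve the learning rate, scale it by $1/\sqrt{2}$ instead and double the batch size. The snippet below is a minimal illustration under an assumed step-decay baseline, not the authors' implementation; the names and hyperparameters are placeholders.

```python
import math

# Minimal illustration of the Seesaw rule from the entry above (not the authors' code).
# Assumed baseline: a step-decay schedule that would halve the learning rate at `decay_steps`.
# Seesaw: at each such step, multiply the learning rate by 1/sqrt(2) and double the batch size.

def seesaw_schedule(base_lr: float, base_bs: int, decay_steps: set[int], total_steps: int):
    """Yield (step, learning_rate, batch_size) under the Seesaw rule."""
    lr, bs = base_lr, base_bs
    for step in range(total_steps):
        if step in decay_steps:
            lr /= math.sqrt(2.0)  # instead of halving the learning rate ...
            bs *= 2               # ... double the batch size
        yield step, lr, bs


if __name__ == "__main__":
    for step, lr, bs in seesaw_schedule(3e-4, 256, decay_steps={1000, 2000}, total_steps=3000):
        if step in (0, 1000, 2000):
            print(f"step={step}  lr={lr:.2e}  batch_size={bs}")
```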