Optimal Growth Schedules for Batch Size and Learning Rate in SGD that Reduce SFO Complexity
- URL: http://arxiv.org/abs/2508.05297v1
- Date: Thu, 07 Aug 2025 11:52:25 GMT
- Title: Optimal Growth Schedules for Batch Size and Learning Rate in SGD that Reduce SFO Complexity
- Authors: Hikaru Umeda, Hideaki Iiduka
- Abstract summary: Naive batch-size and learning-rate scheduling in stochastic gradient methods can degrade optimization efficiency and compromise generalization.
We theoretically derived optimal growth schedules for the batch size and learning rate that reduce SFO complexity.
Our results offer both theoretical insights and practical guidelines for scalable and efficient large-batch training in deep learning.
- Score: 0.6906005491572401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The unprecedented growth of deep learning models has enabled remarkable advances but introduced substantial computational bottlenecks. A key factor contributing to training efficiency is batch-size and learning-rate scheduling in stochastic gradient methods. However, naive scheduling of these hyperparameters can degrade optimization efficiency and compromise generalization. Motivated by recent theoretical insights, we investigated how the batch size and learning rate should be increased during training to balance efficiency and convergence. We analyzed this problem on the basis of stochastic first-order oracle (SFO) complexity, defined as the expected number of gradient evaluations needed to reach an $\epsilon$-approximate stationary point of the empirical loss. We theoretically derived optimal growth schedules for the batch size and learning rate that reduce SFO complexity and validated them through extensive experiments. Our results offer both theoretical insights and practical guidelines for scalable and efficient large-batch training in deep learning.
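The abstract does not reproduce the functional form of the derived schedules, so the snippet below is only a minimal Python sketch of the general idea: training proceeds in stages, and both the batch size and the learning rate are grown multiplicatively from stage to stage. The function name `growth_schedules` and the growth factors `batch_growth` and `lr_growth` are illustrative placeholders, not the values derived in the paper.

```python
import math

def growth_schedules(num_stages, b0=32, lr0=0.01,
                     batch_growth=2.0, lr_growth=math.sqrt(2.0)):
    """Illustrative stage-wise growth schedules for the batch size and
    learning rate. The growth factors are placeholders, not the rates
    derived in the paper."""
    batch_sizes, learning_rates = [], []
    b, lr = b0, lr0
    for _ in range(num_stages):
        batch_sizes.append(int(b))
        learning_rates.append(lr)
        b *= batch_growth   # enlarge the batch each stage
        lr *= lr_growth     # raise the learning rate more slowly
    return batch_sizes, learning_rates

if __name__ == "__main__":
    bs, lrs = growth_schedules(num_stages=5)
    for stage, (b, lr) in enumerate(zip(bs, lrs)):
        print(f"stage {stage}: batch_size={b}, lr={lr:.4f}")
```

In a real training loop the stage-wise values would be wired into the data loader and optimizer; the point here is only the multiplicative, jointly increasing schedule.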
Related papers
- Adaptive Batch Size and Learning Rate Scheduler for Stochastic Gradient Descent Based on Minimization of Stochastic First-order Oracle Complexity [0.6906005491572401]
The convergence behavior of mini-batch stochastic gradient descent (SGD) is highly sensitive to the batch-size and learning-rate settings.
Recent theoretical studies have identified the existence of a critical batch size that minimizes stochastic first-order oracle (SFO) complexity.
An adaptive scheduling strategy that leverages these theoretical findings on the critical batch size is introduced to accelerate SGD (a toy numerical sketch of the critical-batch-size idea appears after this list).
arXiv Detail & Related papers (2025-08-07T12:00:53Z) - The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws [51.608402959163925]
We present the first systematic exploration of optimal sparse pre-training configurations for large language models.
We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss.
We propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training.
arXiv Detail & Related papers (2025-01-21T20:23:22Z) - Sparse Mixture-of-Experts for Compositional Generalization: Empirical Evidence and Theoretical Foundations of Optimal Sparsity [89.81738321188391]
This study investigates the relationship between task complexity and optimal sparsity in SMoE models.
We show that the optimal sparsity lies between minimal activation (1-2 experts) and full activation, with the exact number scaling proportionally to task complexity.
arXiv Detail & Related papers (2024-10-17T18:40:48Z) - Critical Batch Size Minimizes Stochastic First-Order Oracle Complexity of
Deep Learning Optimizer using Hyperparameters Close to One [0.0]
We show that deep learning optimizers using small constant learning rates, hyperparameters close to one, and large batch sizes can find the model parameters of deep neural networks that minimize the loss functions.
Results indicate that Adam using a small constant learning rate, hyperparameters close to one, and the critical batch size minimizing SFO complexity converges faster than Momentum and stochastic gradient descent (SGD).
arXiv Detail & Related papers (2022-08-21T06:11:23Z) - Hyper-Learning for Gradient-Based Batch Size Adaptation [2.944323057176686]
Scheduling the batch size to increase during training is an effective strategy for controlling gradient noise when training deep neural networks.
We introduce Arbiter, a new hyper-optimization algorithm that performs batch-size adaptation for learnable scheduling.
We demonstrate Arbiter's effectiveness in several illustrative experiments.
arXiv Detail & Related papers (2022-05-17T11:01:14Z) - Critical Parameters for Scalable Distributed Learning with Large Batches
and Asynchronous Updates [67.19481956584465]
It has been experimentally observed that the efficiency of distributed training with stochastic gradient descent (SGD) depends decisively on the batch size and, in asynchronous implementations, on the staleness.
We show that our results are tight and illustrate key findings in numerical experiments.
arXiv Detail & Related papers (2021-03-03T12:08:23Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Adaptive Learning of the Optimal Batch Size of SGD [52.50880550357175]
We propose a method capable of learning the optimal batch size adaptively throughout its iterations for strongly convex and smooth functions.
Our method does this provably, and in our experiments with synthetic and real data robustly exhibits nearly optimal behaviour.
We generalize our method to several new batch strategies not considered in the literature before, including a sampling scheme suitable for distributed implementations.
arXiv Detail & Related papers (2020-05-03T14:28:32Z) - Large Batch Training Does Not Need Warmup [111.07680619360528]
Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications.
In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training.
Based on our analysis, we bridge the gap and illustrate the theoretical insights for three popular large-batch training techniques.
arXiv Detail & Related papers (2020-02-04T23:03:12Z)
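Several of the entries above refer to a critical batch size that minimizes SFO complexity. The sketch below is a toy numerical illustration only, under an assumed bound of the form C1/K + C2/b on the squared gradient norm after K steps with batch size b; the constants and the resulting closed form are assumptions for illustration, not results taken from these papers.

```python
import numpy as np

# Toy model (an assumption for illustration, not a result from the papers
# above): suppose that after K steps of SGD with batch size b the expected
# squared gradient norm is bounded by C1/K + C2/b. Reaching an
# epsilon-approximate stationary point then takes roughly
# K(b) = C1 / (eps**2 - C2/b) steps (valid for b > C2/eps**2), and the
# SFO complexity, i.e. the number of gradient evaluations, is N(b) = b*K(b).
C1, C2, eps = 1.0, 0.05, 0.1

def sfo_complexity(b):
    return b * C1 / (eps**2 - C2 / b)

# Under this toy model N(b) has an interior minimizer, the "critical"
# batch size b* = 2*C2/eps**2: smaller batches need too many steps,
# larger batches waste gradient evaluations per step.
b_star = 2 * C2 / eps**2
grid = np.linspace(1.2 * C2 / eps**2, 4 * C2 / eps**2, 200)
print(f"critical batch size (closed form): {b_star:.1f}")
print(f"argmin over grid:                  {grid[np.argmin(sfo_complexity(grid))]:.1f}")
```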