Accelerating SGDM via Learning Rate and Batch Size Schedules: A Lyapunov-Based Analysis
- URL: http://arxiv.org/abs/2508.03105v2
- Date: Fri, 10 Oct 2025 04:04:37 GMT
- Title: Accelerating SGDM via Learning Rate and Batch Size Schedules: A Lyapunov-Based Analysis
- Authors: Yuichi Kondo, Hideaki Iiduka,
- Abstract summary: We analyze the convergence behavior of gradient descent with momentum (SGDM) under dynamic learning-rate and batch-size schedules.<n>We extend the existing theoretical framework to cover three practical scheduling strategies commonly used in deep learning.<n>Our results reveal a clear hierarchy in convergence: a constant batch size does not guarantee convergence of the expected norm, whereas an increasing batch size does, and simultaneously increasing both the batch size and learning rate achieves a provably faster decay.
- Score: 7.2620484413601325
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We analyze the convergence behavior of stochastic gradient descent with momentum (SGDM) under dynamic learning-rate and batch-size schedules by introducing a novel and simpler Lyapunov function. We extend the existing theoretical framework to cover three practical scheduling strategies commonly used in deep learning: a constant batch size with a decaying learning rate, an increasing batch size with a decaying learning rate, and an increasing batch size with an increasing learning rate. Our results reveal a clear hierarchy in convergence: a constant batch size does not guarantee convergence of the expected gradient norm, whereas an increasing batch size does, and simultaneously increasing both the batch size and learning rate achieves a provably faster decay. Empirical results validate our theory, showing that dynamically scheduled SGDM significantly outperforms its fixed-hyperparameter counterpart in convergence speed. We also evaluated a warm-up schedule in experiments, which empirically outperformed all other strategies in convergence behavior.
Related papers
- Seesaw: Accelerating Training by Balancing Learning Rate and Batch Size Scheduling [75.36692892951018]
Increasing the batch size during training is a promising strategy to accelerate large language model pretraining.<n>This work develops a principled framework for batch-size scheduling.<n>It introduces Seesaw: whenever a standard scheduler would halve the learning rate, Seesaw instead multiplies it by $1/sqrt2$ and doubles the batch size.
arXiv Detail & Related papers (2025-10-16T14:17:38Z) - CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs [53.749193998004166]
Curriculum learning plays a crucial role in enhancing the training efficiency of large language models.<n>We propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead.
arXiv Detail & Related papers (2025-10-01T15:41:27Z) - Adaptive Batch Size and Learning Rate Scheduler for Stochastic Gradient Descent Based on Minimization of Stochastic First-order Oracle Complexity [0.6906005491572401]
The convergence behavior of mini-batch gradient descent (SGD) is highly sensitive to the batch size and learning rate settings.<n>Recent theoretical studies have identified the existence of a critical batch size that minimizes first-order oracle complexity.<n>An adaptive scheduling strategy is introduced to accelerate SGD that leverages theoretical findings on the critical batch size.
arXiv Detail & Related papers (2025-08-07T12:00:53Z) - Optimal Growth Schedules for Batch Size and Learning Rate in SGD that Reduce SFO Complexity [0.6906005491572401]
Batch-size and learning-rate scheduling in computational gradient methods can degrade efficiency and compromise convergence.<n>We theoretically derived optimal growth schedules for the batch size and learning rate that reduce SFO complexity.<n>Our results offer both theoretical insights and practical guidelines for scalable and efficient large-batch training in deep learning.
arXiv Detail & Related papers (2025-08-07T11:52:25Z) - WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training [64.0932926819307]
We present Warmup-Stable and Merge (WSM), a framework that establishes a formal connection between learning rate decay and model merging.<n>WSM provides a unified theoretical foundation for emulating various decay strategies.<n>Our framework consistently outperforms the widely-adopted Warmup-Stable-Decay (WSD) approach across multiple benchmarks.
arXiv Detail & Related papers (2025-07-23T16:02:06Z) - Both Asymptotic and Non-Asymptotic Convergence of Quasi-Hyperbolic Momentum using Increasing Batch Size [7.2620484413601325]
Momentum methods were originally introduced for their superiority to gradientbatch descent (SGD) in deterministic settings with convex functions.<n>We show that achieving convergence requires either a decaying learning rate or an increasing batch size.
arXiv Detail & Related papers (2025-06-30T06:31:30Z) - Similarity Matching Networks: Hebbian Learning and Convergence Over Multiple Time Scales [5.093257685701887]
We consider and analyze the emphsimilarity matching network for principal subspace projection.<n>By leveraging a multilevel optimization framework, we prove convergence of the dynamics in the offline setting.
arXiv Detail & Related papers (2025-06-06T14:46:22Z) - Increasing Batch Size Improves Convergence of Stochastic Gradient Descent with Momentum [7.2620484413601325]
gradient descent with momentum (SGDM) has been well studied in both theory and practice.<n>We show theoretically that using a constant batch size does not always minimize the expectation of the full norm of the empirical loss in training a deep neural network.<n>We also provide numerical results supporting our analyses, indicating specifically that mini-batch SGDM with an increasing batch size converges to stationary points faster than with a constant batch size.
arXiv Detail & Related papers (2025-01-15T15:53:27Z) - Reinforcement Learning under Latent Dynamics: Toward Statistical and Algorithmic Modularity [51.40558987254471]
Real-world applications of reinforcement learning often involve environments where agents operate on complex, high-dimensional observations.
This paper addresses the question of reinforcement learning under $textitgeneral$ latent dynamics from a statistical and algorithmic perspective.
arXiv Detail & Related papers (2024-10-23T14:22:49Z) - Dynamic Estimation of Learning Rates Using a Non-Linear Autoregressive Model [0.0]
We introduce a new class of adaptive nonlinear autoregressive (Nlar) models incorporating the concept of momentum.<n>Within this framework, we propose three distinct estimators for learning rates and provide theoretical proof of their convergence.
arXiv Detail & Related papers (2024-10-13T17:55:58Z) - Randomness Helps Rigor: A Probabilistic Learning Rate Scheduler Bridging Theory and Deep Learning Practice [7.494722456816369]
We propose a probabilistic learning rate scheduler (PLRS)<n>PLRS does not conform to the monotonically decreasing condition, with provable convergence guarantees.<n>We show that PLRS performs as well as or better than existing state-of-the-art learning rate schedulers in terms of convergence as well as accuracy.
arXiv Detail & Related papers (2024-07-10T12:52:24Z) - Rich-Observation Reinforcement Learning with Continuous Latent Dynamics [43.84391209459658]
We introduce a new theoretical framework, RichCLD (Rich-Observation RL with Continuous Latent Dynamics), in which the agent performs control based on high-dimensional observations.
Our main contribution is a new algorithm for this setting that is provably statistically and computationally efficient.
arXiv Detail & Related papers (2024-05-29T17:02:49Z) - Hierarchical Decomposition of Prompt-Based Continual Learning:
Rethinking Obscured Sub-optimality [55.88910947643436]
Self-supervised pre-training is essential for handling vast quantities of unlabeled data in practice.
HiDe-Prompt is an innovative approach that explicitly optimize the hierarchical components with an ensemble of task-specific prompts and statistics.
Our experiments demonstrate the superior performance of HiDe-Prompt and its robustness to pre-training paradigms in continual learning.
arXiv Detail & Related papers (2023-10-11T06:51:46Z) - AdaSAM: Boosting Sharpness-Aware Minimization with Adaptive Learning
Rate and Momentum for Training Deep Neural Networks [76.90477930208982]
Sharpness aware (SAM) has been extensively explored as it can generalize better for training deep neural networks.
Integrating SAM with adaptive learning perturbation and momentum acceleration, dubbed AdaSAM, has already been explored.
We conduct several experiments on several NLP tasks, which show that AdaSAM could achieve superior performance compared with SGD, AMS, and SAMsGrad.
arXiv Detail & Related papers (2023-03-01T15:12:42Z) - Latent Variable Representation for Reinforcement Learning [131.03944557979725]
It remains unclear theoretically and empirically how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of model-based reinforcement learning.
We provide a representation view of the latent variable models for state-action value functions, which allows both tractable variational learning algorithm and effective implementation of the optimism/pessimism principle.
In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models.
arXiv Detail & Related papers (2022-12-17T00:26:31Z) - Guaranteed Conservation of Momentum for Learning Particle-based Fluid
Dynamics [96.9177297872723]
We present a novel method for guaranteeing linear momentum in learned physics simulations.
We enforce conservation of momentum with a hard constraint, which we realize via antisymmetrical continuous convolutional layers.
In combination, the proposed method allows us to increase the physical accuracy of the learned simulator substantially.
arXiv Detail & Related papers (2022-10-12T09:12:59Z) - Critical Parameters for Scalable Distributed Learning with Large Batches
and Asynchronous Updates [67.19481956584465]
It has been experimentally observed that the efficiency of distributed training with saturation (SGD) depends decisively on the batch size and -- in implementations -- on the staleness.
We show that our results are tight and illustrate key findings in numerical experiments.
arXiv Detail & Related papers (2021-03-03T12:08:23Z) - Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem)
AdaRem adjusts the parameter-wise learning rate according to whether the direction of one parameter changes in the past is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of the training speed and the test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z) - Accelerated Convergence for Counterfactual Learning to Rank [65.63997193915257]
We show that convergence rate of SGD approaches with IPS-weighted gradients suffers from the large variance introduced by the IPS weights.
We propose a novel learning algorithm, called CounterSample, that has provably better convergence than standard IPS-weighted gradient descent methods.
We prove that CounterSample converges faster and complement our theoretical findings with empirical results.
arXiv Detail & Related papers (2020-05-21T12:53:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.