Accelerating SGDM via Learning Rate and Batch Size Schedules: A Lyapunov-Based Analysis
- URL: http://arxiv.org/abs/2508.03105v1
- Date: Tue, 05 Aug 2025 05:32:36 GMT
- Title: Accelerating SGDM via Learning Rate and Batch Size Schedules: A Lyapunov-Based Analysis
- Authors: Yuichi Kondo, Hideaki Iiduka
- Abstract summary: We analyze the convergence behavior of stochastic gradient descent with momentum (SGDM) under dynamic learning rate and batch size schedules. Specifically, we extend the theoretical framework to cover three practical scheduling strategies commonly used in deep learning.
- Score: 0.6906005491572401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We analyze the convergence behavior of stochastic gradient descent with momentum (SGDM) under dynamic learning rate and batch size schedules by introducing a novel Lyapunov function. This Lyapunov function has a simpler structure compared with existing ones, facilitating the challenging convergence analysis of SGDM and a unified analysis across various dynamic schedules. Specifically, we extend the theoretical framework to cover three practical scheduling strategies commonly used in deep learning: (i) constant batch size with a decaying learning rate, (ii) increasing batch size with a decaying learning rate, and (iii) increasing batch size with an increasing learning rate. Our theoretical results reveal a clear hierarchy in convergence behavior: while (i) does not guarantee convergence of the expected gradient norm, both (ii) and (iii) do. Moreover, (iii) achieves a provably faster decay rate than (i) and (ii), demonstrating theoretical acceleration even in the presence of momentum. Empirical results validate our theory, showing that dynamically scheduled SGDM significantly outperforms fixed-hyperparameter baselines in convergence speed. We also evaluated a warm-up schedule in experiments, which empirically outperformed all other strategies in convergence behavior. These findings provide a unified theoretical foundation and practical guidance for designing efficient and stable training procedures in modern deep learning.
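To make the three schedules concrete, here is a minimal sketch in NumPy on a toy least-squares problem; the decay/growth constants, the heavy-ball update form, and the epoch-wise switching are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10_000, 20))                  # synthetic least-squares data
x_true = rng.normal(size=20)
b = A @ x_true + 0.1 * rng.normal(size=10_000)

def stochastic_grad(x, batch_size):
    """Mini-batch gradient of the loss 0.5 * ||Ax - b||^2 / n."""
    idx = rng.integers(0, A.shape[0], size=batch_size)
    Ab, bb = A[idx], b[idx]
    return Ab.T @ (Ab @ x - bb) / batch_size

def sgdm(schedule, epochs=30, steps_per_epoch=50, beta=0.9):
    """Heavy-ball SGDM with a per-epoch learning-rate/batch-size schedule (illustrative constants)."""
    x, v = np.zeros(20), np.zeros(20)
    for e in range(epochs):
        if schedule == "constant_B_decay_lr":          # (i) constant batch, decaying lr
            lr, batch = 0.1 / (e + 1), 64
        elif schedule == "grow_B_decay_lr":            # (ii) growing batch, decaying lr
            lr, batch = 0.1 / (e + 1), 64 * 2 ** (e // 5)
        else:                                          # (iii) growing batch, growing lr
            lr, batch = 0.1 * 1.1 ** (e // 5), 64 * 2 ** (e // 5)
        for _ in range(steps_per_epoch):
            g = stochastic_grad(x, batch)
            v = beta * v + g                           # momentum buffer
            x = x - lr * v
    return np.linalg.norm(A.T @ (A @ x - b)) / A.shape[0]   # full-gradient norm

for s in ("constant_B_decay_lr", "grow_B_decay_lr", "grow_B_grow_lr"):
    print(s, sgdm(s))
```

Under schedule (iii), the growing batch size shrinks the gradient-noise variance while the growing learning rate keeps the steps large, which is the mechanism behind the faster decay rate claimed in the abstract.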
Related papers
- Optimal Growth Schedules for Batch Size and Learning Rate in SGD that Reduce SFO Complexity [0.6906005491572401]
Poorly chosen batch-size and learning-rate schedules in stochastic gradient methods can degrade efficiency and compromise convergence.
We theoretically derived optimal growth schedules for the batch size and learning rate that reduce SFO complexity.
Our results offer both theoretical insights and practical guidelines for scalable and efficient large-batch training in deep learning.
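For intuition only, SFO (stochastic first-order oracle) complexity can be read as the total number of per-sample gradient evaluations, i.e. the sum of batch sizes over all steps; the sketch below compares a fixed batch with a hypothetical geometric growth schedule (the constants are assumptions, not the derived optima).

```python
def sfo_complexity(batch_sizes):
    """SFO complexity = total number of per-sample stochastic gradient evaluations."""
    return sum(batch_sizes)

def geometric_batches(b0, growth, epochs, steps_per_epoch, interval):
    """Hypothetical schedule: multiply the batch size by `growth` every `interval` epochs."""
    return [b0 * growth ** (e // interval)
            for e in range(epochs) for _ in range(steps_per_epoch)]

constant = [128] * (30 * 50)                                   # fixed-batch baseline
growing = geometric_batches(16, 2, epochs=30, steps_per_epoch=50, interval=5)
print(sfo_complexity(constant), sfo_complexity(growing))       # compare total sample cost
```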
arXiv Detail & Related papers (2025-08-07T11:52:25Z)
- WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training [64.0932926819307]
We present Warmup-Stable and Merge (WSM), a framework that establishes a formal connection between learning rate decay and model merging.
WSM provides a unified theoretical foundation for emulating various decay strategies.
Our framework consistently outperforms the widely-adopted Warmup-Stable-Decay (WSD) approach across multiple benchmarks.
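A hedged sketch of the merging step: averaging the parameters of the last few checkpoints saved along a constant-learning-rate trajectory. The plain average, the dict layout, and the function name are assumptions; WSM's actual merging rule may differ.

```python
import numpy as np

def merge_checkpoints(checkpoints):
    """Average parameters across checkpoints saved along a constant-LR trajectory.

    `checkpoints` is a list of {param_name: np.ndarray} dicts; averaging them can
    emulate the smoothing effect of learning-rate decay without ever decaying the rate.
    """
    merged = {}
    for name in checkpoints[0]:
        merged[name] = np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
    return merged

# Toy usage: three checkpoints of a single weight matrix.
ckpts = [{"w": np.full((2, 2), float(i))} for i in range(3)]
print(merge_checkpoints(ckpts)["w"])   # -> all entries 1.0
```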
arXiv Detail & Related papers (2025-07-23T16:02:06Z)
- Similarity Matching Networks: Hebbian Learning and Convergence Over Multiple Time Scales [5.093257685701887]
We consider and analyze the similarity matching network for principal subspace projection.
By leveraging a multilevel optimization framework, we prove convergence of the dynamics in the offline setting.
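For background, a minimal online similarity-matching network as commonly used for principal subspace projection (Hebbian feedforward weights, anti-Hebbian lateral weights); the single learning rate and the closed-form output y = M^{-1} W x are simplifying assumptions and do not reproduce this paper's multilevel, multi-time-scale analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, T = 10, 3, 10_000
X = rng.normal(size=(T, d)) @ rng.normal(size=(d, d))   # correlated inputs

W = 0.1 * rng.normal(size=(k, d))   # feedforward (Hebbian) weights
M = np.eye(k)                        # lateral (anti-Hebbian) weights
eta = 1e-3

for t in range(T):
    x = X[t]
    y = np.linalg.solve(M, W @ x)               # equilibrium of the recurrent dynamics
    W += eta * (np.outer(y, x) - W)             # Hebbian update
    M += eta * (np.outer(y, y) - M)             # anti-Hebbian update

# F = M^{-1} W approximately projects onto the top-k principal subspace of the inputs.
F = np.linalg.solve(M, W)
print(F.shape)   # (3, 10)
```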
arXiv Detail & Related papers (2025-06-06T14:46:22Z)
- Reinforcement Learning under Latent Dynamics: Toward Statistical and Algorithmic Modularity [51.40558987254471]
Real-world applications of reinforcement learning often involve environments where agents operate on complex, high-dimensional observations.
This paper addresses the question of reinforcement learning under general latent dynamics from a statistical and algorithmic perspective.
arXiv Detail & Related papers (2024-10-23T14:22:49Z)
- Dynamic Estimation of Learning Rates Using a Non-Linear Autoregressive Model [0.0]
We introduce a new class of adaptive nonlinear autoregressive (Nlar) models incorporating the concept of momentum.
Within this framework, we propose three distinct estimators for learning rates and provide theoretical proof of their convergence.
arXiv Detail & Related papers (2024-10-13T17:55:58Z)
- Randomness Helps Rigor: A Probabilistic Learning Rate Scheduler Bridging Theory and Deep Learning Practice [7.494722456816369]
We propose a probabilistic learning rate scheduler (PLRS).
PLRS does not conform to the monotonically decreasing condition, yet comes with provable convergence guarantees.
We show that PLRS performs as well as or better than existing state-of-the-art learning rate schedulers in terms of both convergence and accuracy.
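One generic way to realize a non-monotone, probabilistic learning-rate schedule, shown purely as an illustration (this is an assumed construction, not necessarily the paper's PLRS rule): sample each step's learning rate around an envelope that decays on average, so individual draws may go up even though the envelope shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_lr(step, base=0.1, decay=1e-3):
    """Sample a learning rate around a decaying envelope; individual draws
    need not decrease monotonically, only the envelope does (illustrative)."""
    envelope = base / (1.0 + decay * step)
    return envelope * rng.uniform(0.5, 1.5)

lrs = [sample_lr(t) for t in range(5)]
print(lrs)   # a non-monotone sequence under a decaying envelope
```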
arXiv Detail & Related papers (2024-07-10T12:52:24Z)
- Rich-Observation Reinforcement Learning with Continuous Latent Dynamics [43.84391209459658]
We introduce a new theoretical framework, RichCLD (Rich-Observation RL with Continuous Latent Dynamics), in which the agent performs control based on high-dimensional observations.
Our main contribution is a new algorithm for this setting that is provably statistically and computationally efficient.
arXiv Detail & Related papers (2024-05-29T17:02:49Z)
- Hierarchical Decomposition of Prompt-Based Continual Learning: Rethinking Obscured Sub-optimality [55.88910947643436]
Self-supervised pre-training is essential for handling vast quantities of unlabeled data in practice.
HiDe-Prompt is an innovative approach that explicitly optimizes the hierarchical components with an ensemble of task-specific prompts and statistics.
Our experiments demonstrate the superior performance of HiDe-Prompt and its robustness to pre-training paradigms in continual learning.
arXiv Detail & Related papers (2023-10-11T06:51:46Z)
- AdaSAM: Boosting Sharpness-Aware Minimization with Adaptive Learning Rate and Momentum for Training Deep Neural Networks [76.90477930208982]
Sharpness-aware minimization (SAM) has been extensively explored as it can generalize better when training deep neural networks.
Integrating SAM with an adaptive learning rate and momentum acceleration, dubbed AdaSAM, has already been explored.
We conduct experiments on several NLP tasks, which show that AdaSAM could achieve superior performance compared with SGD, AMSGrad, and SAMsGrad.
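For reference, a single SAM-style step combined with momentum and an adaptive (Adam-like) second moment, sketched on a toy quadratic; the perturbation radius rho, the moment constants, and the toy loss are assumptions, and the sketch is not claimed to match AdaSAM's exact update.

```python
import numpy as np

def loss_grad(w):
    """Gradient of a toy quadratic loss 0.5 * ||w||^2."""
    return w

def sam_adaptive_step(w, m, v, t, lr=1e-2, rho=0.05, b1=0.9, b2=0.999, eps=1e-8):
    g = loss_grad(w)
    w_adv = w + rho * g / (np.linalg.norm(g) + eps)   # SAM ascent: perturb along the gradient
    g_adv = loss_grad(w_adv)                          # gradient at the perturbed point
    m = b1 * m + (1 - b1) * g_adv                     # momentum (first moment)
    v = b2 * v + (1 - b2) * g_adv ** 2                # adaptive second moment
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 101):
    w, m, v = sam_adaptive_step(w, m, v, t)
print(w)   # approaches the minimizer at the origin
```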
arXiv Detail & Related papers (2023-03-01T15:12:42Z)
- Latent Variable Representation for Reinforcement Learning [131.03944557979725]
It remains unclear theoretically and empirically how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of model-based reinforcement learning.
We provide a representation view of the latent variable models for state-action value functions, which allows both a tractable variational learning algorithm and an effective implementation of the optimism/pessimism principle.
In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models.
arXiv Detail & Related papers (2022-12-17T00:26:31Z)
- Guaranteed Conservation of Momentum for Learning Particle-based Fluid Dynamics [96.9177297872723]
We present a novel method for guaranteeing linear momentum conservation in learned physics simulations.
We enforce conservation of momentum with a hard constraint, which we realize via antisymmetrical continuous convolutional layers.
In combination, the proposed method allows us to increase the physical accuracy of the learned simulator substantially.
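The conservation argument itself is easy to illustrate: if every pairwise force is antisymmetric, F_ij = -F_ji, the forces sum to zero and total linear momentum cannot change. The kernel `phi` below is a simple stand-in for a learned antisymmetric continuous convolution and is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def pairwise_forces(positions, phi):
    """Antisymmetric pairwise forces: F_ij = phi(r_ij) * (x_i - x_j) = -F_ji."""
    diff = positions[:, None, :] - positions[None, :, :]       # x_i - x_j
    r = np.linalg.norm(diff, axis=-1, keepdims=True) + 1e-9
    forces = phi(r) * diff                                     # antisymmetric in (i, j)
    return forces.sum(axis=1)                                  # net force on each particle

pos = rng.normal(size=(5, 3))
net = pairwise_forces(pos, phi=lambda r: np.exp(-r))           # any radial kernel works
print(np.abs(net.sum(axis=0)).max())   # ~0: forces cancel, so momentum is conserved
```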
arXiv Detail & Related papers (2022-10-12T09:12:59Z)