Both Asymptotic and Non-Asymptotic Convergence of Quasi-Hyperbolic Momentum using Increasing Batch Size
- URL: http://arxiv.org/abs/2506.23544v2
- Date: Tue, 01 Jul 2025 05:38:12 GMT
- Title: Both Asymptotic and Non-Asymptotic Convergence of Quasi-Hyperbolic Momentum using Increasing Batch Size
- Authors: Kento Imaizumi, Hideaki Iiduka
- Abstract summary: Momentum methods were originally introduced for their superiority to stochastic gradient descent (SGD) in deterministic settings with convex functions. We show that achieving asymptotic convergence requires either a decaying learning rate or an increasing batch size.
- Score: 0.6906005491572401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Momentum methods were originally introduced for their superiority to stochastic gradient descent (SGD) in deterministic settings with convex objective functions. However, despite their widespread application to deep neural networks -- a representative case of stochastic nonconvex optimization -- the theoretical justification for their effectiveness in such settings remains limited. Quasi-hyperbolic momentum (QHM) is an algorithm that generalizes various momentum methods and has been studied to better understand the class of momentum-based algorithms as a whole. In this paper, we provide both asymptotic and non-asymptotic convergence results for mini-batch QHM with an increasing batch size. We show that achieving asymptotic convergence requires either a decaying learning rate or an increasing batch size. Since a decaying learning rate adversely affects non-asymptotic convergence, we demonstrate that using mini-batch QHM with an increasing batch size -- without decaying the learning rate -- can be a more effective strategy. Our experiments show that even a finite increase in batch size can provide benefits for training neural networks.
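As a concrete illustration of the update analyzed in the abstract, below is a minimal NumPy sketch of mini-batch QHM run with a constant learning rate and an increasing batch size. The synthetic least-squares problem, the hyperparameter values, and the batch-doubling schedule are illustrative assumptions, not the settings analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: minimize f(w) = (1/2n) * ||X w - y||^2.
n, d = 4096, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def minibatch_grad(w, batch_size):
    """Stochastic gradient of f at w, estimated from a random mini-batch."""
    idx = rng.choice(n, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

# QHM hyperparameters (illustrative values, not the paper's).
alpha = 0.05   # constant learning rate (no decay)
beta = 0.9     # momentum-averaging factor
nu = 0.7       # quasi-hyperbolic mixing weight

w = np.zeros(d)
m = np.zeros(d)      # momentum buffer (exponential moving average of gradients)
batch_size = 32      # initial batch size

for t in range(1, 401):
    g = minibatch_grad(w, batch_size)
    m = beta * m + (1.0 - beta) * g
    # QHM direction: convex combination of the raw gradient and the buffer.
    w -= alpha * ((1.0 - nu) * g + nu * m)

    # Increasing batch-size schedule (assumed here: double every 100 steps),
    # used instead of decaying the learning rate.
    if t % 100 == 0:
        batch_size = min(2 * batch_size, n)

print("final full-batch loss:", 0.5 * np.mean((X @ w - y) ** 2))
```

Growing the batch size shrinks the variance of the mini-batch gradient over time, which is the role a decaying learning rate would otherwise play; the abstract argues that this avoids the non-asymptotic penalty incurred by learning-rate decay.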
Related papers
- Structure and asymptotic preserving deep neural surrogates for uncertainty quantification in multiscale kinetic equations [5.181697052513637]
High dimensionality of kinetic equations with parameters poses computational challenges for uncertainty quantification (UQ). Traditional Monte Carlo (MC) sampling methods suffer from slow convergence and high variance, which become increasingly severe as the dimensionality of the parameter space grows. We introduce surrogate models based on structure and asymptotic preserving neural networks (SAPNNs), which are specifically designed to satisfy key physical properties, including positivity, conservation laws, entropy dissipation, and parameter limits.
arXiv Detail & Related papers (2025-06-12T12:20:53Z) - Non-asymptotic convergence analysis of the stochastic gradient Hamiltonian Monte Carlo algorithm with discontinuous stochastic gradient with applications to training of ReLU neural networks [7.170021086161168]
We provide a non-asymptotic analysis of the convergence of the stochastic gradient Hamiltonian Monte Carlo algorithm to a target measure in Wasserstein-1 and Wasserstein-2 distance. To illustrate our main results, we consider numerical experiments on quantile estimation and on several problems involving ReLU neural networks relevant in finance and artificial intelligence.
arXiv Detail & Related papers (2024-09-25T17:21:09Z) - Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}\big(\ln(T)/T^{1-\frac{1}{\alpha}}\big)$.
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - Accelerated Convergence of Stochastic Heavy Ball Method under Anisotropic Gradient Noise [16.12834917344859]
It is widely conjectured that heavy-ball momentum method can provide accelerated convergence and should work well in large batch settings.
We show that heavy-ball momentum can provide $\tilde{\mathcal{O}}(\sqrt{\kappa})$ accelerated convergence of the bias term of SGD while still achieving a near-optimal convergence rate.
This means SGD with heavy-ball momentum is useful in large-batch settings such as distributed machine learning or federated learning (a sketch of the heavy-ball update as a special case of the QHM rule appears after this list).
arXiv Detail & Related papers (2023-12-22T09:58:39Z) - Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have effectively been demonstrated in solving forward and inverse differential equation problems.
However, PINNs can be trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z) - PAPAL: A Provable PArticle-based Primal-Dual ALgorithm for Mixed Nash Equilibrium [58.26573117273626]
We consider two-player zero-sum continuous games with a nonconvex-nonconcave objective function and study the computation of mixed Nash equilibria.
We present novel insights into particle-based algorithms for continuous distribution strategies.
arXiv Detail & Related papers (2023-03-02T05:08:15Z) - Losing momentum in continuous-time stochastic optimisation [42.617042045455506]
Momentum-based optimisation algorithms have become particularly widespread.
In this work, we analyse a continuous-time model for gradient descent with momentum.
We also train a convolutional neural network in an image classification problem.
arXiv Detail & Related papers (2022-09-08T10:46:05Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Non-asymptotic estimates for TUSLA algorithm for non-convex learning with applications to neural networks with ReLU activation function [3.5044892799305956]
We provide a non-asymptotic analysis for the tamed unadjusted stochastic Langevin algorithm (TUSLA) introduced in Lovas et al.
In particular, we establish non-asymptotic error bounds for the TUSLA algorithm in Wasserstein-1 and Wasserstein-2 distances.
We show that the TUSLA algorithm converges rapidly to the optimal solution.
arXiv Detail & Related papers (2021-07-19T07:13:02Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence against the fact that a constant step-size learns faster, albeit only up to an error.
Rather than fixing the mini-batch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z) - Learning Rates as a Function of Batch Size: A Random Matrix Theory Approach to Neural Network Training [2.9649783577150837]
We study the effect of mini-batching on the loss landscape of deep neural networks using spiked, field-dependent random matrix theory.
We derive analytical expressions for the maximal learning rates as a function of batch size, informing practical training regimens for smooth, non-convex deep neural networks.
We validate our claims on VGG and ResNet architectures and the ImageNet dataset.
arXiv Detail & Related papers (2020-06-16T11:55:45Z)
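Following up on the heavy-ball entry above: the QHM direction used in the earlier sketch contains common momentum methods as special cases. The sketch below uses illustrative hyperparameter values and the standard QHM parameterization with an exponentially averaged buffer; setting nu = 0 gives plain SGD, while nu = 1 uses only the buffer, i.e. SGD with (normalized) heavy-ball momentum.

```python
import numpy as np

def qhm_step(w, m, g, alpha=0.05, beta=0.9, nu=0.7):
    """One QHM step: mix the raw gradient g with an EMA momentum buffer m."""
    m = beta * m + (1.0 - beta) * g
    w = w - alpha * ((1.0 - nu) * g + nu * m)
    return w, m

def sgd_step(w, m, g, alpha=0.05):
    """nu = 0: the buffer is ignored, so the step is plain SGD."""
    return qhm_step(w, m, g, alpha=alpha, nu=0.0)

def heavy_ball_step(w, m, g, alpha=0.05, beta=0.9):
    """nu = 1: the step uses only the (normalized) momentum buffer."""
    return qhm_step(w, m, g, alpha=alpha, beta=beta, nu=1.0)

# Tiny demo on f(w) = 0.5 * ||w||^2, whose gradient at w is w itself.
w, m = np.ones(3), np.zeros(3)
for _ in range(50):
    w, m = heavy_ball_step(w, m, g=w)
print("||w|| after 50 heavy-ball steps:", np.linalg.norm(w))
```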