Accelerated Convergence of Stochastic Heavy Ball Method under Anisotropic Gradient Noise
- URL: http://arxiv.org/abs/2312.14567v2
- Date: Sun, 17 Mar 2024 05:54:11 GMT
- Title: Accelerated Convergence of Stochastic Heavy Ball Method under Anisotropic Gradient Noise
- Authors: Rui Pan, Yuxing Liu, Xiaoyu Wang, Tong Zhang,
- Abstract summary: It is widely conjectured that heavy-ball momentum method can provide accelerated convergence and should work well in large batch settings.
We show that heavy-ball momentum can provide $tildemathcalO(sqrtkappa)$ accelerated convergence of the bias term of SGD while still achieving near-optimal convergence rate.
This means SGD with heavy-ball momentum is useful in the large-batch settings such as distributed machine learning or federated learning.
- Score: 16.12834917344859
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Heavy-ball momentum with decaying learning rates is widely used with SGD for optimizing deep learning models. In contrast to its empirical popularity, the understanding of its theoretical property is still quite limited, especially under the standard anisotropic gradient noise condition for quadratic regression problems. Although it is widely conjectured that heavy-ball momentum method can provide accelerated convergence and should work well in large batch settings, there is no rigorous theoretical analysis. In this paper, we fill this theoretical gap by establishing a non-asymptotic convergence bound for stochastic heavy-ball methods with step decay scheduler on quadratic objectives, under the anisotropic gradient noise condition. As a direct implication, we show that heavy-ball momentum can provide $\tilde{\mathcal{O}}(\sqrt{\kappa})$ accelerated convergence of the bias term of SGD while still achieving near-optimal convergence rate with respect to the stochastic variance term. The combined effect implies an overall convergence rate within log factors from the statistical minimax rate. This means SGD with heavy-ball momentum is useful in the large-batch settings such as distributed machine learning or federated learning, where a smaller number of iterations can significantly reduce the number of communication rounds, leading to acceleration in practice.
Related papers
- Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $mathcalO( ln(T) / T 1 - frac1alpha ).
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - The Marginal Value of Momentum for Small Learning Rate SGD [20.606430391298815]
Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without gradient noise regimes.
Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training where the optimal learning rate is not very large.
arXiv Detail & Related papers (2023-07-27T21:01:26Z) - Non-Parametric Learning of Stochastic Differential Equations with Non-asymptotic Fast Rates of Convergence [65.63201894457404]
We propose a novel non-parametric learning paradigm for the identification of drift and diffusion coefficients of non-linear differential equations.
The key idea essentially consists of fitting a RKHS-based approximation of the corresponding Fokker-Planck equation to such observations.
arXiv Detail & Related papers (2023-05-24T20:43:47Z) - On the fast convergence of minibatch heavy ball momentum [6.154018226934517]
We show that heavy ball momentum retains the fast linear rate of (deterministic) heavy ball momentum on optimization problems.
The algorithm we study can be interpreted as an accelerated randomized Kaczmarz algorithm with minibatching and heavy ball momentum.
arXiv Detail & Related papers (2022-06-15T14:12:45Z) - Last-iterate convergence analysis of stochastic momentum methods for
neural networks [3.57214198937538]
The momentum method is used to solve large-scale optimization problems in neural networks.
Current convergence results of momentum methods under artificial settings.
The momentum factors can be fixed to be constant, rather than in existing time.
arXiv Detail & Related papers (2022-05-30T02:17:44Z) - Optimizing Information-theoretical Generalization Bounds via Anisotropic
Noise in SGLD [73.55632827932101]
We optimize the information-theoretical generalization bound by manipulating the noise structure in SGLD.
We prove that with constraint to guarantee low empirical risk, the optimal noise covariance is the square root of the expected gradient covariance.
arXiv Detail & Related papers (2021-10-26T15:02:27Z) - Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to
Improve Generalization [89.7882166459412]
gradient noise (SGN) acts as implicit regularization for deep learning.
Some works attempted to artificially simulate SGN by injecting random noise to improve deep learning.
For simulating SGN at low computational costs and without changing the learning rate or batch size, we propose the Positive-Negative Momentum (PNM) approach.
arXiv Detail & Related papers (2021-03-31T16:08:06Z) - Reconciling Modern Deep Learning with Traditional Optimization Analyses:
The Intrinsic Learning Rate [36.83448475700536]
Recent works suggest that the use of Batch Normalization in today's deep learning can move it far from a traditional optimization viewpoint.
This paper highlights other ways in which behavior of normalized nets departs from traditional viewpoints.
We name it the Fast Equilibrium Conjecture and suggest it holds the key to why Batch Normalization is effective.
arXiv Detail & Related papers (2020-10-06T17:58:29Z) - Hessian-Free High-Resolution Nesterov Acceleration for Sampling [55.498092486970364]
Nesterov's Accelerated Gradient (NAG) for optimization has better performance than its continuous time limit (noiseless kinetic Langevin) when a finite step-size is employed.
This work explores the sampling counterpart of this phenonemon and proposes a diffusion process, whose discretizations can yield accelerated gradient-based MCMC methods.
arXiv Detail & Related papers (2020-06-16T15:07:37Z) - On Learning Rates and Schr\"odinger Operators [105.32118775014015]
We present a general theoretical analysis of the effect of the learning rate.
We find that the learning rate tends to zero for a broad non- neural class functions.
arXiv Detail & Related papers (2020-04-15T09:52:37Z) - Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum
under Heavy-Tailed Gradient Noise [39.9241638707715]
We show that FULD has similarities with enatural and egradient methods on their role in deep learning.
arXiv Detail & Related papers (2020-02-13T18:04:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.