On the SDEs and Scaling Rules for Adaptive Gradient Algorithms
- URL: http://arxiv.org/abs/2205.10287v3
- Date: Fri, 01 Nov 2024 02:01:18 GMT
- Title: On the SDEs and Scaling Rules for Adaptive Gradient Algorithms
- Authors: Sadhika Malladi, Kaifeng Lyu, Abhishek Panigrahi, Sanjeev Arora,
- Abstract summary: Approxing Gradient Descent (SGD) as a Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory.
This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of correctness as well as experimental validation of their applicability.
- Score: 45.007261870784475
- License:
- Abstract: Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. Analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there were no rigorously proven SDE approximations for these methods. This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scaling vision and language settings. A key practical result is the derivation of a $\textit{square root scaling rule}$ to adjust the optimization hyperparameters of RMSprop and Adam when changing batch size, and its empirical validation in deep learning settings.
Related papers
- Adaptive Methods through the Lens of SDEs: Theoretical Insights on the Role of Noise [15.535139686653611]
This work introduces novel SDEs for commonly used adaptive adaptives: SignSGD, RMSprop(W), and Adam(W)
These SDEs offer a quantitatively accurate description of theses and help illuminate an intricate relationship between adaptivity, curvature noise, and gradient.
We believe our approach can provide valuable insights into best training practices and novel scaling rules.
arXiv Detail & Related papers (2024-11-24T19:07:31Z) - Non-asymptotic Analysis of Biased Adaptive Stochastic Approximation [0.8192907805418583]
We show that biased gradients converge to critical points for smooth non- functions.
We show how the effect of bias can be reduced by appropriate tuning.
arXiv Detail & Related papers (2024-02-05T10:17:36Z) - Private Adaptive Gradient Methods for Convex Optimization [32.3523019355048]
We propose and analyze differentially private variants of a Gradient Descent (SGD) algorithm with adaptive stepsizes.
We provide upper bounds on the regret of both algorithms and show that the bounds are (worst-case) optimal.
arXiv Detail & Related papers (2021-06-25T16:46:45Z) - Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem)
AdaRem adjusts the parameter-wise learning rate according to whether the direction of one parameter changes in the past is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of the training speed and the test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z) - Bayesian Sparse learning with preconditioned stochastic gradient MCMC
and its applications [5.660384137948734]
The proposed algorithm converges to the correct distribution with a controllable bias under mild conditions.
We show that the proposed algorithm canally converge to the correct distribution with a controllable bias under mild conditions.
arXiv Detail & Related papers (2020-06-29T20:57:20Z) - Private Stochastic Non-Convex Optimization: Adaptive Algorithms and
Tighter Generalization Bounds [72.63031036770425]
We propose differentially private (DP) algorithms for bound non-dimensional optimization.
We demonstrate two popular deep learning methods on the empirical advantages over standard gradient methods.
arXiv Detail & Related papers (2020-06-24T06:01:24Z) - MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of
Gradients [112.00379151834242]
We propose adaptive learning rate principle, in which the running mean of squared gradient in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance each coordinate.
This results in faster adaptation, which leads more desirable empirical convergence behaviors.
arXiv Detail & Related papers (2020-06-21T21:47:43Z) - Stochastic Normalizing Flows [52.92110730286403]
We introduce normalizing flows for maximum likelihood estimation and variational inference (VI) using differential equations (SDEs)
Using the theory of rough paths, the underlying Brownian motion is treated as a latent variable and approximated, enabling efficient training of neural SDEs.
These SDEs can be used for constructing efficient chains to sample from the underlying distribution of a given dataset.
arXiv Detail & Related papers (2020-02-21T20:47:55Z) - On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization [80.03647903934723]
We prove adaptive gradient methods in expectation of gradient convergence methods.
Our analyses shed light on better adaptive gradient methods in optimizing non understanding gradient bounds.
arXiv Detail & Related papers (2018-08-16T20:25:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.