Achieving acceleration despite very noisy gradients
- URL: http://arxiv.org/abs/2302.05515v2
- Date: Fri, 26 May 2023 01:57:53 GMT
- Title: Achieving acceleration despite very noisy gradients
- Authors: Kanan Gupta, Jonathan Siegel, Stephan Wojtowytsch
- Abstract summary: We present a generalization of Nesterov's accelerated gradient descent algorithm.
AGNES achieves accelerated convergence rate no matter how small the signal to noise ratio in the gradient estimate.
We show that AGNES outperforms gradient descent with momentum and Nesterov's method in the training of CNNs.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a generalization of Nesterov's accelerated gradient descent
algorithm. Our algorithm (AGNES) provably achieves acceleration for smooth
convex minimization tasks with noisy gradient estimates if the noise intensity
is proportional to the magnitude of the gradient. Nesterov's accelerated
gradient descent does not converge under this noise model if the constant of
proportionality exceeds one. AGNES fixes this deficiency and provably achieves
an accelerated convergence rate no matter how small the signal to noise ratio
in the gradient estimate. Empirically, we demonstrate that this is an
appropriate model for mini-batch gradients in overparameterized deep learning.
Finally, we show that AGNES outperforms stochastic gradient descent with
momentum and Nesterov's method in the training of CNNs.
Related papers
- Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance [55.01966743652196]
We propose a novel algorithm for distributed gradient descent (SGD) with compressed gradient communication in the parameter-server framework.
Our gradient compression technique, named flattened one-bit gradient descent (FO-SGD), relies on two simple algorithmic ideas.
arXiv Detail & Related papers (2024-05-17T21:17:27Z) - Acceleration and Implicit Regularization in Gaussian Phase Retrieval [5.484345596034159]
In this setting, we prove that methods with Polyak or Nesterov momentum implicit regularization ensures a nice convex descent.
Experimental evidence shows that these methods converge faster than gradient descent in practice.
arXiv Detail & Related papers (2023-11-21T04:10:03Z) - Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep
Models [158.19276683455254]
Adaptive gradient algorithms borrow the moving average idea of heavy ball acceleration to estimate accurate first second-order moments of gradient for accelerating convergence.
Nesterov acceleration converges faster than ball acceleration in theory and also in many empirical cases.
In this paper we develop a new Nesterov momentum estimation (NME) method, which avoids the extra computation and memory overhead of computing gradient at the point.
We show that Adan surpasses the corresponding SoTAs on both vision transformers (ViTs and CNNs) and sets new SoTAs for many popular networks.
arXiv Detail & Related papers (2022-08-13T16:04:39Z) - On Training Implicit Models [75.20173180996501]
We propose a novel gradient estimate for implicit models, named phantom gradient, that forgoes the costly computation of the exact gradient.
Experiments on large-scale tasks demonstrate that these lightweight phantom gradients significantly accelerate the backward passes in training implicit models by roughly 1.7 times.
arXiv Detail & Related papers (2021-11-09T14:40:24Z) - Revisiting the Characteristics of Stochastic Gradient Noise and Dynamics [25.95229631113089]
We show that the gradient noise possesses finite variance, and therefore the Central Limit Theorem (CLT) applies.
We then demonstrate the existence of the steady-state distribution of gradient descent and approximate the distribution at a small learning rate.
arXiv Detail & Related papers (2021-09-20T20:39:14Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Scaling transition from momentum stochastic gradient descent to plain
stochastic gradient descent [1.7874193862154875]
The momentum gradient descent uses the accumulated gradient as the updated direction of the current parameters.
The plain gradient descent has not been corrected by the accumulated gradient.
The TSGD algorithm has faster training speed, higher accuracy and better stability.
arXiv Detail & Related papers (2021-06-12T11:42:04Z) - Decreasing scaling transition from adaptive gradient descent to
stochastic gradient descent [1.7874193862154875]
We propose a decreasing scaling transition from adaptive gradient descent to gradient descent method DSTAda.
Our experimental results show that DSTAda has a faster speed, higher accuracy, and better stability and robustness.
arXiv Detail & Related papers (2021-06-12T11:28:58Z) - A Study of Gradient Variance in Deep Learning [56.437755740715396]
We introduce a method, Gradient Clustering, to minimize the variance of average mini-batch gradient with stratified sampling.
We measure the gradient variance on common deep learning benchmarks and observe that, contrary to common assumptions, gradient variance increases during training.
arXiv Detail & Related papers (2020-07-09T03:23:10Z) - Hessian-Free High-Resolution Nesterov Acceleration for Sampling [55.498092486970364]
Nesterov's Accelerated Gradient (NAG) for optimization has better performance than its continuous time limit (noiseless kinetic Langevin) when a finite step-size is employed.
This work explores the sampling counterpart of this phenonemon and proposes a diffusion process, whose discretizations can yield accelerated gradient-based MCMC methods.
arXiv Detail & Related papers (2020-06-16T15:07:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.