Grad-GradaGrad? A Non-Monotone Adaptive Stochastic Gradient Method
- URL: http://arxiv.org/abs/2206.06900v1
- Date: Tue, 14 Jun 2022 14:55:27 GMT
- Title: Grad-GradaGrad? A Non-Monotone Adaptive Stochastic Gradient Method
- Authors: Aaron Defazio, Baoyu Zhou, Lin Xiao
- Abstract summary: We introduce GradaGrad, a method in the same family that naturally grows or shrinks the learning rate based on a different accumulation in the denominator.
We show that it achieves a convergence rate similar to that of AdaGrad and demonstrate its non-monotone adaptation capability with experiments.
- Score: 17.275654092947647
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The classical AdaGrad method adapts the learning rate by dividing by the
square root of a sum of squared gradients. Because this sum in the denominator
is increasing, the method can only decrease step sizes over time, and requires
a learning rate scaling hyper-parameter to be carefully tuned. To overcome this
restriction, we introduce GradaGrad, a method in the same family that naturally
grows or shrinks the learning rate based on a different accumulation in the
denominator, one that can both increase and decrease. We show that it achieves
a convergence rate similar to that of AdaGrad and demonstrate its non-monotone
adaptation capability with experiments.
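To make the contrast concrete, here is a minimal NumPy sketch. The first function is the standard diagonal AdaGrad update, whose accumulator can only grow; the second illustrates the non-monotone idea with a placeholder correction term. The correction used below is an assumption chosen purely for illustration and is not the accumulation rule derived in the GradaGrad paper.

```python
import numpy as np

def adagrad_step(x, g, accum, lr=1.0, eps=1e-8):
    """Diagonal AdaGrad: the accumulator only grows, so the effective
    step size lr / sqrt(accum) can only shrink over time."""
    accum = accum + g * g
    x = x - lr * g / (np.sqrt(accum) + eps)
    return x, accum

def nonmonotone_step(x, g, g_prev, accum, lr=1.0, eps=1e-8):
    """Schematic GradaGrad-style update: the accumulator may also
    decrease, so the effective step size can grow again. The correction
    term is a hypothetical illustration, NOT the paper's rule."""
    correction = -g * g_prev                              # placeholder shrink/grow signal
    accum = np.maximum(accum + g * g + correction, eps)   # keep the denominator positive
    x = x - lr * g / (np.sqrt(accum) + eps)
    return x, accum

# Toy usage on f(x) = 0.5 * ||x||^2, whose gradient is simply x.
x, accum, g_prev = np.ones(3), np.zeros(3), np.zeros(3)
for _ in range(10):
    g = x.copy()
    x, accum = nonmonotone_step(x, g, g_prev, accum)
    g_prev = g
print(x)
```

The only point of the second function is that its denominator can shrink as well as grow; the precise accumulation rule that makes such non-monotone adaptation provably convergent is the contribution of the paper.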
Related papers
- An Adaptive Stochastic Gradient Method with Non-negative Gauss-Newton Stepsizes [17.804065824245402]
In machine learning applications, each loss function is non-negative and can therefore be expressed as the composition of a square and its real-valued square root.
We show how to apply the Gauss-Newton method, or the Levenberg-Marquardt method, to minimize the average of smooth but possibly non-convex functions.
arXiv Detail & Related papers (2024-07-05T08:53:06Z)
- Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}\big(\ln(T) / T^{1 - \frac{1}{\alpha}}\big)$.
arXiv Detail & Related papers (2024-03-11T09:10:37Z)
- Interpreting Adaptive Gradient Methods by Parameter Scaling for Learning-Rate-Free Optimization [14.009179786857802]
We address the challenge of estimating the learning rate for adaptive gradient methods used in training deep neural networks.
While several learning-rate-free approaches have been proposed, they are typically tailored for steepest descent.
In this paper, we interpret adaptive gradient methods as steepest descent applied on parameter-scaled networks.
arXiv Detail & Related papers (2024-01-06T15:45:29Z)
- Aiming towards the minimizers: fast convergence of SGD for overparametrized problems [25.077446336619378]
We propose a regularity condition which endows the stochastic gradient method with the same worst-case complexity as the deterministic gradient method.
In contrast, all existing guarantees require the stochastic gradient method to take small steps, thereby resulting in a much slower linear rate of convergence.
We demonstrate that our condition holds when training sufficiently wide feedforward neural networks with a linear output layer.
arXiv Detail & Related papers (2023-06-05T05:21:01Z)
- Continuous-Time Meta-Learning with Forward Mode Differentiation [65.26189016950343]
We introduce Continuous-Time Meta-Learning (COMLN), a meta-learning algorithm where adaptation follows the dynamics of a gradient vector field.
Treating the learning process as an ODE offers the notable advantage that the length of the trajectory is now continuous.
We show empirically its efficiency in terms of runtime and memory usage, and we illustrate its effectiveness on a range of few-shot image classification problems.
arXiv Detail & Related papers (2022-03-02T22:35:58Z)
- Faster One-Sample Stochastic Conditional Gradient Method for Composite Convex Minimization [61.26619639722804]
We propose a stochastic conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms.
The proposed method, equipped with a stochastic average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques.
arXiv Detail & Related papers (2022-02-26T19:10:48Z)
- A Study of Gradient Variance in Deep Learning [56.437755740715396]
We introduce a method, Gradient Clustering, to minimize the variance of the average mini-batch gradient with stratified sampling.
We measure the gradient variance on common deep learning benchmarks and observe that, contrary to common assumptions, gradient variance increases during training.
arXiv Detail & Related papers (2020-07-09T03:23:10Z)
- Carathéodory Sampling for Stochastic Gradient Descent [79.55586575988292]
We present an approach that is inspired by classical results of Tchakaloff and Carathéodory about measure reduction.
We adaptively select the descent steps where the measure reduction is carried out.
We combine this with Block Coordinate Descent so that measure reduction can be done very cheaply.
arXiv Detail & Related papers (2020-06-02T17:52:59Z)
- Explore Aggressively, Update Conservatively: Stochastic Extragradient Methods with Variable Stepsize Scaling [34.35013145885164]
Extragradient methods have become a staple for solving large-scale saddle-point problems in machine learning.
We show in this paper that running vanilla extragradient with stochastic gradients may jeopardize its convergence, even in simple bilinear models.
The proposed remedy is a variable stepsize scaling scheme: exploring aggressively with a larger stepsize and updating conservatively with a smaller one.
We show that this modification allows the method to converge even with stochastic gradients, and we derive sharp convergence rates under an error bound condition.
arXiv Detail & Related papers (2020-03-23T10:24:27Z)
- Disentangling Adaptive Gradient Methods from Learning Rates [65.0397050979662]
We take a deeper look at how adaptive gradient methods interact with the learning rate schedule.
We introduce a "grafting" experiment which decouples an update's magnitude from its direction.
We present some empirical and theoretical retrospectives on the generalization of adaptive gradient methods.
arXiv Detail & Related papers (2020-02-26T21:42:49Z)
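For the grafting experiment mentioned in the entry above, here is a minimal sketch of the decoupling it describes, under the assumption of a single global norm: one optimizer's proposed step donates its magnitude, the other donates its unit direction. The function name and the global-norm choice are illustrative assumptions, not the paper's exact protocol (which may operate per layer or per parameter group).

```python
import numpy as np

def grafted_step(magnitude_step, direction_step, eps=1e-12):
    """Combine two optimizers' proposed steps: take the norm of one
    (the "magnitude donor") and the unit direction of the other
    (the "direction donor"). Global-norm variant for illustration;
    a per-layer variant would apply the same recipe per parameter group."""
    m = np.linalg.norm(magnitude_step)
    d = direction_step / (np.linalg.norm(direction_step) + eps)
    return m * d

# Example: an SGD step donates the magnitude, an Adam-like step the direction.
sgd_step = -0.1 * np.array([0.3, -1.2, 0.7])
adam_like_step = np.array([-0.01, 0.02, -0.015])
print(grafted_step(sgd_step, adam_like_step))
```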