Expectigrad: Fast Stochastic Optimization with Robust Convergence
Properties
- URL: http://arxiv.org/abs/2010.01356v2
- Date: Tue, 12 Oct 2021 17:17:06 GMT
- Title: Expectigrad: Fast Stochastic Optimization with Robust Convergence
Properties
- Authors: Brett Daley and Christopher Amato
- Abstract summary: We propose a novel method called Expectigrad, which adjusts stepsizes according to a per-component unweighted mean of all historical gradients and computes a bias-corrected momentum term jointly between the numerator and denominator.
We prove that Expectigrad cannot diverge on every instance of the optimization problem known to cause Adam to diverge.
- Score: 18.973116252065278
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many popular adaptive gradient methods such as Adam and RMSProp rely on an
exponential moving average (EMA) to normalize their stepsizes. While the EMA
makes these methods highly responsive to new gradient information, recent
research has shown that it also causes divergence on at least one convex
optimization problem. We propose a novel method called Expectigrad, which
adjusts stepsizes according to a per-component unweighted mean of all
historical gradients and computes a bias-corrected momentum term jointly
between the numerator and denominator. We prove that Expectigrad cannot diverge
on every instance of the optimization problem known to cause Adam to diverge.
We also establish a regret bound in the general stochastic nonconvex setting
that suggests Expectigrad is less susceptible to gradient variance than
existing methods are. Testing Expectigrad on several high-dimensional machine
learning tasks, we find it often performs favorably to state-of-the-art methods
with little hyperparameter tuning.
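To make the described update concrete, here is a minimal NumPy sketch of an Expectigrad-style step as suggested by the abstract: each component is normalized by the unweighted (arithmetic) mean of all historical squared gradients rather than by an EMA, and bias-corrected momentum is applied jointly to the gradient/denominator ratio. The exact algorithm (constants, handling of sparse gradients, etc.) is specified in the paper; the details below are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def expectigrad_sketch(grad_fn, x0, steps=1000, lr=1e-3, beta=0.9, eps=1e-8):
    """Illustrative Expectigrad-style optimizer loop (assumptions noted above)."""
    x = np.asarray(x0, dtype=float).copy()
    sum_sq = np.zeros_like(x)   # running sum of squared gradients (no decay)
    m = np.zeros_like(x)        # momentum over the normalized step, not the raw gradient
    for t in range(1, steps + 1):
        g = grad_fn(x)
        sum_sq += g * g
        mean_sq = sum_sq / t                      # unweighted per-component mean, not an EMA
        ratio = g / (eps + np.sqrt(mean_sq))      # numerator and denominator combined first
        m = beta * m + (1.0 - beta) * ratio       # momentum applied jointly to the ratio
        x -= lr * m / (1.0 - beta ** t)           # bias-corrected update
    return x

# Example: minimize f(x) = ||x||^2, whose gradient is 2x.
x_min = expectigrad_sketch(lambda x: 2.0 * x, x0=np.random.randn(5))
```

Intuitively, because the denominator weights all history equally, large but rare gradients are never forgotten by the normalizer, in contrast with an EMA; this is consistent with the abstract's claim about the divergence example that affects Adam.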
Related papers
- Randomised Splitting Methods and Stochastic Gradient Descent [0.0]
We introduce a new minibatching strategy (called Symmetric Minibatching Strategy) for gradient optimisation.
We provide improved convergence guarantees for this new minibatching strategy using Lyapunov techniques.
We argue that this also leads to a faster convergence rate when considering a decreasing stepsize schedule.
arXiv Detail & Related papers (2025-04-05T20:07:34Z) - A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning [74.80956524812714]
We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning.
These problems are often formalized as Bi-Level optimizations (BLO)
We introduce a novel perspective by turning a given BLO problem into a stochastic optimization, where the inner loss function becomes a smooth probability distribution, and the outer loss becomes an expected loss over the inner distribution.
arXiv Detail & Related papers (2024-10-14T12:10:06Z) - Methods for Convex $(L_0,L_1)$-Smooth Optimization: Clipping, Acceleration, and Adaptivity [50.25258834153574]
We focus on the class of (strongly) convex $(L_0,L_1)$-smooth functions and derive new convergence guarantees for several existing methods.
In particular, we derive improved convergence rates for Gradient Descent with Smoothed Gradient Clipping and for Gradient Descent with Polyak Stepsizes (standard forms of both methods are sketched after this list).
arXiv Detail & Related papers (2024-09-23T13:11:37Z) - An Adaptive Stochastic Gradient Method with Non-negative Gauss-Newton Stepsizes [17.804065824245402]
In machine learning applications, each loss function is non-negative and can be expressed as the composition of a square and its real-valued square root.
We show how to apply the Gauss-Newton method or the Levenberg-Marquardt method to minimize the average of smooth but possibly nonconvex functions.
arXiv Detail & Related papers (2024-07-05T08:53:06Z) - High-Probability Bounds for Stochastic Optimization and Variational
Inequalities: the Case of Unbounded Variance [59.211456992422136]
We propose algorithms with high-probability convergence results under less restrictive assumptions.
These results justify the usage of the considered methods for solving problems that do not fit standard functional classes in optimization.
arXiv Detail & Related papers (2023-02-02T10:37:23Z) - Adapting Stepsizes by Momentumized Gradients Improves Optimization and
Generalization [89.66571637204012]
AdaMomentum performs well on vision tasks and achieves state-of-the-art results consistently on other tasks including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z) - Convergence Properties of Stochastic Hypergradients [38.64355126221992]
We study approximation schemes for the hypergradient, which are important when the lower-level problem is empirical risk on a large dataset.
We provide numerical experiments to support our theoretical analysis and to show the advantage of using hypergradients in practice.
arXiv Detail & Related papers (2020-11-13T20:50:36Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic
Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence with the fact that a constant step-size learns faster in finite time, up to an error.
Rather than fixing the minibatch and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z) - Adaptive Gradient Methods Can Be Provably Faster than SGD after Finite
Epochs [25.158203665218164]
We show that adaptive gradient methods can be faster than random shuffling SGD after finite time.
To the best of our knowledge, this is the first work to demonstrate that adaptive gradient methods can be faster than SGD after finite time.
arXiv Detail & Related papers (2020-06-12T09:39:47Z) - Adaptive First-and Zeroth-order Methods for Weakly Convex Stochastic
Optimization Problems [12.010310883787911]
We analyze a new family of adaptive subgradient methods for solving an important class of weakly convex (possibly nonsmooth) optimization problems.
Experimental results indicate that the proposed algorithms empirically outperform stochastic gradient descent and its zeroth-order variant.
arXiv Detail & Related papers (2020-05-19T07:44:52Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of an optimistic adaptive gradient algorithm for nonconvex-nonconcave minmax problems.
Our experiments show that the advantage of adaptive gradient algorithms over non-adaptive ones in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
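For reference, the two baseline methods named in the $(L_0,L_1)$-smooth entry above have standard textbook forms; the sketch below shows those standard forms (assuming, for the Polyak stepsize, that the optimal value f* is known), not the specific smoothed-clipping variant analyzed in that paper.

```python
import numpy as np

def gd_polyak(f, grad_f, x0, f_star=0.0, steps=100):
    """Gradient Descent with the classical Polyak stepsize
    gamma_t = (f(x_t) - f_star) / ||grad f(x_t)||^2 (requires knowing f_star)."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(steps):
        g = grad_f(x)
        gnorm_sq = float(np.dot(g, g))
        if gnorm_sq == 0.0:
            break  # stationary point reached
        x -= (f(x) - f_star) / gnorm_sq * g
    return x

def gd_clipped(grad_f, x0, lr=0.1, clip=1.0, steps=100):
    """Gradient Descent with norm-based gradient clipping:
    the effective stepsize is min(lr, clip / ||g||)."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(steps):
        g = grad_f(x)
        gnorm = float(np.linalg.norm(g))
        x -= (lr if gnorm == 0.0 else min(lr, clip / gnorm)) * g
    return x

# Example: both recover the minimizer of f(x) = ||x||^2 (f_star = 0).
x1 = gd_polyak(lambda x: float(np.dot(x, x)), lambda x: 2.0 * x, np.random.randn(3))
x2 = gd_clipped(lambda x: 2.0 * x, np.random.randn(3))
```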