Adaptive Gradient Methods Can Be Provably Faster than SGD after Finite
Epochs
- URL: http://arxiv.org/abs/2006.07037v1
- Date: Fri, 12 Jun 2020 09:39:47 GMT
- Title: Adaptive Gradient Methods Can Be Provably Faster than SGD after Finite
Epochs
- Authors: Xunpeng Huang, Hao Zhou, Runxin Xu, Zhe Wang and Lei Li
- Abstract summary: We show that adaptive gradient methods can be faster than random shuffling SGD after finite time.
To the best of our knowledge, this is the first work to demonstrate that adaptive gradient methods can be faster than SGD after finite time.
- Score: 25.158203665218164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adaptive gradient methods have attracted much attention in the machine
learning community due to their high efficiency. However, their acceleration effect
in practice, especially in neural network training, is hard to analyze
theoretically. The large gap between theoretical convergence results and practical
performance prevents further understanding of existing optimizers and the
development of more advanced optimization methods. In this paper, we give adaptive
gradient methods a novel analysis under an additional mild assumption, and revise
AdaGrad to \radagrad to match a better provable convergence rate. To find an
$\epsilon$-approximate first-order stationary point of non-convex objectives, we
prove that random shuffling \radagrad achieves an $\tilde{O}(T^{-1/2})$ convergence
rate, which improves on existing adaptive gradient methods and random shuffling SGD
by factors of $\tilde{O}(T^{-1/4})$ and $\tilde{O}(T^{-1/6})$, respectively. To the
best of our knowledge, this is the first demonstration that adaptive gradient
methods can deterministically be faster than SGD after finitely many epochs.
Furthermore, we conduct comprehensive experiments to validate the additional mild
assumption and the acceleration effect gained from second moments and random
shuffling.
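To make the setting concrete, here is a minimal sketch of an AdaGrad-style optimizer run with random-shuffling (without-replacement) mini-batches, in the spirit of the revised method the abstract describes. The function names, hyperparameters (eta, eps, batch_size), and the plain AdaGrad accumulator are illustrative assumptions, not the paper's exact \radagrad algorithm.

```python
import numpy as np

def shuffled_adagrad(grad_fn, w, n_samples, batch_size=32, epochs=5,
                     eta=0.1, eps=1e-8, seed=0):
    """Sketch: AdaGrad-style adaptive steps with random-shuffling sampling.

    grad_fn(w, idx) is assumed to return the mini-batch gradient of the
    objective at parameters w on the examples indexed by idx.
    """
    rng = np.random.default_rng(seed)
    accum = np.zeros_like(w)  # running sum of squared gradients (second moments)
    for _ in range(epochs):
        perm = rng.permutation(n_samples)  # random shuffling: visit each example once per epoch
        for start in range(0, n_samples, batch_size):
            idx = perm[start:start + batch_size]
            g = grad_fn(w, idx)
            accum += g * g                          # accumulate coordinate-wise second moments
            w = w - eta * g / np.sqrt(accum + eps)  # adaptive, per-coordinate step size
    return w

# Toy usage: least squares f(w) = ||Xw - y||^2 / (2n), minimized at w = 1.
X = np.random.default_rng(1).normal(size=(256, 10))
y = X @ np.ones(10)
grad = lambda w, idx: X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
w_hat = shuffled_adagrad(grad, np.zeros(10), n_samples=256)
```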
Related papers
- A Methodology Establishing Linear Convergence of Adaptive Gradient Methods under PL Inequality [5.35599092568615]
We show that AdaGrad and Adam converge linearly when the cost function is smooth and satisfies the Polyak-Lojasiewicz inequality.
Our framework can potentially be utilized in analyzing linear convergence of other variants of Adam.
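For reference, the Polyak-Lojasiewicz (PL) inequality referred to here is the standard condition $\frac{1}{2}\|\nabla f(x)\|^2 \ge \mu\,(f(x) - f^*)$ for all $x$, where $f^*$ is the minimum value of $f$ and $\mu > 0$; this is the textbook statement, not necessarily the paper's exact assumption set.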
arXiv Detail & Related papers (2024-07-17T14:56:21Z)
- Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}(\ln(T)/T^{1-\frac{1}{\alpha}})$.
arXiv Detail & Related papers (2024-03-11T09:10:37Z)
- High Probability Analysis for Non-Convex Stochastic Optimization with Clipping [13.025261730510847]
Gradient clipping is a technique for dealing with heavy-tailed noise in neural network training.
Most existing theoretical guarantees only provide an in-expectation analysis of performance.
Our analysis provides a relatively complete picture for the theoretical guarantee of optimization algorithms with gradient clipping.
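For concreteness, the clipping operation these results concern can be sketched as a global-norm rescaling of the stochastic gradient; the threshold name tau and the toy heavy-tailed noise below are illustrative assumptions, not the analyzed algorithm itself.

```python
import numpy as np

def clip_by_global_norm(grad, tau):
    """Rescale grad so its Euclidean norm is at most tau (no-op if already smaller)."""
    norm = np.linalg.norm(grad)
    return grad if norm <= tau else grad * (tau / norm)

# Example: a heavy-tailed (Student-t) gradient estimate gets clipped to norm tau = 1.0
g = np.random.default_rng(0).standard_t(df=2, size=5)
print(np.linalg.norm(clip_by_global_norm(g, tau=1.0)))  # <= 1.0
```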
arXiv Detail & Related papers (2023-07-25T17:36:56Z)
- High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails [55.561406656549686]
We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails.
We show that a combination of gradient clipping, momentum, and normalized gradient descent converges to critical points in high probability with the best-known iteration complexity for smooth losses.
arXiv Detail & Related papers (2021-06-28T00:17:01Z)
- Leveraging Non-uniformity in First-order Non-convex Optimization [93.6817946818977]
Non-uniform refinement of objective functions leads to Non-uniform Smoothness (NS) and the Non-uniform Lojasiewicz inequality (NL).
These new definitions inspire new geometry-aware first-order methods that converge to global optimality faster than the classical $\Omega(1/t^2)$ lower bounds.
arXiv Detail & Related papers (2021-05-13T04:23:07Z)
- Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box Optimization Framework [100.36569795440889]
This work is on the iteration complexity of zeroth-order (ZO) optimization, which does not require first-order gradient information.
We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of iteration complexity and function query cost.
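As a concrete picture of the query model, the sketch below builds a two-point finite-difference gradient estimate over a randomly sampled subset of coordinates, using only function evaluations. It is a generic zeroth-order estimator under assumed names (f, mu, k), not the specific hybrid method or importance-sampling scheme of the paper.

```python
import numpy as np

def zo_coordinate_gradient(f, x, k=4, mu=1e-4, seed=0):
    """Estimate grad f(x) from 2*k function queries via random coordinate sampling."""
    rng = np.random.default_rng(seed)
    d = x.size
    coords = rng.choice(d, size=min(k, d), replace=False)  # uniformly sampled coordinates
    g = np.zeros(d)
    for i in coords:
        e = np.zeros(d)
        e[i] = 1.0
        # two-point finite difference along coordinate i, rescaled so the
        # estimator is unbiased (up to O(mu) error) under uniform sampling
        g[i] = (f(x + mu * e) - f(x - mu * e)) / (2 * mu) * (d / len(coords))
    return g

# Example: estimate the gradient of a quadratic using only function values
f = lambda x: 0.5 * np.sum(x ** 2)
print(zo_coordinate_gradient(f, np.arange(4.0)))  # approximately [0, 1, 2, 3]
```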
arXiv Detail & Related papers (2020-12-21T17:29:58Z)
- A Unified Analysis of First-Order Methods for Smooth Games via Integral Quadratic Constraints [10.578409461429626]
In this work, we adapt integral quadratic constraints theory to first-order methods for smooth and strongly monotone games.
We provide, for the first time, a global convergence rate for the negative momentum method (NM) with an iteration complexity of $\mathcal{O}(\kappa^{1.5})$, which matches its known lower bound.
We show that it is impossible for an algorithm with one step of memory to achieve acceleration if it only queries the gradient once per batch.
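For intuition, the negative momentum (NM) method mentioned above amounts to gradient descent-ascent with a heavy-ball term whose coefficient is negative. The quadratic toy game, step size, and momentum value below are illustrative assumptions chosen to make the sketch runnable, not the paper's analysis.

```python
# Toy strongly monotone quadratic game: min_x max_y  0.5*x^2 + x*y - 0.5*y^2,
# whose unique equilibrium is at the origin.
def negative_momentum_gda(steps=300, eta=0.1, beta=-0.2):
    x, y = 1.0, 1.0
    x_prev, y_prev = x, y
    for _ in range(steps):
        gx = x + y   # partial derivative w.r.t. x
        gy = x - y   # partial derivative w.r.t. y
        x_new = x - eta * gx + beta * (x - x_prev)  # descent step with negative heavy-ball term
        y_new = y + eta * gy + beta * (y - y_prev)  # ascent step with negative heavy-ball term
        x_prev, y_prev, x, y = x, y, x_new, y_new
    return x, y

print(negative_momentum_gda())  # both coordinates shrink toward the equilibrium (0, 0)
```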
arXiv Detail & Related papers (2020-09-23T20:02:00Z)
- Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the optimistic gradient algorithm for nonconvex-nonconcave min-max problems.
Our experiments show that the advantage of adaptive over non-adaptive gradient algorithms in GAN training can be observed empirically.
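To illustrate the "optimistic" update rule this line of work builds on, here is a generic optimistic gradient descent-ascent sketch on a toy bilinear game (where plain simultaneous gradient descent-ascent diverges). It omits the adaptive per-coordinate scaling studied in the paper; the game, step size, and iteration count are illustrative assumptions.

```python
# Toy bilinear game: min_x max_y  x * y, with equilibrium at the origin.
# Optimistic update: step along (2 * current gradient - previous gradient).
def optimistic_gda(steps=2000, eta=0.1):
    x, y = 1.0, 1.0
    gx_prev, gy_prev = y, x               # gradients at the initial point
    for _ in range(steps):
        gx, gy = y, x                     # grad_x(x*y) = y, grad_y(x*y) = x
        x = x - eta * (2 * gx - gx_prev)  # optimistic descent step (min player)
        y = y + eta * (2 * gy - gy_prev)  # optimistic ascent step (max player)
        gx_prev, gy_prev = gx, gy
    return x, y

print(optimistic_gda())  # the iterates spiral inward toward (0, 0)
```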
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
- On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization [80.03647903934723]
We prove convergence rates in expectation for adaptive gradient methods in nonconvex stochastic optimization.
Our analyses shed light on a better understanding of adaptive gradient methods for nonconvex optimization.
arXiv Detail & Related papers (2018-08-16T20:25:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.