On the Last Iterate Convergence of Momentum Methods
- URL: http://arxiv.org/abs/2102.07002v1
- Date: Sat, 13 Feb 2021 21:16:16 GMT
- Title: On the Last Iterate Convergence of Momentum Methods
- Authors: Xiaoyu Li and Mingrui Liu and Francesco Orabona
- Abstract summary: We prove for the first time that for any constant momentum factor, there exists a Lipschitz and convex function for which the last iterate suffers from an error.
We show that in the setting with convex and smooth functions, our new SGDM algorithm automatically converges at a rate of $O(fraclog TT)$.
- Score: 32.60494006038902
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: SGD with Momentum (SGDM) is widely used for large scale optimization of
machine learning problems. Yet, the theoretical understanding of this algorithm
is not complete. In fact, even the most recent results require changes to the
algorithm like an averaging scheme and a projection onto a bounded domain,
which are never used in practice. Also, no lower bound is known for SGDM. In
this paper, we prove for the first time that for any constant momentum factor,
there exists a Lipschitz and convex function for which the last iterate of SGDM
suffers from an error $\Omega(\frac{\log T}{\sqrt{T}})$ after $T$ steps. Based
on this fact, we study a new class of (both adaptive and non-adaptive)
Follow-The-Regularized-Leader-based SGDM algorithms with \emph{increasing
momentum} and \emph{shrinking updates}. For these algorithms, we show that the
last iterate has optimal convergence $O (\frac{1}{\sqrt{T}})$ for unconstrained
convex optimization problems. Further, we show that in the interpolation
setting with convex and smooth functions, our new SGDM algorithm automatically
converges at a rate of $O(\frac{\log T}{T})$. Empirical results are shown as
well.
Related papers
- MGDA Converges under Generalized Smoothness, Provably [27.87166415148172]
Multi-objective optimization (MOO) is receiving more attention in various fields such as multi-task learning.
Recent works provide some effective algorithms with theoretical analysis but they are limited by the standard $L$-smooth or bounded-gradient assumptions.
We study a more general and realistic class of generalized $ell$-smooth loss functions, where $ell$ is a general non-decreasing function of gradient norm.
arXiv Detail & Related papers (2024-05-29T18:36:59Z) - Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods [25.831462008050387]
The Gradient Descent (SGD) algorithm has triggered people's interest due to its good performance in practice but lack of theoretical understanding.
It still remains unclear whether the lastiterate convergence can be provably extended to wider composite optimization and non-Euclidean norms.
arXiv Detail & Related papers (2023-12-13T21:41:06Z) - An Oblivious Stochastic Composite Optimization Algorithm for Eigenvalue
Optimization Problems [76.2042837251496]
We introduce two oblivious mirror descent algorithms based on a complementary composite setting.
Remarkably, both algorithms work without prior knowledge of the Lipschitz constant or smoothness of the objective function.
We show how to extend our framework to scale and demonstrate the efficiency and robustness of our methods on large scale semidefinite programs.
arXiv Detail & Related papers (2023-06-30T08:34:29Z) - Finite-Time Error Bounds for Greedy-GQ [20.51105692499517]
We show that Greedy-GQ algorithm converges fast as finite-time error.
Our analysis provides for faster convergence step-sizes for choosing step-sizes.
arXiv Detail & Related papers (2022-09-06T15:04:57Z) - Correcting Momentum with Second-order Information [50.992629498861724]
We develop a new algorithm for non-critical optimization that finds an $O(epsilon)$epsilon point in the optimal product.
We validate our results on a variety of large-scale deep learning benchmarks and architectures.
arXiv Detail & Related papers (2021-03-04T19:01:20Z) - A Momentum-Assisted Single-Timescale Stochastic Approximation Algorithm
for Bilevel Optimization [112.59170319105971]
We propose a new algorithm -- the Momentum- Single-timescale Approximation (MSTSA) -- for tackling problems.
MSTSA allows us to control the error in iterations due to inaccurate solution to the lower level subproblem.
arXiv Detail & Related papers (2021-02-15T07:10:33Z) - Last iterate convergence of SGD for Least-Squares in the Interpolation
regime [19.05750582096579]
We study the noiseless model in the fundamental least-squares setup.
We assume that an optimum predictor fits perfectly inputs and outputs $langle theta_*, phi(X) rangle = Y$, where $phi(X)$ stands for a possibly infinite dimensional non-linear feature map.
arXiv Detail & Related papers (2021-02-05T14:02:20Z) - Variance-Reduced Off-Policy TDC Learning: Non-Asymptotic Convergence
Analysis [27.679514676804057]
We develop a variance reduction scheme for the two time-scale TDC algorithm in the off-policy setting.
Experiments demonstrate that the proposed variance-reduced TDC achieves a smaller convergence error than both the conventional TDC and the variance-reduced TD.
arXiv Detail & Related papers (2020-10-26T01:33:05Z) - Hybrid Stochastic-Deterministic Minibatch Proximal Gradient:
Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
We show that HSDMPG can attain an $mathcalObig (1/sttnbig)$ which is at the order of excess error on a learning model.
For loss factors, we prove that HSDMPG can attain an $mathcalObig (1/sttnbig)$ which is at the order of excess error on a learning model.
arXiv Detail & Related papers (2020-09-18T02:18:44Z) - ROOT-SGD: Sharp Nonasymptotics and Near-Optimal Asymptotics in a Single Algorithm [71.13558000599839]
We study the problem of solving strongly convex and smooth unconstrained optimization problems using first-order algorithms.
We devise a novel, referred to as Recursive One-Over-T SGD, based on an easily implementable, averaging of past gradients.
We prove that it simultaneously achieves state-of-the-art performance in both a finite-sample, nonasymptotic sense and an sense.
arXiv Detail & Related papers (2020-08-28T14:46:56Z) - On the Almost Sure Convergence of Stochastic Gradient Descent in
Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of gradient descent (SGD)
We show that SGD avoids saddle points/manifolds with $1$ for strict step-size policies.
arXiv Detail & Related papers (2020-06-19T14:11:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.