Related papers: On the Last Iterate Convergence of Momentum Methods

On the Last Iterate Convergence of Momentum Methods

URL: http://arxiv.org/abs/2102.07002v1
Date: Sat, 13 Feb 2021 21:16:16 GMT
Title: On the Last Iterate Convergence of Momentum Methods
Authors: Xiaoyu Li and Mingrui Liu and Francesco Orabona
Abstract summary: We prove for the first time that for any constant momentum factor, there exists a Lipschitz and convex function for which the last iterate suffers from an error. We show that in the setting with convex and smooth functions, our new SGDM algorithm automatically converges at a rate of $O(fraclog TT)$.
Score: 32.60494006038902
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: SGD with Momentum (SGDM) is widely used for large scale optimization of machine learning problems. Yet, the theoretical understanding of this algorithm is not complete. In fact, even the most recent results require changes to the algorithm like an averaging scheme and a projection onto a bounded domain, which are never used in practice. Also, no lower bound is known for SGDM. In this paper, we prove for the first time that for any constant momentum factor, there exists a Lipschitz and convex function for which the last iterate of SGDM suffers from an error $\Omega(\frac{\log T}{\sqrt{T}})$ after $T$ steps. Based on this fact, we study a new class of (both adaptive and non-adaptive) Follow-The-Regularized-Leader-based SGDM algorithms with \emph{increasing momentum} and \emph{shrinking updates}. For these algorithms, we show that the last iterate has optimal convergence $O (\frac{1}{\sqrt{T}})$ for unconstrained convex optimization problems. Further, we show that in the interpolation setting with convex and smooth functions, our new SGDM algorithm automatically converges at a rate of $O(\frac{\log T}{T})$. Empirical results are shown as well.

Related papers

Inverting the Leverage Score Gradient: An Efficient Approximate Newton Method [10.742859956268655]
This paper aims to recover the intrinsic model parameters given the leverage scores gradient. We specifically scrutinize the inversion of the leverage score gradient, denoted as $g(x)$.
arXiv Detail & Related papers (2024-08-21T01:39:42Z)
MGDA Converges under Generalized Smoothness, Provably [27.87166415148172]
Multi-objective optimization (MOO) is receiving more attention in various fields such as multi-task learning. Recent works provide some effective algorithms with theoretical analysis but they are limited by the standard $L$-smooth or bounded-gradient assumptions. We study a more general and realistic class of generalized $ell$-smooth loss functions, where $ell$ is a general non-decreasing function of gradient norm.
arXiv Detail & Related papers (2024-05-29T18:36:59Z)
Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods [25.831462008050387]
The Gradient Descent (SGD) algorithm has triggered people's interest due to its good performance in practice but lack of theoretical understanding. It still remains unclear whether the lastiterate convergence can be provably extended to wider composite optimization and non-Euclidean norms.
arXiv Detail & Related papers (2023-12-13T21:41:06Z)
Convergence Analysis of Decentralized ASGD [1.8710230264817358]
We present a novel convergence-rate analysis for decentralized asynchronous SGD (DASGD) which does not require partial synchronization among nodes nor restrictive network topologies. Our convergence proof holds for a fixed stepsize and any nonsmooth, homogeneous, L-shaped objective function.
arXiv Detail & Related papers (2023-09-07T14:50:31Z)
An Oblivious Stochastic Composite Optimization Algorithm for Eigenvalue Optimization Problems [76.2042837251496]
We introduce two oblivious mirror descent algorithms based on a complementary composite setting. Remarkably, both algorithms work without prior knowledge of the Lipschitz constant or smoothness of the objective function. We show how to extend our framework to scale and demonstrate the efficiency and robustness of our methods on large scale semidefinite programs.
arXiv Detail & Related papers (2023-06-30T08:34:29Z)
Finite-Time Error Bounds for Greedy-GQ [20.51105692499517]
We show that Greedy-GQ algorithm converges fast as finite-time error. Our analysis provides for faster convergence step-sizes for choosing step-sizes.
arXiv Detail & Related papers (2022-09-06T15:04:57Z)
Gaussian Process Bandit Optimization with Few Batches [49.896920704012395]
We introduce a batch algorithm inspired by finite-arm bandit algorithms. We show that it achieves the cumulative regret upper bound $Oast(sqrtTgamma_T)$ using $O(loglog T)$ batches within time horizon $T$. In addition, we propose a modified version of our algorithm, and characterize how the regret is impacted by the number of batches.
arXiv Detail & Related papers (2021-10-15T00:54:04Z)
$\ar{G}_{mst}$:An Unbiased Stratified Statistic and a Fast Gradient Optimization Algorithm Based on It [0.0]
A novel algorithm named MSSG designed based on $barG_mst$ outperforms other sgd-like algorithms.
arXiv Detail & Related papers (2021-10-07T11:48:55Z)
Correcting Momentum with Second-order Information [50.992629498861724]
We develop a new algorithm for non-critical optimization that finds an $O(epsilon)$epsilon point in the optimal product. We validate our results on a variety of large-scale deep learning benchmarks and architectures.
arXiv Detail & Related papers (2021-03-04T19:01:20Z)
A Momentum-Assisted Single-Timescale Stochastic Approximation Algorithm for Bilevel Optimization [112.59170319105971]
We propose a new algorithm -- the Momentum- Single-timescale Approximation (MSTSA) -- for tackling problems. MSTSA allows us to control the error in iterations due to inaccurate solution to the lower level subproblem.
arXiv Detail & Related papers (2021-02-15T07:10:33Z)
Last iterate convergence of SGD for Least-Squares in the Interpolation regime [19.05750582096579]
We study the noiseless model in the fundamental least-squares setup. We assume that an optimum predictor fits perfectly inputs and outputs $langle theta_*, phi(X) rangle = Y$, where $phi(X)$ stands for a possibly infinite dimensional non-linear feature map.
arXiv Detail & Related papers (2021-02-05T14:02:20Z)
Variance-Reduced Off-Policy TDC Learning: Non-Asymptotic Convergence Analysis [27.679514676804057]
We develop a variance reduction scheme for the two time-scale TDC algorithm in the off-policy setting. Experiments demonstrate that the proposed variance-reduced TDC achieves a smaller convergence error than both the conventional TDC and the variance-reduced TD.
arXiv Detail & Related papers (2020-10-26T01:33:05Z)
Hybrid Stochastic-Deterministic Minibatch Proximal Gradient: Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
We show that HSDMPG can attain an $mathcalObig (1/sttnbig)$ which is at the order of excess error on a learning model. For loss factors, we prove that HSDMPG can attain an $mathcalObig (1/sttnbig)$ which is at the order of excess error on a learning model.
arXiv Detail & Related papers (2020-09-18T02:18:44Z)
ROOT-SGD: Sharp Nonasymptotics and Near-Optimal Asymptotics in a Single Algorithm [71.13558000599839]
We study the problem of solving strongly convex and smooth unconstrained optimization problems using first-order algorithms. We devise a novel, referred to as Recursive One-Over-T SGD, based on an easily implementable, averaging of past gradients. We prove that it simultaneously achieves state-of-the-art performance in both a finite-sample, nonasymptotic sense and an sense.
arXiv Detail & Related papers (2020-08-28T14:46:56Z)
On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of gradient descent (SGD) We show that SGD avoids saddle points/manifolds with $1$ for strict step-size policies.
arXiv Detail & Related papers (2020-06-19T14:11:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.