Gradient Norm Minimization of Nesterov Acceleration: $o(1/k^3)$
- URL: http://arxiv.org/abs/2209.08862v1
- Date: Mon, 19 Sep 2022 09:10:35 GMT
- Title: Gradient Norm Minimization of Nesterov Acceleration: $o(1/k^3)$
- Authors: Shuo Chen, Bin Shi, Ya-xiang Yuan
- Abstract summary: Nesterov's accelerated gradient descent (NAG) is one of the milestones of first-order algorithms.
In this paper, we provide a significantly simplified proof based on precise observation and a tighter inequality for $L$-smooth functions.
- Score: 6.53306151979817
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In the history of first-order algorithms, Nesterov's accelerated gradient
descent (NAG) is one of the milestones. However, the cause of the acceleration
had long been a mystery; it was not attributed to the gradient correction term until the high-resolution differential equation
framework proposed in [Shi et al., 2021]. In this paper, we continue to
investigate the acceleration phenomenon. First, we provide a significantly
simplified proof based on precise observation and a tighter inequality for
$L$-smooth functions. Then, a new implicit-velocity high-resolution
differential equation framework, as well as the corresponding implicit-velocity
version of phase-space representation and Lyapunov function, is proposed to
investigate the convergence behavior of the iterative sequence
$\{x_k\}_{k=0}^{\infty}$ of NAG. Furthermore, from two kinds of phase-space
representations, we find that the role played by the gradient correction is
equivalent to that played by the velocity included implicitly in the gradient; the
only difference is that the iterative sequence $\{y_{k}\}_{k=0}^{\infty}$ is
replaced by $\{x_k\}_{k=0}^{\infty}$. Finally, for the open question of whether
the gradient norm minimization of NAG attains a faster rate $o(1/k^3)$, we give
a positive answer together with a proof. Meanwhile, a faster rate of objective
value minimization, $o(1/k^2)$, is shown for the case $r > 2$.
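For concreteness, the iteration in question is the standard NAG scheme for an $L$-smooth convex function $f$. The display below is a minimal sketch assuming the usual parameterization with momentum coefficient $\frac{k-1}{k+r}$ and step size $0 < s \le 1/L$; the paper's exact indexing and constants may differ:
$$
y_{k} = x_{k} + \frac{k-1}{k+r}\,(x_{k} - x_{k-1}), \qquad
x_{k+1} = y_{k} - s\,\nabla f(y_{k}),
$$
and the abstract's claims then read
$$
\min_{0 \le i \le k} \|\nabla f(x_i)\|^{2} = o\!\left(\frac{1}{k^{3}}\right), \qquad
f(x_{k}) - f(x^{\star}) = o\!\left(\frac{1}{k^{2}}\right) \ \text{for } r > 2.
$$
The short numerical sketch below (illustrative only, not the authors' code; the least-squares test function, random data, and parameter choices are assumptions) runs this iteration and prints $k^{3}\cdot\min_{i\le k}\|\nabla f(x_i)\|^{2}$, which should drift toward zero if the $o(1/k^{3})$ rate holds:
```python
# Minimal numerical sketch (illustrative only, not the authors' code).
# Runs NAG with momentum (k-1)/(k+r), r = 2, on the L-smooth convex
# least-squares objective f(x) = 0.5 * ||A x - b||^2 and reports
# k^3 * min_{i<=k} ||grad f(x_i)||^2, which should drift toward zero.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)

def grad(x):
    # Gradient of f(x) = 0.5 * ||A x - b||^2.
    return A.T @ (A @ x - b)

L = np.linalg.norm(A, 2) ** 2   # smoothness constant: largest eigenvalue of A^T A
s = 1.0 / L                     # step size s <= 1/L
r = 2                           # classical NAG momentum (k-1)/(k+2)

x_prev = x = np.zeros(A.shape[1])
min_sq_grad = np.inf
for k in range(1, 10001):
    y = x + (k - 1) / (k + r) * (x - x_prev)
    x_prev, x = x, y - s * grad(y)
    min_sq_grad = min(min_sq_grad, float(np.linalg.norm(grad(x)) ** 2))
    if k % 2000 == 0:
        print(f"k = {k:5d}   k^3 * min ||grad||^2 = {k**3 * min_sq_grad:.3e}")
```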
Related papers
- A Family of Controllable Momentum Coefficients for Forward-Backward Accelerated Algorithms [4.404496835736175]
Nesterov's accelerated gradient method (NAG) marks a pivotal advancement in gradient-based optimization.
Its algorithmic complexity when applied to strongly convex functions remains unknown.
We introduce a family of controllable momentum coefficients for forward-backward accelerated methods.
arXiv Detail & Related papers (2025-01-17T09:15:18Z) - Anytime Acceleration of Gradient Descent [92.02082223856479]
We propose a stepsize schedule that allows gradient descent to achieve convergence guarantees of $O(T^{-1.03})$ for any stopping time $T$.
We extend our theory to yield convergence guarantees of $\exp(-\Omega(T/\kappa^{0.97}))$ for smooth and strongly convex optimization.
arXiv Detail & Related papers (2024-11-26T18:29:44Z) - Second-order Information Promotes Mini-Batch Robustness in Variance-Reduced Gradients [0.196629787330046]
We show that incorporating partial second-order information of the objective function can dramatically improve robustness to mini-batch size of variance-reduced gradient methods.
We demonstrate this phenomenon on a prototypical Newton ($\texttt{Mb-SVRN}$) algorithm.
arXiv Detail & Related papers (2024-04-23T05:45:52Z) - Accelerated Quasi-Newton Proximal Extragradient: Faster Rate for Smooth
Convex Optimization [26.328847475942894]
We prove that our method can achieve a convergence rate of $O\bigl(\min\bigl\{\frac{1}{k^{2}}, \frac{\sqrt{d\log k}}{k^{2.5}}\bigr\}\bigr)$, where $d$ is the problem dimension and $k$ is the number of iterations.
To the best of our knowledge, this result is the first to demonstrate a provable gain of a quasi-Newton-type method over Nesterov's accelerated gradient.
arXiv Detail & Related papers (2023-06-03T23:31:27Z) - On Underdamped Nesterov's Acceleration [6.53306151979817]
A high-resolution differential equation framework is tailor-made for Nesterov's accelerated gradient descent method.
New Lyapunov functions for the underdamped case are constructed.
The convergence rates are consistent with those obtained for the underdamped case with $r=-1$.
arXiv Detail & Related papers (2023-04-28T06:08:19Z) - Revisiting the acceleration phenomenon via high-resolution differential
equations [6.53306151979817]
Nesterov's accelerated gradient descent (NAG) is one of the milestones in the history of first-order algorithms.
We investigate NAG for the $\mu$-strongly convex function based on the techniques of Lyapunov analysis and phase-space representation.
We find that the high-resolution differential equation framework from the implicit-velocity scheme of NAG is perfect and outperforms the gradient-correction scheme.
arXiv Detail & Related papers (2022-12-12T04:36:37Z) - Generalization Bounds for Gradient Methods via Discrete and Continuous
Prior [8.76346911214414]
We show a new high-probability generalization bound of order $O(\frac{1}{n} + \frac{L^2}{n^2}\sum_{t=1}^{T}(\gamma_t/\varepsilon_t)^2)$ for Gradient Langevin Dynamics (GLD).
We can also obtain new bounds for certain variants of SGD.
arXiv Detail & Related papers (2022-05-27T07:23:01Z) - Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with
Variance Reduction and its Application to Optimization [50.83356836818667]
Gradient Langevin Dynamics is one of the most fundamental algorithms to solve non-convex optimization problems.
In this paper, we show two variants of this kind, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z) - Restarted Nonconvex Accelerated Gradient Descent: No More
Polylogarithmic Factor in the $O(\epsilon^{-7/4})$ Complexity [70.65867695317633]
We propose two simple accelerated gradient methods: the restarted accelerated gradient descent (AGD) method and the restarted heavy-ball (HB) method.
We establish that our methods achieve an $O(\frac{1}{\epsilon^{7/4}})$ number of gradient iterations.
Our algorithms are simple in the sense that they only consist of Nesterov's classical AGD or Polyak's HB, as well as a restart mechanism.
arXiv Detail & Related papers (2022-01-27T10:04:04Z) - A first-order primal-dual method with adaptivity to local smoothness [64.62056765216386]
We consider the problem of finding a saddle point for the convex-concave objective $\min_x \max_y f(x) + \langle Ax, y\rangle - g^*(y)$, where $f$ is a convex function with locally Lipschitz gradient and $g$ is convex and possibly non-smooth.
We propose an adaptive version of the Condat-Vu algorithm, which alternates between primal gradient steps and dual steps.
arXiv Detail & Related papers (2021-10-28T14:19:30Z) - A New Framework for Variance-Reduced Hamiltonian Monte Carlo [88.84622104944503]
We propose a new framework of variance-reduced Hamiltonian Monte Carlo (HMC) methods for sampling from an $L$-smooth and $m$-strongly log-concave distribution.
We show that HMC methods based on unbiased gradient estimators, including SAGA and SVRG, achieve the highest gradient efficiency with small batch sizes.
Experimental results on both synthetic and real-world benchmark data show that our new framework significantly outperforms the full gradient and stochastic gradient HMC approaches.
arXiv Detail & Related papers (2021-02-09T02:44:24Z) - A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
We show a proof of convergence for the Adam and Adagrad algorithms, with a rate of $O(d\ln(N)/\sqrt{N})$.
Adam converges at the same $O(d\ln(N)/\sqrt{N})$ rate when used with its default parameters.
arXiv Detail & Related papers (2020-03-05T01:56:17Z) - Differentially Quantized Gradient Methods [53.3186247068836]
We show that Differentially Quantized Gradient Descent (DQ-GD) attains a linear contraction factor of $\max\{\sigma_{\mathrm{GD}}, \rho_n 2^{-R}\}$.
No algorithm within a certain class can converge faster than $\max\{\sigma_{\mathrm{GD}}, 2^{-R}\}$.
arXiv Detail & Related papers (2020-02-06T20:40:53Z)