Open Problem: Anytime Convergence Rate of Gradient Descent
- URL: http://arxiv.org/abs/2406.13888v1
- Date: Wed, 19 Jun 2024 23:34:47 GMT
- Title: Open Problem: Anytime Convergence Rate of Gradient Descent
- Authors: Guy Kornowski, Ohad Shamir
- Abstract summary: Recent results show that vanilla gradient descent can be accelerated for smooth convex objectives, merely by changing the stepsize sequence.
We ask: Is there any stepsize schedule for gradient descent that accelerates the classic $\mathcal{O}(1/T)$ convergence rate, at \emph{any} stopping time $T$?
- Score: 37.300102993926046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent results show that vanilla gradient descent can be accelerated for smooth convex objectives, merely by changing the stepsize sequence. We show that this can lead to surprisingly large errors indefinitely, and therefore ask: Is there any stepsize schedule for gradient descent that accelerates the classic $\mathcal{O}(1/T)$ convergence rate, at \emph{any} stopping time $T$?
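For reference, here is a minimal sketch of the baseline in question, assuming a toy quadratic objective: vanilla gradient descent with the textbook constant stepsize $1/L$, whose suboptimality satisfies the classic $\mathcal{O}(1/T)$ guarantee at every stopping time $T$.
```python
import numpy as np

# Vanilla gradient descent with the textbook constant stepsize 1/L on a toy
# smooth convex quadratic f(x) = 0.5 * x^T H x (minimum f* = 0 at x* = 0).
# The classic guarantee is f(x_T) - f* <= L * ||x_0 - x*||^2 / (2T) for every
# stopping time T; the open problem asks for a schedule that beats this rate
# uniformly in T.
rng = np.random.default_rng(0)
Q = rng.normal(size=(20, 20))
H = Q.T @ Q                              # positive semidefinite Hessian
L = np.linalg.eigvalsh(H).max()          # smoothness constant

x = rng.normal(size=20)
r2 = x @ x                               # ||x_0 - x*||^2
for T in range(1, 1001):
    x -= (1.0 / L) * (H @ x)             # gradient step with stepsize 1/L
    if T in (10, 100, 1000):
        print(T, 0.5 * x @ H @ x, L * r2 / (2 * T))   # actual vs O(1/T) bound
```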
Related papers
- Anytime Acceleration of Gradient Descent [92.02082223856479]
We propose a stepsize schedule that allows gradient descent to achieve convergence guarantees of $O(T^{-1.03})$ for any stopping time $T$.
We extend our theory to yield convergence guarantees of $\exp(-\Omega(T/\kappa^{0.97}))$ for smooth and strongly convex optimization.
arXiv Detail & Related papers (2024-11-26T18:29:44Z)
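The schedule constructed in that paper is intricate and is not reproduced here; the sketch below only illustrates the *anytime* criterion itself: a harness that records the suboptimality at every stopping time $T$, with the plain $1/L$ schedule standing in as a placeholder.
```python
import numpy as np

# Anytime-evaluation harness: run GD under a given stepsize schedule and
# record f(x_T) - f* at EVERY stopping time T, since an anytime guarantee
# must hold uniformly in T rather than only at a fixed horizon. The constant
# 1/L schedule below is a placeholder, not the paper's accelerated schedule.
def gd_anytime(grad, f, x0, stepsizes):
    x, errs = x0.copy(), []
    for eta in stepsizes:
        x = x - eta * grad(x)
        errs.append(f(x))                # suboptimality if stopped right here
    return np.array(errs)

rng = np.random.default_rng(0)
Q = rng.normal(size=(20, 20))
H = Q.T @ Q
L = np.linalg.eigvalsh(H).max()
errs = gd_anytime(lambda x: H @ x, lambda x: 0.5 * x @ H @ x,
                  rng.normal(size=20), [1.0 / L] * 1000)
print(errs[[9, 99, 999]])                # must decay at every T, not just the last
```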
- Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency [47.8739414267201]
We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data.
The loss initially oscillates, but GD exits this oscillatory phase rapidly -- in $\mathcal{O}(\eta)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1/(\eta t))$ convergence rate.
Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\tilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize.
arXiv Detail & Related papers (2024-02-24T23:10:28Z)
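A small sketch of the reported phenomenon on a toy separable dataset (the data, stepsize, and iteration counts here are illustrative, not the paper's): with an aggressively large constant stepsize, the logistic loss is non-monotone early on and decays afterwards.
```python
import numpy as np

# GD with a large constant stepsize on the logistic loss over linearly
# separable toy data: the loss is non-monotone early on, then decays.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.5, 0.5, size=(50, 2)),
               rng.normal(-1.5, 0.5, size=(50, 2))])
y = np.r_[np.ones(50), -np.ones(50)]     # linearly separable labels

def loss_and_grad(w):
    m = y * (X @ w)                              # margins
    loss = np.mean(np.logaddexp(0.0, -m))        # mean log(1 + exp(-m))
    s = np.exp(-np.logaddexp(0.0, m))            # numerically stable 1/(1+e^m)
    return loss, -(X.T @ (y * s)) / len(y)

w, eta = np.zeros(2), 20.0               # aggressive constant stepsize
for t in range(1, 201):
    loss, g = loss_and_grad(w)
    if t <= 5 or t % 50 == 0:
        print(t, loss)                   # early values may go UP before falling
    w -= eta * g
```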
- Provably Faster Gradient Descent via Long Steps [0.0]
We show that long steps, which may increase the objective value in the short term, lead to provably faster convergence in the long term.
A conjecture towards proving a faster $O(1/(T\log T))$ rate for gradient descent is also motivated, along with simple numerical validation.
arXiv Detail & Related papers (2023-07-12T17:41:07Z)
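A hedged illustration of the long-step idea on a toy quadratic; the pattern below (two short steps, then one step above the classical $2/L$ threshold) uses illustrative constants, not the certified patterns from the paper.
```python
import numpy as np

# A "long step" schedule on a toy quadratic: two conservative steps followed
# by one step above the classical 2/L threshold. The long step can increase
# the objective, yet the three-step cycle still contracts every eigenmode.
rng = np.random.default_rng(0)
Q = rng.normal(size=(20, 20))
H = Q.T @ Q
L = np.linalg.eigvalsh(H).max()
pattern = [1.4 / L, 1.4 / L, 3.0 / L]    # illustrative constants only

x = rng.normal(size=20)
prev = 0.5 * x @ H @ x
for t in range(300):
    x -= pattern[t % 3] * (H @ x)
    f = 0.5 * x @ H @ x
    if t < 12 and f > prev:
        print(f"step {t}: objective went UP ({prev:.4f} -> {f:.4f})")
    prev = f
print("final suboptimality:", prev)      # converges despite the overshoots
```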
- Gradient Norm Minimization of Nesterov Acceleration: $o(1/k^3)$ [6.53306151979817]
Nesterov's accelerated gradient descent (NAG) is one of the milestones of first-order algorithms.
In this paper, we provide a significantly simplified proof based on a precise observation and a tighter inequality for $L$-smooth functions.
arXiv Detail & Related papers (2022-09-19T09:10:35Z)
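For concreteness, a standard NAG implementation on a toy $L$-smooth quadratic that tracks the gradient norm along the iterates; the paper's $o(1/k^3)$ statement concerns the squared gradient norm, and this sketch only observes the decay numerically.
```python
import numpy as np

# Standard Nesterov accelerated gradient (NAG) on a toy L-smooth quadratic,
# tracking the gradient norm along the iterates; the paper studies how fast
# this quantity (squared) can decay, namely o(1/k^3).
rng = np.random.default_rng(0)
Q = rng.normal(size=(20, 20))
H = Q.T @ Q
L = np.linalg.eigvalsh(H).max()

x = y = rng.normal(size=20)
for k in range(1, 1001):
    x_new = y - (1.0 / L) * (H @ y)                # gradient step at lookahead y
    y = x_new + ((k - 1) / (k + 2)) * (x_new - x)  # momentum extrapolation
    x = x_new
    if k in (10, 100, 1000):
        print(k, np.linalg.norm(H @ x))            # gradient norm decays quickly
```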
- High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails [55.561406656549686]
We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails.
We show that a combination of gradient clipping, momentum, and normalized gradient descent converges to critical points in high probability with best-known iteration complexity for smooth losses.
arXiv Detail & Related papers (2021-06-28T00:17:01Z)
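A minimal sketch of the three named ingredients (gradient clipping, momentum, and a normalized step) on a toy problem with heavy-tailed gradient noise; the tuning is illustrative, not the paper's.
```python
import numpy as np

# Sketch of the three named ingredients -- gradient clipping, momentum, and a
# normalized step -- on a toy problem whose gradient noise is heavy-tailed
# (Student-t with 2 degrees of freedom has infinite variance).
rng = np.random.default_rng(0)
x_star = np.ones(10)
def stoch_grad(x):
    return (x - x_star) + rng.standard_t(df=2, size=x.size)

x, m = np.zeros(10), np.zeros(10)
beta, eta, clip = 0.9, 0.05, 5.0         # illustrative tuning only
for t in range(5000):
    g = stoch_grad(x)
    g *= min(1.0, clip / (np.linalg.norm(g) + 1e-12))   # clip the estimate
    m = beta * m + (1 - beta) * g                       # momentum average
    x = x - eta * m / (np.linalg.norm(m) + 1e-12)       # normalized step
print(np.linalg.norm(x - x_star))        # ends up near the optimum
```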
- Accelerated Gradient Tracking over Time-varying Graphs for Decentralized Optimization [59.65871549878937]
We prove that the practical single loop accelerated gradient tracking needs $O\big((\frac{\gamma}{1-\sigma_\gamma})^2\sqrt{\frac{L}{\epsilon}}\big)$ iterations.
Our convergence rates improve significantly over the ones of $O(\frac{1}{\epsilon^{5/7}})$ and $O\big((\frac{L}{\mu})^{5/7}\frac{1}{(1-\sigma)^{1.5}}\log\frac{1}{\epsilon}\big)$.
arXiv Detail & Related papers (2021-04-06T15:34:14Z)
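The accelerated single-loop method itself is involved; below is a minimal non-accelerated gradient-tracking sketch on a fixed ring graph (the paper handles time-varying graphs), showing the two coupled updates the method builds on.
```python
import numpy as np

# Minimal (non-accelerated) gradient tracking on a fixed ring of n agents,
# each holding a local quadratic f_i(x) = 0.5 * (x - b_i)^2; the global
# optimum of (1/n) * sum_i f_i is mean(b). The paper accelerates this
# template and handles time-varying graphs.
n, T, eta = 5, 200, 0.1
rng = np.random.default_rng(0)
b = rng.normal(size=n)

W = np.zeros((n, n))                     # doubly stochastic ring mixing matrix
for i in range(n):
    W[i, [i - 1, i, (i + 1) % n]] = 1 / 3

grad = lambda x: x - b                   # stacked local gradients
x = np.zeros(n)                          # one scalar iterate per agent
y = grad(x)                              # tracker of the average gradient

for _ in range(T):
    x_new = W @ x - eta * y              # consensus step + tracked descent
    y = W @ y + grad(x_new) - grad(x)    # update the gradient tracker
    x = x_new

print(np.abs(x - b.mean()).max())        # all agents agree on the optimum
```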
- On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD).
We show that SGD avoids strict saddle points/manifolds with probability $1$ under standard step-size policies.
arXiv Detail & Related papers (2020-06-19T14:11:26Z)
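A toy illustration of the avoidance claim, assuming the test function $f(x,y)=x^2-y^2+y^4$, which has a strict saddle at the origin: SGD started exactly at the saddle is pushed off by the gradient noise and settles at a minimum.
```python
import numpy as np

# f(x, y) = x^2 - y^2 + y^4 has a strict saddle at the origin and global
# minima at (0, +-1/sqrt(2)). Deterministic GD started at the saddle never
# moves; the noise in SGD pushes it off and it settles at a minimum.
rng = np.random.default_rng(1)
grad = lambda p: np.array([2 * p[0], -2 * p[1] + 4 * p[1] ** 3])

p = np.zeros(2)                          # start exactly at the saddle
for t in range(1, 2001):
    g = grad(p) + 0.01 * rng.normal(size=2)   # noisy gradient oracle
    p -= (0.1 / t ** 0.6) * g                 # vanishing step-size policy

print(p)                                 # y is near +-0.707, x is near 0
```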
- Carathéodory Sampling for Stochastic Gradient Descent [79.55586575988292]
We present an approach that is inspired by classical results of Tchakaloff and Carathéodory about measure reduction.
We adaptively select the descent steps where the measure reduction is carried out.
We combine this with Block Coordinate Descent so that measure reduction can be done very cheaply.
arXiv Detail & Related papers (2020-06-02T17:52:59Z)
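A minimal sketch of the Carathéodory-style measure reduction underlying the approach: it shrinks $N$ weighted points in $\mathbb{R}^d$ to at most $d+1$ while preserving total mass and the weighted mean. The recombination here matches only first moments and ignores the paper's Block Coordinate Descent coupling.
```python
import numpy as np

# Caratheodory-style measure reduction: repeatedly remove an atom using a
# null-space direction of the moment matrix, preserving total mass and the
# weighted mean, until at most d + 1 atoms carry positive weight.
def caratheodory_reduce(points, weights):
    points, weights = np.asarray(points, float), np.array(weights, float)
    d = points.shape[1]
    while np.count_nonzero(weights > 0) > d + 1:
        idx = np.flatnonzero(weights > 0)
        A = np.hstack([np.ones((len(idx), 1)), points[idx]])   # [1, x] rows
        v = np.linalg.svd(A.T)[2][-1]    # A.T @ v = 0; exists since |idx| > d+1
        # The ones-column forces sum(v) = 0, so v has entries of both signs.
        pos = v > 1e-12
        t = np.min(weights[idx][pos] / v[pos])   # largest feasible shift
        weights[idx] -= t * v            # zeroes out at least one atom
        weights[weights < 1e-12] = 0.0
    return weights

rng = np.random.default_rng(2)
X, w = rng.normal(size=(50, 3)), np.full(50, 1 / 50)
w2 = caratheodory_reduce(X, w)
print(np.count_nonzero(w2), np.allclose(w @ X, w2 @ X))   # <= 4, True
```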
- Learning Unitaries by Gradient Descent [12.354076490479516]
We learn unitary transformations in $U(d)$ via gradient descent on the time parameters of alternating operator sequences.
With fewer than $d^2$ parameters, gradient descent converges to a sub-optimal solution, whereas with more than $d^2$ parameters, gradient descent converges to an optimal solution.
arXiv Detail & Related papers (2020-01-31T15:20:55Z)
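A small numerical sketch of this setup, assuming two random Hermitian generators and a finite-difference gradient (the paper's exact parametrization may differ): gradient descent on the time parameters of an alternating sequence, with more than $d^2$ parameters.
```python
import numpy as np
from scipy.linalg import expm

# Learning a target unitary in U(d) by gradient descent on the time
# parameters of an alternating sequence exp(iA t1) exp(iB t2) ..., using a
# finite-difference gradient; generators, target, and tuning are toy choices.
d, n_params = 2, 6                       # n_params > d**2 = 4 (overparametrized)
rng = np.random.default_rng(3)

def rand_herm(d):
    M = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    return (M + M.conj().T) / 2

A, B = rand_herm(d), rand_herm(d)        # two fixed alternating generators
target = expm(1j * rand_herm(d))         # a random target unitary

def circuit(ts):
    U = np.eye(d, dtype=complex)
    for k, t in enumerate(ts):
        U = expm(1j * (A if k % 2 == 0 else B) * t) @ U
    return U

def loss(ts):                            # phase-invariant distance in [0, 1]
    return 1 - abs(np.trace(target.conj().T @ circuit(ts))) / d

ts = rng.normal(size=n_params)
for _ in range(500):
    g = np.array([(loss(ts + 1e-5 * e) - loss(ts)) / 1e-5
                  for e in np.eye(n_params)])
    ts -= 0.5 * g                        # plain gradient descent on the times

print(loss(ts))                          # typically near 0 when n_params > d**2
```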