On the Convergence of Step Decay Step-Size for Stochastic Optimization
- URL: http://arxiv.org/abs/2102.09393v1
- Date: Thu, 18 Feb 2021 14:37:25 GMT
- Title: On the Convergence of Step Decay Step-Size for Stochastic Optimization
- Authors: Xiaoyu Wang, Sindri Magnússon and Mikael Johansson
- Abstract summary: The convergence of stochastic gradient descent is highly dependent on the step-size, especially on non-convex problems such as neural network training.
We provide convergence results for step decay step-sizes in the non-convex regime, ensuring that the gradient norm vanishes at an $\mathcal{O}(\ln T/\sqrt{T})$ rate.
In the strongly convex case, we establish an $\mathcal{O}(\ln T/T)$ rate for smooth problems, which we also prove to be tight, and an $\mathcal{O}(\ln^2 T/T)$ rate without the smoothness assumption.
- Score: 27.02857082612736
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The convergence of stochastic gradient descent is highly dependent on the
step-size, especially on non-convex problems such as neural network training.
Step decay step-size schedules (constant and then cut) are widely used in
practice because of their excellent convergence and generalization qualities,
but their theoretical properties are not yet well understood. We provide the
convergence results for step decay in the non-convex regime, ensuring that the
gradient norm vanishes at an $\mathcal{O}(\ln T/\sqrt{T})$ rate. We also
provide the convergence guarantees for general (possibly non-smooth) convex
problems, ensuring an $\mathcal{O}(\ln T/\sqrt{T})$ convergence rate. Finally,
in the strongly convex case, we establish an $\mathcal{O}(\ln T/T)$ rate for
smooth problems, which we also prove to be tight, and an $\mathcal{O}(\ln^2 T
/T)$ rate without the smoothness assumption. We illustrate the practical
efficiency of the step decay step-size in several large scale deep neural
network training tasks.
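The "constant and then cut" schedule the abstract refers to can be made concrete with a short sketch. The NumPy snippet below is only illustrative: the function name `step_decay_sgd`, the number of stages, and the decay factor are assumptions made for this example, not the paper's exact choices. It holds the step-size fixed within each stage and divides it by a constant factor at every stage boundary.

```python
import numpy as np

def step_decay_sgd(grad, x0, T, eta0=0.1, num_stages=5, decay=2.0, rng=None):
    """SGD with a step decay ("constant and then cut") step-size schedule.

    The horizon T is split into num_stages stages of equal length; the
    step-size is held constant within a stage and divided by `decay`
    at every stage boundary. Illustrative sketch, not the paper's tuning.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    stage_len = max(1, T // num_stages)
    eta = eta0
    for t in range(1, T + 1):
        g = grad(x, rng)          # stochastic gradient oracle
        x -= eta * g              # SGD step with the current constant step-size
        if t % stage_len == 0:    # end of a stage: cut the step-size
            eta /= decay
    return x

# Toy usage: noisy gradients of the strongly convex quadratic f(x) = 0.5 * ||x||^2.
if __name__ == "__main__":
    noisy_grad = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
    x_final = step_decay_sgd(noisy_grad, x0=np.ones(10), T=10_000)
    print("final squared distance to the optimum:", float(x_final @ x_final))
```

In the paper's analysis the stage lengths and the decay factor are chosen as functions of the horizon $T$ to obtain the stated $\ln T$-dependent rates; the fixed constants above are placeholders for readability.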
Related papers
- Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency [47.8739414267201]
We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data.
We show that GD exits this initial oscillatory phase rapidly -- in $\mathcal{O}(\eta)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1/(\eta t))$ convergence rate.
Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\tilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize.
arXiv Detail & Related papers (2024-02-24T23:10:28Z) - Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems [56.86067111855056]
We consider clipped stochastic optimization under heavy-tailed noise with structured density.
We show that it is possible to get faster rates of convergence than $\mathcal{O}(K^{-(\alpha-1)/\alpha})$ when the gradients have finite moments of order $\alpha$.
We prove that the resulting estimates have negligible bias and controllable variance.
arXiv Detail & Related papers (2023-11-07T17:39:17Z) - Optimal Extragradient-Based Bilinearly-Coupled Saddle-Point Optimization [116.89941263390769]
We consider the smooth convex-concave bilinearly-coupled saddle-point problem, $\min_{\mathbf{x}}\max_{\mathbf{y}} F(\mathbf{x}) + H(\mathbf{x},\mathbf{y}) - G(\mathbf{y})$, where one has access to first-order oracles for $F$, $G$ as well as the bilinear coupling function $H$.
We present an accelerated gradient-extragradient (AG-EG) descent-ascent algorithm that combines extragradient steps with acceleration.
arXiv Detail & Related papers (2022-06-17T06:10:20Z) - Towards Noise-adaptive, Problem-adaptive Stochastic Gradient Descent [7.176107039687231]
We design step-size schemes that make stochastic gradient descent (SGD) adaptive to the noise.
We prove that $T$ iterations of SGD with Nesterov acceleration can be near optimal.
Compared to other step-size schemes, we demonstrate the effectiveness of a novel exponential step-size scheme.
arXiv Detail & Related papers (2021-10-21T19:22:14Z) - Last iterate convergence of SGD for Least-Squares in the Interpolation
regime [19.05750582096579]
We study the noiseless model in the fundamental least-squares setup.
We assume that an optimum predictor perfectly fits inputs and outputs, $\langle \theta_*, \phi(X) \rangle = Y$, where $\phi(X)$ stands for a possibly infinite-dimensional non-linear feature map.
arXiv Detail & Related papers (2021-02-05T14:02:20Z) - Hybrid Stochastic-Deterministic Minibatch Proximal Gradient:
Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
We show that HSDMPG can attain an $\mathcal{O}\big(1/\sqrt{tn}\big)$ bound, which is at the order of the excess error of a learning model.
For general loss functions, we prove that HSDMPG attains a bound of the same order.
arXiv Detail & Related papers (2020-09-18T02:18:44Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic
Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence against the fact that a constant step-size learns faster in time, but only up to an error neighborhood.
Rather than fixing the minibatch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z) - On the Almost Sure Convergence of Stochastic Gradient Descent in
Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD).
We show that SGD avoids strict saddle points/manifolds with probability $1$ for the step-size policies considered.
arXiv Detail & Related papers (2020-06-19T14:11:26Z) - S-ADDOPT: Decentralized stochastic first-order optimization over
directed graphs [16.96562173221624]
Decentralized stochastic convex optimization is considered, minimizing a sum of smooth and strongly convex cost functions distributed over the nodes of a directed network.
In particular, we propose the S-ADDOPT algorithm that assumes a stochastic first-order oracle at each node.
For decaying step-sizes $\mathcal{O}(1/k)$, we show that S-ADDOPT reaches the exact solution sublinearly at $\mathcal{O}(1/k)$ and that its convergence is network-independent.
arXiv Detail & Related papers (2020-05-15T21:14:22Z)