Towards Noise-adaptive, Problem-adaptive Stochastic Gradient Descent
- URL: http://arxiv.org/abs/2110.11442v1
- Date: Thu, 21 Oct 2021 19:22:14 GMT
- Title: Towards Noise-adaptive, Problem-adaptive Stochastic Gradient Descent
- Authors: Sharan Vaswani, Benjamin Dubois-Taine, Reza Babanezhad
- Abstract summary: We design step-size schemes that make stochastic gradient descent (SGD) adaptive to (i) the noise in the stochastic gradients and (ii) problem-dependent constants.
We prove that $T$ iterations of SGD with Nesterov acceleration and exponentially decreasing step-sizes achieve a near-optimal convergence rate.
Compared to other step-size schemes, we demonstrate the effectiveness of exponential step-sizes coupled with a novel variant of SLS.
- Score: 7.176107039687231
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We design step-size schemes that make stochastic gradient descent (SGD)
adaptive to (i) the noise $\sigma^2$ in the stochastic gradients and (ii)
problem-dependent constants. When minimizing smooth, strongly-convex functions
with condition number $\kappa$, we first prove that $T$ iterations of SGD with
Nesterov acceleration and exponentially decreasing step-sizes can achieve a
near-optimal $\tilde{O}(\exp(-T/\sqrt{\kappa}) + \sigma^2/T)$ convergence rate.
Under a relaxed assumption on the noise, with the same step-size scheme and
knowledge of the smoothness, we prove that SGD can achieve an
$\tilde{O}(\exp(-T/\kappa) + \sigma^2/T)$ rate. In order to be adaptive to the
smoothness, we use a stochastic line-search (SLS) and show (via upper and
lower-bounds) that SGD converges at the desired rate, but only to a
neighbourhood of the solution. Next, we use SGD with an offline estimate of the
smoothness and prove convergence to the minimizer. However, its convergence is
slowed down proportional to the estimation error and we prove a lower-bound
justifying this slowdown. Compared to other step-size schemes, we empirically
demonstrate the effectiveness of exponential step-sizes coupled with a novel
variant of SLS.
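As a rough illustration of the schemes discussed in the abstract, the sketch below combines Nesterov-style extrapolation with an exponentially decreasing step-size, and adds a backtracking Armijo line-search on a sampled mini-batch as a simplified stand-in for SLS. The decay factor, momentum parameter, and Armijo constants here are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

def sgd_exp_nesterov(grad_fn, x0, T, L_hat, beta=0.9, seed=0):
    """SGD with Nesterov-style extrapolation and exponentially decreasing step-sizes.

    grad_fn(x, rng) should return an unbiased stochastic gradient at x.
    L_hat is an estimate of the smoothness constant L; alpha and beta are
    illustrative choices, not the constants analysed in the paper.
    """
    rng = np.random.default_rng(seed)
    alpha = (1.0 / T) ** (1.0 / T)  # step-size decays from ~1/L_hat to ~1/(L_hat*T)
    x = np.asarray(x0, dtype=float)
    x_prev = x.copy()
    for k in range(T):
        eta_k = (1.0 / L_hat) * alpha ** k
        y = x + beta * (x - x_prev)          # Nesterov-style extrapolation point
        g = grad_fn(y, rng)
        x_prev, x = x, y - eta_k * g
    return x

def stochastic_armijo(f_batch, g_batch, x, eta_max=1.0, c=0.5, shrink=0.7):
    """Backtracking Armijo line-search on a sampled mini-batch loss.

    A simplified stand-in for the stochastic line-search (SLS) mentioned in
    the abstract: shrink the step-size until the sampled loss decreases enough.
    """
    g = g_batch(x)
    fx = f_batch(x)
    eta = eta_max
    while f_batch(x - eta * g) > fx - c * eta * np.dot(g, g) and eta > 1e-10:
        eta *= shrink
    return eta
```

When no reliable smoothness estimate is available, a routine like `stochastic_armijo` could be called at each iteration to set the step-size on the current mini-batch; in the paper the constants instead depend on quantities such as the condition number and the horizon $T$.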
Related papers
- Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems [56.86067111855056]
We consider clipped methods for stochastic optimization problems with heavy-tailed noise with structured density.
We show that it is possible to obtain faster rates of convergence than $\mathcal{O}(K^{-(\alpha - 1)/\alpha})$ when the stochastic gradients have finite moments of order $\alpha$.
We prove that the resulting estimates have negligible bias and controllable variance.
arXiv Detail & Related papers (2023-11-07T17:39:17Z) - Two Sides of One Coin: the Limits of Untuned SGD and the Power of
Adaptive Methods [22.052459124774504]
We show that adaptive methods alleviate untuned SGD's exponential dependence on the smoothness constant, without requiring extra problem information.
Our results provide theoretical justification for adaptive methods over untuned SGD in the absence of such exponential dependency.
arXiv Detail & Related papers (2023-05-21T14:40:43Z) - STORM+: Fully Adaptive SGD with Momentum for Nonconvex Optimization [74.1615979057429]
We investigate stochastic non-convex optimization problems where the objective is an expectation over smooth loss functions.
Our work builds on the STORM algorithm, in conjunction with a novel approach to adaptively set the learning rate and momentum parameters.
arXiv Detail & Related papers (2021-11-01T15:43:36Z) - A Variance Controlled Stochastic Method with Biased Estimation for
Faster Non-convex Optimization [0.0]
We propose a new technique, variance controlled stochastic gradient (VCSG), to improve the performance of stochastic variance reduced gradient (SVRG).
A parameter $\lambda$ is introduced in VCSG to avoid over-reducing the variance by SVRG.
VCSG requires $\mathcal{O}(\min\{1/\epsilon^{3/2}, n^{1/4}/\epsilon\})$ gradient evaluations, which improves the leading gradient complexity.
arXiv Detail & Related papers (2021-02-19T12:22:56Z) - On the Convergence of Step Decay Step-Size for Stochastic Optimization [27.02857082612736]
The convergence of stochastic gradient descent is highly dependent on the step-size, especially on non-convex problems such as neural network training.
We provide convergence guarantees for the step decay step-size in the non-convex regime, ensuring that the gradient norm vanishes.
In the strongly convex case, we establish an $\mathcal{O}(\ln T/T)$ rate, which we also prove to be near-optimal.
arXiv Detail & Related papers (2021-02-18T14:37:25Z) - Faster Convergence of Stochastic Gradient Langevin Dynamics for
Non-Log-Concave Sampling [110.88857917726276]
We provide a new convergence analysis of stochastic gradient Langevin dynamics (SGLD) for sampling from a class of distributions that can be non-log-concave.
At the core of our approach is a novel conductance analysis of SGLD using an auxiliary time-reversible Markov Chain.
arXiv Detail & Related papers (2020-10-19T15:23:18Z) - ROOT-SGD: Sharp Nonasymptotics and Near-Optimal Asymptotics in a Single Algorithm [71.13558000599839]
We study the problem of solving strongly convex and smooth unconstrained optimization problems using first-order algorithms.
We devise a novel algorithm, referred to as Recursive One-Over-T SGD (ROOT-SGD), based on an easily implementable, recursive averaging of past stochastic gradients.
We prove that it simultaneously achieves state-of-the-art performance in both a finite-sample, nonasymptotic sense and an asymptotic sense.
arXiv Detail & Related papers (2020-08-28T14:46:56Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic
Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence with the fact that a constant step-size learns faster in finite time, up to an error.
Rather than fixing the minibatch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z) - On the Almost Sure Convergence of Stochastic Gradient Descent in
Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD).
We show that SGD avoids strict saddle points/manifolds with probability $1$ for the step-size policies considered.
arXiv Detail & Related papers (2020-06-19T14:11:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.