Towards Noise-adaptive, Problem-adaptive Stochastic Gradient Descent
- URL: http://arxiv.org/abs/2110.11442v1
- Date: Thu, 21 Oct 2021 19:22:14 GMT
- Title: Towards Noise-adaptive, Problem-adaptive Stochastic Gradient Descent
- Authors: Sharan Vaswani, Benjamin Dubois-Taine, Reza Babanezhad
- Abstract summary: We design step-size schemes that make stochastic gradient descent (SGD) adaptive to (i) the noise in the stochastic gradients and (ii) problem-dependent constants.
We prove that $T$ iterations of SGD with Nesterov acceleration and exponentially decreasing step-sizes can achieve a near-optimal convergence rate.
Compared to other step-size schemes, we demonstrate the effectiveness of a novel exponential step-size scheme.
- Score: 7.176107039687231
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We design step-size schemes that make stochastic gradient descent (SGD)
adaptive to (i) the noise $\sigma^2$ in the stochastic gradients and (ii)
problem-dependent constants. When minimizing smooth, strongly-convex functions
with condition number $\kappa$, we first prove that $T$ iterations of SGD with
Nesterov acceleration and exponentially decreasing step-sizes can achieve a
near-optimal $\tilde{O}(\exp(-T/\sqrt{\kappa}) + \sigma^2/T)$ convergence rate.
Under a relaxed assumption on the noise, with the same step-size scheme and
knowledge of the smoothness, we prove that SGD can achieve an
$\tilde{O}(\exp(-T/\kappa) + \sigma^2/T)$ rate. In order to be adaptive to the
smoothness, we use a stochastic line-search (SLS) and show (via upper and
lower-bounds) that SGD converges at the desired rate, but only to a
neighbourhood of the solution. Next, we use SGD with an offline estimate of the
smoothness and prove convergence to the minimizer. However, its convergence is
slowed down proportional to the estimation error and we prove a lower-bound
justifying this slowdown. Compared to other step-size schemes, we empirically
demonstrate the effectiveness of exponential step-sizes coupled with a novel
variant of SLS.
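The exponentially decreasing step-size scheme at the heart of the abstract can be sketched in a few lines. This is a minimal illustration under assumptions, not the authors' code: the schedule $\eta_t = \eta_0 \alpha^t$ with $\alpha = (\beta/T)^{1/T}$ is one standard way to realize an exponential decay from $\eta_0$ down to roughly $\eta_0 \beta / T$ over $T$ iterations; the toy quadratic, parameter names, and constants are all illustrative.

```python
import numpy as np

def sgd_exp_stepsize(grad_fn, x0, T, eta0=1.0, beta=1.0, rng=None):
    """SGD with exponentially decreasing step-sizes.

    Schedule: eta_t = eta0 * alpha**t with alpha = (beta / T)**(1 / T),
    so the step-size decays smoothly from eta0 to roughly eta0 * beta / T.
    `grad_fn(x, rng)` returns a stochastic gradient at x.
    """
    rng = rng or np.random.default_rng(0)
    alpha = (beta / T) ** (1.0 / T)
    x = np.asarray(x0, dtype=float)
    for t in range(T):
        eta_t = eta0 * alpha ** t
        x = x - eta_t * grad_fn(x, rng)
    return x

# Toy problem: minimize 0.5 * ||x||^2 with additive Gaussian gradient noise.
def noisy_grad(x, rng, sigma=0.1):
    return x + sigma * rng.standard_normal(x.shape)

x_final = sgd_exp_stepsize(noisy_grad, x0=np.ones(5), T=500, eta0=0.5)
```

The large early step-sizes give fast initial progress, while the geometric decay attenuates the noise term, mirroring the $\exp(\cdot) + \sigma^2/T$ split in the rates above.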
Related papers
- Convergence Analysis of Adaptive Gradient Methods under Refined Smoothness and Noise Assumptions [18.47705532817026]
Adaptive gradient methods are arguably the most successful optimization algorithms for neural networks.
We show that adaptive gradient methods can potentially improve over SGD by a factor that depends on the problem's $\ell_\infty$ geometry.
arXiv Detail & Related papers (2024-06-07T02:55:57Z) - Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems [56.86067111855056]
We consider clipped optimization problems with heavy-tailed noise with structured density.
We show that it is possible to obtain faster rates of convergence than $\mathcal{O}(K^{-(\alpha - 1)/\alpha})$ when the gradients have finite moments of order $\alpha$.
We prove that the resulting estimates have negligible bias and controllable variance.
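Gradient clipping is the standard mechanism behind such heavy-tailed results: bounding the norm of each stochastic gradient keeps rare, huge noise realizations from derailing the iterates. The sketch below is a generic clipped-SGD illustration under assumed parameters, not the paper's exact method; the Student-t noise is an assumed stand-in for heavy tails.

```python
import numpy as np

def clip(g, lam):
    """Project gradient g onto the ball of radius lam (leave it unchanged inside)."""
    norm = np.linalg.norm(g)
    return g if norm <= lam else (lam / norm) * g

def clipped_sgd(grad_fn, x0, T, eta=0.05, lam=2.0, rng=None):
    """SGD with per-step gradient clipping, a standard remedy when the
    stochastic gradient noise is heavy-tailed (its variance may not exist)."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        x = x - eta * clip(grad_fn(x, rng), lam)
    return x

# Toy quadratic with heavy-tailed (Student-t, df=2) gradient noise.
def heavy_tailed_grad(x, rng):
    return x + rng.standard_t(df=2, size=x.shape)

x_final = clipped_sgd(heavy_tailed_grad, x0=5.0 * np.ones(3), T=2000)
```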
arXiv Detail & Related papers (2023-11-07T17:39:17Z) - Two Sides of One Coin: the Limits of Untuned SGD and the Power of
Adaptive Methods [22.052459124774504]
We show that untuned SGD can suffer an exponential dependence on the smoothness constant, an issue that adaptive methods alleviate thanks to their information advantage.
Our results provide theoretical justification for adaptive methods over untuned SGD in the absence of such exponential dependency.
arXiv Detail & Related papers (2023-05-21T14:40:43Z) - STORM+: Fully Adaptive SGD with Momentum for Nonconvex Optimization [74.1615979057429]
We investigate stochastic optimization problems where the objective is an expectation over smooth loss functions.
Our work builds on the STORM algorithm, in conjunction with a novel approach to adaptively set the learning rate and momentum parameters.
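A minimal sketch of the STORM recursive momentum estimator that STORM+ builds on: $d_t = g(x_t; \xi_t) + (1 - a)(d_{t-1} - g(x_{t-1}; \xi_t))$, where the same sample $\xi_t$ is evaluated at both points, which is what reduces the variance of $d_t$. STORM+ additionally adapts the learning rate and momentum parameter online; here both are fixed constants, and the toy problem is an assumption.

```python
import numpy as np

def storm(grad_fn, x0, T, eta=0.05, a=0.1, rng=None):
    """STORM-style recursive momentum: the correction term reuses the
    SAME random sample at x_t and x_{t-1} (here realized by reusing a seed),
    so the noise largely cancels in the difference."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    seed = rng.integers(1 << 30)
    d = grad_fn(x, np.random.default_rng(seed))   # initial gradient estimate
    for _ in range(T):
        x_prev = x.copy()
        x = x - eta * d
        seed = rng.integers(1 << 30)              # fresh sample xi_t ...
        g_new = grad_fn(x, np.random.default_rng(seed))
        g_old = grad_fn(x_prev, np.random.default_rng(seed))  # ... reused at x_{t-1}
        d = g_new + (1.0 - a) * (d - g_old)
    return x

def noisy_grad(x, rng, sigma=0.5):
    return x + sigma * rng.standard_normal(x.shape)

x_final = storm(noisy_grad, x0=np.ones(4), T=1000)
```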
arXiv Detail & Related papers (2021-11-01T15:43:36Z) - On the Convergence of Step Decay Step-Size for Stochastic Optimization [27.02857082612736]
The convergence of stochastic gradient descent is highly dependent on the step-size, especially on non-convex problems such as neural network training.
We establish the convergence of step-decay step-sizes in the non-convex regime, ensuring the gradient norm vanishes.
In the strongly convex case, we establish an $\mathcal{O}(\ln T / T)$ rate, which we also prove to be near-optimal.
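The step-decay schedule analyzed above can be written down directly: hold the step-size constant within each stage and shrink it geometrically between stages. The stage count and decay factor below are illustrative assumptions, not the paper's choices.

```python
def step_decay_stepsizes(eta0, T, num_stages=5, decay=0.5):
    """Step-decay schedule: constant step-size within each of `num_stages`
    equal-length stages, multiplied by `decay` when a stage ends."""
    stage_len = T // num_stages
    etas = []
    for t in range(T):
        stage = min(t // stage_len, num_stages - 1)
        etas.append(eta0 * decay ** stage)
    return etas

etas = step_decay_stepsizes(eta0=0.4, T=100)
```

In contrast to a smooth exponential decay, the step-size here is piecewise constant, which is closer to the "divide the learning rate every few epochs" heuristic common in deep learning practice.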
arXiv Detail & Related papers (2021-02-18T14:37:25Z) - A Momentum-Assisted Single-Timescale Stochastic Approximation Algorithm
for Bilevel Optimization [112.59170319105971]
We propose a new algorithm -- the Momentum-assisted Single-Timescale Stochastic Approximation (MSTSA) algorithm -- for tackling bilevel optimization problems.
MSTSA allows us to control the error in its iterations caused by inaccurate solutions to the lower-level subproblem.
arXiv Detail & Related papers (2021-02-15T07:10:33Z) - Faster Convergence of Stochastic Gradient Langevin Dynamics for
Non-Log-Concave Sampling [110.88857917726276]
We provide a new convergence analysis of stochastic gradient Langevin dynamics (SGLD) for sampling from a class of distributions that can be non-log-concave.
At the core of our approach is a novel conductance analysis of SGLD using an auxiliary time-reversible Markov Chain.
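The SGLD update analyzed above is a stochastic gradient step on the log-density plus injected Gaussian noise scaled by $\sqrt{2\eta}$; for small $\eta$ the iterates approximately sample from the target. Below is a minimal sketch on an assumed one-dimensional Gaussian target with noisy gradients; all constants are illustrative.

```python
import numpy as np

def sgld(grad_log_density, x0, T, eta=0.05, rng=None):
    """Stochastic gradient Langevin dynamics:
    x <- x + eta * grad log p(x) + sqrt(2 * eta) * N(0, I)."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(T):
        noise = rng.standard_normal(x.shape)
        x = x + eta * grad_log_density(x, rng) + np.sqrt(2.0 * eta) * noise
        samples.append(x.copy())
    return np.array(samples)

# Target: standard Gaussian, grad log p(x) = -x (observed with noise here).
def noisy_grad_log_p(x, rng, sigma=0.1):
    return -x + sigma * rng.standard_normal(x.shape)

samples = sgld(noisy_grad_log_p, x0=np.zeros(1), T=20000)
```

For this target the chain's empirical mean and standard deviation should land near 0 and 1, up to discretization bias of order $\eta$.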
arXiv Detail & Related papers (2020-10-19T15:23:18Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic
Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence against the fact that a constant step-size learns faster, albeit only up to an error neighbourhood.
Rather than fixing the minibatch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
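One simple way to realize this trade-off is to keep the step-size constant and grow the minibatch over time, so the averaged gradient noise shrinks as training progresses. The geometric growth rule below is an illustrative assumption, not the paper's adaptive scheme, and the toy problem is invented for the sketch.

```python
import numpy as np

def sgd_growing_batch(grad_sample_fn, x0, T, eta=0.2, b0=1, growth=1.05, rng=None):
    """Constant step-size SGD with a geometrically growing minibatch:
    averaging more samples per step attenuates the gradient noise,
    playing the role that a decaying step-size usually plays."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    batch = float(b0)
    for _ in range(T):
        b = int(round(batch))
        grads = [grad_sample_fn(x, rng) for _ in range(b)]
        x = x - eta * np.mean(grads, axis=0)  # averaged minibatch gradient
        batch *= growth                       # grow the batch geometrically
    return x

def noisy_grad(x, rng, sigma=1.0):
    return x + sigma * rng.standard_normal(x.shape)

x_final = sgd_growing_batch(noisy_grad, x0=3.0 * np.ones(2), T=150)
```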
arXiv Detail & Related papers (2020-07-02T16:02:02Z) - On the Almost Sure Convergence of Stochastic Gradient Descent in
Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD).
We show that SGD avoids strict saddle points/manifolds with probability $1$ under a broad range of step-size policies.
arXiv Detail & Related papers (2020-06-19T14:11:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.