Dynamics of Stochastic Momentum Methods on Large-scale, Quadratic Models
- URL: http://arxiv.org/abs/2106.03696v1
- Date: Mon, 7 Jun 2021 15:08:24 GMT
- Title: Dynamics of Stochastic Momentum Methods on Large-scale, Quadratic Models
- Authors: Courtney Paquette, Elliot Paquette
- Abstract summary: We analyze a class of gradient algorithms with momentum on a high-dimensional random least squares problem.
We show that (small-batch) momentum with a fixed momentum parameter provides no actual performance improvement over SGD when step sizes are adjusted correctly.
In the non-strongly convex setting, it is possible to get a large improvement over SGD using momentum.
- Score: 0.2741266294612776
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We analyze a class of stochastic gradient algorithms with momentum on a
high-dimensional random least squares problem. Our framework, inspired by
random matrix theory, provides an exact (deterministic) characterization for
the sequence of loss values produced by these algorithms which is expressed
only in terms of the eigenvalues of the Hessian. This leads to simple
expressions for nearly-optimal hyperparameters, a description of the limiting
neighborhood, and average-case complexity.
As a consequence, we show that (small-batch) stochastic heavy-ball momentum
with a fixed momentum parameter provides no actual performance improvement over
SGD when step sizes are adjusted correctly. For contrast, in the non-strongly
convex setting, it is possible to get a large improvement over SGD using
momentum. By introducing hyperparameters that depend on the number of samples,
we propose a new algorithm sDANA (stochastic dimension adjusted Nesterov
acceleration) which obtains an asymptotically optimal average-case complexity
while remaining linearly convergent in the strongly convex setting without
adjusting parameters.
Related papers
- Constrained Optimization via Exact Augmented Lagrangian and Randomized
Iterative Sketching [55.28394191394675]
We develop an adaptive inexact Newton method for equality-constrained nonlinear, nonIBS optimization problems.
We demonstrate the superior performance of our method on benchmark nonlinear problems, constrained logistic regression with data from LVM, and a PDE-constrained problem.
arXiv Detail & Related papers (2023-05-28T06:33:37Z) - Formal guarantees for heuristic optimization algorithms used in machine
learning [6.978625807687497]
Gradient Descent (SGD) and its variants have become the dominant methods in the large-scale optimization machine learning (ML) problems.
We provide formal guarantees of a few convex optimization methods and proposing improved algorithms.
arXiv Detail & Related papers (2022-07-31T19:41:22Z) - Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with
Variance Reduction and its Application to Optimization [50.83356836818667]
gradient Langevin Dynamics is one of the most fundamental algorithms to solve non-eps optimization problems.
In this paper, we show two variants of this kind, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z) - High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise [51.31435087414348]
It is essential to theoretically guarantee that algorithms provide small objective residual with high probability.
Existing methods for non-smooth convex optimization have complexity bounds with dependence on confidence level.
We propose novel stepsize rules for two methods with gradient clipping.
arXiv Detail & Related papers (2021-06-10T17:54:21Z) - SGD in the Large: Average-case Analysis, Asymptotics, and Stepsize
Criticality [15.640534097470923]
We propose a new framework for analyzing the dynamics of gradient descent (SGD) when both number of samples and dimensions are large.
Using this new framework, we show that the dynamics of SGD on a least squares problem with random data become deterministic in the large sample and dimensional limit.
arXiv Detail & Related papers (2021-02-08T18:00:13Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic
Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that attenuating step-size is required for exact convergence with the fact that constant step-size learns faster in time up to an error.
Rather than fixing the minibatch the step-size at the outset, we propose to allow parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z) - Bayesian Sparse learning with preconditioned stochastic gradient MCMC
and its applications [5.660384137948734]
The proposed algorithm converges to the correct distribution with a controllable bias under mild conditions.
We show that the proposed algorithm canally converge to the correct distribution with a controllable bias under mild conditions.
arXiv Detail & Related papers (2020-06-29T20:57:20Z) - Convergence of adaptive algorithms for weakly convex constrained
optimization [59.36386973876765]
We prove the $mathcaltilde O(t-1/4)$ rate of convergence for the norm of the gradient of Moreau envelope.
Our analysis works with mini-batch size of $1$, constant first and second order moment parameters, and possibly smooth optimization domains.
arXiv Detail & Related papers (2020-06-11T17:43:19Z) - Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient
Clipping [69.9674326582747]
We propose a new accelerated first-order method called clipped-SSTM for smooth convex optimization with heavy-tailed distributed noise in gradients.
We prove new complexity that outperform state-of-the-art results in this case.
We derive the first non-trivial high-probability complexity bounds for SGD with clipping without light-tails assumption on the noise.
arXiv Detail & Related papers (2020-05-21T17:05:27Z) - Stochastic Proximal Gradient Algorithm with Minibatches. Application to
Large Scale Learning Models [2.384873896423002]
We develop and analyze minibatch variants of gradient algorithm for general composite objective functions with nonsmooth components.
We provide complexity for constant and variable stepsize iteration policies obtaining that, for minibatch size $N$, after $mathcalO(frac1Nepsilon)$ $epsilon-$subity is attained in expected quadratic distance to optimal solution.
arXiv Detail & Related papers (2020-03-30T10:43:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.