Related papers: Convergence rates of stochastic gradient method with independent sequences of step-size and momentum weight

Convergence rates of stochastic gradient method with independent sequences of step-size and momentum weight

URL: http://arxiv.org/abs/2408.02678v1
Date: Wed, 31 Jul 2024 04:25:39 GMT
Title: Convergence rates of stochastic gradient method with independent sequences of step-size and momentum weight
Authors: Wen-Liang Hwang,
Abstract summary: We analyze the convergence rate using programming with Polyak's acceleration. We show that the convergence rate can be written as exponential in step-size and momentum weight.
Score: 1.4141453107129398
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In large-scale learning algorithms, the momentum term is usually included in the stochastic sub-gradient method to improve the learning speed because it can navigate ravines efficiently to reach a local minimum. However, step-size and momentum weight hyper-parameters must be appropriately tuned to optimize convergence. We thus analyze the convergence rate using stochastic programming with Polyak's acceleration of two commonly used step-size learning rates: ``diminishing-to-zero" and ``constant-and-drop" (where the sequence is divided into stages and a constant step-size is applied at each stage) under strongly convex functions over a compact convex set with bounded sub-gradients. For the former, we show that the convergence rate can be written as a product of exponential in step-size and polynomial in momentum weight. Our analysis justifies the convergence of using the default momentum weight setting and the diminishing-to-zero step-size sequence in large-scale machine learning software. For the latter, we present the condition for the momentum weight sequence to converge at each stage.

Related papers

Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training. Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $mathcalO( ln(T) / T 1 - frac1alpha ).
arXiv Detail & Related papers (2024-03-11T09:10:37Z)
Incremental Quasi-Newton Methods with Faster Superlinear Convergence Rates [50.36933471975506]
We consider the finite-sum optimization problem, where each component function is strongly convex and has Lipschitz continuous gradient and Hessian. The recently proposed incremental quasi-Newton method is based on BFGS update and achieves a local superlinear convergence rate. This paper proposes a more efficient quasi-Newton method by incorporating the symmetric rank-1 update into the incremental framework.
arXiv Detail & Related papers (2024-02-04T05:54:51Z)
Formal guarantees for heuristic optimization algorithms used in machine learning [6.978625807687497]
Gradient Descent (SGD) and its variants have become the dominant methods in the large-scale optimization machine learning (ML) problems. We provide formal guarantees of a few convex optimization methods and proposing improved algorithms.
arXiv Detail & Related papers (2022-07-31T19:41:22Z)
Faster One-Sample Stochastic Conditional Gradient Method for Composite Convex Minimization [61.26619639722804]
We propose a conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms. The proposed method, equipped with an average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques.
arXiv Detail & Related papers (2022-02-26T19:10:48Z)
Convergence and Stability of the Stochastic Proximal Point Algorithm with Momentum [14.158845925610438]
We show how a gradient proximal algorithm with momentum (PPA) allows faster convergence to a neighborhood compared to the proximal algorithm (PPA) with better contraction factor.
arXiv Detail & Related papers (2021-11-11T12:17:22Z)
On the Convergence of Stochastic Extragradient for Bilinear Games with Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the ExtraGradient (SEG) method with constant step size, and present variations of the method that yield favorable convergence. We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
arXiv Detail & Related papers (2021-06-30T17:51:36Z)
Dynamics of Stochastic Momentum Methods on Large-scale, Quadratic Models [0.2741266294612776]
We analyze a class of gradient algorithms with momentum on a high-dimensional random least squares problem. We show that (small-batch) momentum with a fixed momentum parameter provides no actual performance improvement over SGD when step sizes are adjusted correctly. In the non-strongly convex setting, it is possible to get a large improvement over SGD using momentum.
arXiv Detail & Related papers (2021-06-07T15:08:24Z)
A Unified Analysis of First-Order Methods for Smooth Games via Integral Quadratic Constraints [10.578409461429626]
In this work, we adapt the integral quadratic constraints theory to first-order methods for smooth and strongly-varying games and iteration. We provide emphfor the first time a global convergence rate for the negative momentum method(NM) with an complexity $mathcalO(kappa1.5)$, which matches its known lower bound. We show that it is impossible for an algorithm with one step of memory to achieve acceleration if it only queries the gradient once per batch.
arXiv Detail & Related papers (2020-09-23T20:02:00Z)
Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that attenuating step-size is required for exact convergence with the fact that constant step-size learns faster in time up to an error. Rather than fixing the minibatch the step-size at the outset, we propose to allow parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z)
Hessian-Free High-Resolution Nesterov Acceleration for Sampling [55.498092486970364]
Nesterov's Accelerated Gradient (NAG) for optimization has better performance than its continuous time limit (noiseless kinetic Langevin) when a finite step-size is employed. This work explores the sampling counterpart of this phenonemon and proposes a diffusion process, whose discretizations can yield accelerated gradient-based MCMC methods.
arXiv Detail & Related papers (2020-06-16T15:07:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.