Related papers: Convergence of the Stochastic Heavy Ball Method With Approximate Gradients and/or Block Updating

Convergence of the Stochastic Heavy Ball Method With Approximate Gradients and/or Block Updating

URL: http://arxiv.org/abs/2303.16241v4
Date: Fri, 25 Apr 2025 05:03:12 GMT
Title: Convergence of the Stochastic Heavy Ball Method With Approximate Gradients and/or Block Updating
Authors: Uday Kiran Reddy Tadipatri, Mathukumalli Vidyasagar,
Abstract summary: We establish the convergence of the Heavy Ball (SHB) algorithm under more general conditions than in the current literature.<n>Our analysis embraces not only convex functions, but also more general functions that satisfy the PL (Polyak-Lojasiewicz) and KL (Kurdyka-Lojasiewicz) conditions.
Score: 0.6445605125467572
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we establish the convergence of the stochastic Heavy Ball (SHB) algorithm under more general conditions than in the current literature. Specifically, (i) The stochastic gradient is permitted to be biased, and also, to have conditional variance that grows over time (or iteration number). This feature is essential when applying SHB with zeroth-order methods, which use only two function evaluations to approximate the gradient. In contrast, all existing papers assume that the stochastic gradient is unbiased and/or has bounded conditional variance. (ii) The step sizes are permitted to be random, which is essential when applying SHB with block updating. The sufficient conditions for convergence are stochastic analogs of the well-known Robbins-Monro conditions. This is in contrast to existing papers where more restrictive conditions are imposed on the step size sequence. (iii) Our analysis embraces not only convex functions, but also more general functions that satisfy the PL (Polyak-{\L}ojasiewicz) and KL (Kurdyka-{\L}ojasiewicz) conditions. (iv) If the stochastic gradient is unbiased and has bounded variance, and the objective function satisfies (PL), then the iterations of SHB match the known best rates for convex functions. (v) We establish the almost-sure convergence of the iterations, as opposed to convergence in the mean or convergence in probability, which is the case in much of the literature. (vi) Each of the above convergence results continue to hold if full-coordinate updating is replaced by any one of three widely-used updating methods. In addition, numerical computations are carried out to illustrate the above points.

Related papers

Revisiting Convergence: Shuffling Complexity Beyond Lipschitz Smoothness [50.78508362183774]
Shuffling-type gradient methods are favored in practice for their simplicity and rapid empirical performance.<n>Most require the Lipschitz condition, which is often not met in common machine learning schemes.
arXiv Detail & Related papers (2025-07-11T15:36:48Z)
Convergence of Momentum-Based Optimization Algorithms with Time-Varying Parameters [0.7614628596146599]
We present a unified algorithm for optimization that makes use of a "momentum" term.<n>The gradient depends not only on the current true gradient of the objective function, but also on the true gradient at the previous iteration.
arXiv Detail & Related papers (2025-06-13T15:53:17Z)
Ordering-based Conditions for Global Convergence of Policy Gradient Methods [73.6366483406033]
We prove that, for finite-arm bandits with linear function approximation, the global convergence of policy gradient (PG) methods depends on inter-related properties between the policy update and the representation.<n>Overall, these observations call into question approximation error as an appropriate quantity for characterizing the global convergence of PG methods under linear function approximation.
arXiv Detail & Related papers (2025-04-02T21:06:28Z)
An Adaptive Stochastic Gradient Method with Non-negative Gauss-Newton Stepsizes [17.804065824245402]
In machine learning applications, each loss function is non-negative and can be expressed as the composition of a square and its real-valued square root. We show how to apply the Gauss-Newton method or the Levssquardt method to minimize the average of smooth but possibly non-negative functions.
arXiv Detail & Related papers (2024-07-05T08:53:06Z)
On the Last-Iterate Convergence of Shuffling Gradient Methods [21.865728815935665]
We prove the first last-iterate convergence rates for shuffling gradient methods with respect to the objective value. Our results either (nearly) match the existing last-iterate lower bounds or are as fast as the previous best upper bounds for the average iterate.
arXiv Detail & Related papers (2024-03-12T15:01:17Z)
Convergence Rates for Stochastic Approximation: Biased Noise with Unbounded Variance, and Applications [2.0584253077707477]
We study the convergence properties of the Gradient Descent (SGD) method for finding a stationary point of an objective function $J(cdot)$. Our results apply to a class of invex'' functions, which have the property that every stationary point is also a global minimizer.
arXiv Detail & Related papers (2023-12-05T15:22:39Z)
Convex and Non-convex Optimization Under Generalized Smoothness [69.69521650503431]
An analysis of convex and non- optimization methods often requires the Lipsitzness gradient, which limits the analysis by this trajectorys. Recent work generalizes the gradient setting via the non-uniform smoothness condition.
arXiv Detail & Related papers (2023-06-02T04:21:59Z)
Optimization using Parallel Gradient Evaluations on Multiple Parameters [51.64614793990665]
We propose a first-order method for convex optimization, where gradients from multiple parameters can be used during each step of gradient descent. Our method uses gradients from multiple parameters in synergy to update these parameters together towards the optima.
arXiv Detail & Related papers (2023-02-06T23:39:13Z)
High-Probability Bounds for Stochastic Optimization and Variational Inequalities: the Case of Unbounded Variance [59.211456992422136]
We propose algorithms with high-probability convergence results under less restrictive assumptions. These results justify the usage of the considered methods for solving problems that do not fit standard functional classes in optimization.
arXiv Detail & Related papers (2023-02-02T10:37:23Z)
SoftTreeMax: Policy Gradient with Tree Search [72.9513807133171]
We introduce SoftTreeMax, the first approach that integrates tree-search into policy gradient. On Atari, SoftTreeMax demonstrates up to 5x better performance in faster run-time compared with distributed PPO.
arXiv Detail & Related papers (2022-09-28T09:55:47Z)
Convergence of Batch Stochastic Gradient Descent Methods with Approximate Gradients and/or Noisy Measurements: Theory and Computational Results [0.9900482274337404]
We study convex optimization using a very general formulation called BSGD (Block Gradient Descent) We establish conditions for BSGD to converge to the global minimum, based on approximation theory. We show that when approximate gradients are used, BSGD converges while momentum-based methods can diverge.
arXiv Detail & Related papers (2022-09-12T16:23:15Z)
Optimal Extragradient-Based Bilinearly-Coupled Saddle-Point Optimization [116.89941263390769]
We consider the smooth convex-concave bilinearly-coupled saddle-point problem, $min_mathbfxmax_mathbfyF(mathbfx) + H(mathbfx,mathbfy)$, where one has access to first-order oracles for $F$, $G$ as well as the bilinear coupling function $H$. We present a emphaccelerated gradient-extragradient (AG-EG) descent-ascent algorithm that combines extragrad
arXiv Detail & Related papers (2022-06-17T06:10:20Z)
Posterior and Computational Uncertainty in Gaussian Processes [52.26904059556759]
Gaussian processes scale prohibitively with the size of the dataset. Many approximation methods have been developed, which inevitably introduce approximation error. This additional source of uncertainty, due to limited computation, is entirely ignored when using the approximate posterior. We develop a new class of methods that provides consistent estimation of the combined uncertainty arising from both the finite number of data observed and the finite amount of computation expended.
arXiv Detail & Related papers (2022-05-30T22:16:25Z)
Hessian Averaging in Stochastic Newton Methods Achieves Superlinear Convergence [69.65563161962245]
We consider a smooth and strongly convex objective function using a Newton method. We show that there exists a universal weighted averaging scheme that transitions to local convergence at an optimal stage.
arXiv Detail & Related papers (2022-04-20T07:14:21Z)
On Almost Sure Convergence Rates of Stochastic Gradient Methods [11.367487348673793]
We show for the first time that the almost sure convergence rates obtained for gradient methods are arbitrarily close to their optimal convergence rates possible. For non- objective functions, we only show that a weighted average of squared gradient norms converges not almost surely, but also lasts to zero almost surely.
arXiv Detail & Related papers (2022-02-09T06:05:30Z)
Stochastic Gradient Descent-Ascent and Consensus Optimization for Smooth Games: Convergence Analysis under Expected Co-coercivity [49.66890309455787]
We introduce the expected co-coercivity condition, explain its benefits, and provide the first last-iterate convergence guarantees of SGDA and SCO. We prove linear convergence of both methods to a neighborhood of the solution when they use constant step-size. Our convergence guarantees hold under the arbitrary sampling paradigm, and we give insights into the complexity of minibatching.
arXiv Detail & Related papers (2021-06-30T18:32:46Z)
On the Convergence of Stochastic Extragradient for Bilinear Games with Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the ExtraGradient (SEG) method with constant step size, and present variations of the method that yield favorable convergence. We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
arXiv Detail & Related papers (2021-06-30T17:51:36Z)
Correcting Momentum with Second-order Information [50.992629498861724]
We develop a new algorithm for non-critical optimization that finds an $O(epsilon)$epsilon point in the optimal product. We validate our results on a variety of large-scale deep learning benchmarks and architectures.
arXiv Detail & Related papers (2021-03-04T19:01:20Z)
Cyclic Coordinate Dual Averaging with Extrapolation [22.234715500748074]
We introduce a new block coordinate method that applies to the general class of variational inequality (VI) problems with monotone operators. The resulting convergence bounds match the optimal convergence bounds of full gradient methods. For $m$ coordinate blocks, the resulting gradient Lipschitz constant in our bounds is never larger than a factor $sqrtm$ compared to the traditional Euclidean Lipschitz constant.
arXiv Detail & Related papers (2021-02-26T00:28:58Z)
Convergence of Gradient Algorithms for Nonconvex $C^{1+\alpha}$ Cost Functions [2.538209532048867]
We show a byproduct that a gradient converges and provide an explicit upper bound on the convergence rate. This allows us to prove the almost sure convergence by Doobmartin's supergale convergence theorem.
arXiv Detail & Related papers (2020-12-01T16:48:59Z)
Reparametrizing gradient descent [0.0]
We propose an optimization algorithm which we call norm-adapted gradient descent. Our algorithm can also be compared to quasi-Newton methods, but we seek roots rather than stationary points.
arXiv Detail & Related papers (2020-10-09T20:22:29Z)
Multi-kernel Passive Stochastic Gradient Algorithms and Transfer Learning [21.796874356469644]
The gradient algorithm does not have control over the location where noisy gradients of the cost function are evaluated. The algorithm performs substantially better in high dimensional problems and incorporates variance reduction.
arXiv Detail & Related papers (2020-08-23T11:55:19Z)
On Linear Stochastic Approximation: Fine-grained Polyak-Ruppert and Non-Asymptotic Concentration [115.1954841020189]
We study the inequality and non-asymptotic properties of approximation procedures with Polyak-Ruppert averaging. We prove a central limit theorem (CLT) for the averaged iterates with fixed step size and number of iterations going to infinity.
arXiv Detail & Related papers (2020-04-09T17:54:18Z)
Variance Reduction with Sparse Gradients [82.41780420431205]
Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients. We introduce a new sparsity operator: The random-top-k operator. Our algorithm consistently outperforms SpiderBoost on various tasks including image classification, natural language processing, and sparse matrix factorization.
arXiv Detail & Related papers (2020-01-27T08:23:58Z)

This list is automatically generated from the titles and abstracts of the papers in this site.