Convergence of Momentum-Based Optimization Algorithms with Time-Varying Parameters
- URL: http://arxiv.org/abs/2506.11904v1
- Date: Fri, 13 Jun 2025 15:53:17 GMT
- Title: Convergence of Momentum-Based Optimization Algorithms with Time-Varying Parameters
- Authors: Mathukumalli Vidyasagar,
- Abstract summary: We present a unified algorithm for optimization that makes use of a "momentum" term.<n>The gradient depends not only on the current true gradient of the objective function, but also on the true gradient at the previous iteration.
- Score: 0.7614628596146599
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present a unified algorithm for stochastic optimization that makes use of a "momentum" term; in other words, the stochastic gradient depends not only on the current true gradient of the objective function, but also on the true gradient at the previous iteration. Our formulation includes the Stochastic Heavy Ball (SHB) and the Stochastic Nesterov Accelerated Gradient (SNAG) algorithms as special cases. In addition, in our formulation, the momentum term is allowed to vary as a function of time (i.e., the iteration counter). The assumptions on the stochastic gradient are the most general in the literature, in that it can be biased, and have a conditional variance that grows in an unbounded fashion as a function of time. This last feature is crucial in order to make the theory applicable to "zero-order" methods, where the gradient is estimated using just two function evaluations. We present a set of sufficient conditions for the convergence of the unified algorithm. These conditions are natural generalizations of the familiar Robbins-Monro and Kiefer-Wolfowitz-Blum conditions for standard stochastic gradient descent. We also analyze another method from the literature for the SHB algorithm with a time-varying momentum parameter, and show that it is impracticable.
Related papers
- Limit Theorems for Stochastic Gradient Descent with Infinite Variance [47.87144151929621]
We show that the gradient descent algorithm can be characterized as the stationary distribution of a suitably defined Ornstein-rnstein process driven by an appropriate L'evy process.<n>We also explore the applications of these results in linear regression and logistic regression models.
arXiv Detail & Related papers (2024-10-21T09:39:10Z) - An Accelerated Algorithm for Stochastic Bilevel Optimization under Unbounded Smoothness [15.656614304616006]
This paper investigates a class of bilevel optimization problems where the upper-level function is non convex and the lower-level problem is strongly convex.<n>These problems have significant applications in data learning, such as text classification using unbounded networks.<n>We propose a new Accelerated Bilevel Optimization algorithm named AccBO.
arXiv Detail & Related papers (2024-09-28T02:30:44Z) - One-step corrected projected stochastic gradient descent for statistical estimation [49.1574468325115]
It is based on the projected gradient descent on the log-likelihood function corrected by a single step of the Fisher scoring algorithm.
We show theoretically and by simulations that it is an interesting alternative to the usual gradient descent with averaging or the adaptative gradient descent.
arXiv Detail & Related papers (2023-06-09T13:43:07Z) - Exponential Concentration in Stochastic Approximation [0.8192907805418583]
We analyze the behavior of approximation algorithms where iterates, in expectation, progress towards an objective at each step.
We apply our results to several different Markov Approximation algorithms, specifically Projected Gradient Descent, Kiefer-Wolfowitz and Frank-Wolfe algorithms.
arXiv Detail & Related papers (2022-08-15T14:57:26Z) - Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with
Variance Reduction and its Application to Optimization [50.83356836818667]
gradient Langevin Dynamics is one of the most fundamental algorithms to solve non-eps optimization problems.
In this paper, we show two variants of this kind, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z) - Random-reshuffled SARAH does not need a full gradient computations [61.85897464405715]
The StochAstic Recursive grAdientritHm (SARAH) algorithm is a variance reduced variant of the Gradient Descent (SGD) algorithm.
In this paper, we remove the necessity of a full gradient.
The aggregated gradients serve as an estimate of a full gradient in the SARAH algorithm.
arXiv Detail & Related papers (2021-11-26T06:00:44Z) - A Two-Time-Scale Stochastic Optimization Framework with Applications in Control and Reinforcement Learning [13.908826484332282]
We study a new two-time-scale gradient method for solving optimization problems.
Our first contribution is to characterize the finite-time complexity of the proposed two-time-scale gradient algorithm.
We apply our framework to gradient-based policy evaluation algorithms in reinforcement learning.
arXiv Detail & Related papers (2021-09-29T23:15:23Z) - High-probability Bounds for Non-Convex Stochastic Optimization with
Heavy Tails [55.561406656549686]
We consider non- Hilbert optimization using first-order algorithms for which the gradient estimates may have tails.
We show that a combination of gradient, momentum, and normalized gradient descent convergence to critical points in high-probability with best-known iteration for smooth losses.
arXiv Detail & Related papers (2021-06-28T00:17:01Z) - Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box
Optimization Framework [100.36569795440889]
This work is on the iteration of zero-th-order (ZO) optimization which does not require first-order information.
We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of complexity as well as as function query cost.
arXiv Detail & Related papers (2020-12-21T17:29:58Z) - Stochastic optimization with momentum: convergence, fluctuations, and
traps avoidance [0.0]
In this paper, a general optimization procedure is studied, unifying several variants of the gradient descent such as, among others, the heavy ball method, the Nesterov Accelerated Gradient (S-NAG), and the widely used Adam method.
The avoidance is studied as a noisy discretization of a non-autonomous ordinary differential equation.
arXiv Detail & Related papers (2020-12-07T19:14:49Z) - Asymptotic Analysis of Conditioned Stochastic Gradient Descent [0.0]
We investigate a class of gradient descent (SGD) algorithms, called Conditioned SGD, based on a preconditioning of the gradient direction.
Almost sure convergence results, which may be of independent interest, are presented.
arXiv Detail & Related papers (2020-06-04T10:08:05Z) - Non-asymptotic bounds for stochastic optimization with biased noisy
gradient oracles [8.655294504286635]
We introduce biased gradient oracles to capture a setting where the function measurements have an estimation error.
Our proposed oracles are in practical contexts, for instance, risk measure estimation from a batch of independent and identically distributed simulation.
arXiv Detail & Related papers (2020-02-26T12:53:04Z) - Bounding the expected run-time of nonconvex optimization with early
stopping [2.7648976108201815]
This work examines the convergence of gradient-based optimization algorithms that use early stopping based on a validation function.
We derive conditions that guarantee this stopping rule is well-defined, and provide bounds on the expected number of iterations and gradient evaluations needed to meet this criterion.
arXiv Detail & Related papers (2020-02-20T16:43:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.