Accelerated Almost-Sure Convergence Rates for Nonconvex Stochastic
Gradient Descent using Stochastic Learning Rates
- URL: http://arxiv.org/abs/2110.12634v1
- Date: Mon, 25 Oct 2021 04:27:35 GMT
- Title: Accelerated Almost-Sure Convergence Rates for Nonconvex Stochastic
Gradient Descent using Stochastic Learning Rates
- Authors: Theodoros Mamalis, Dusan Stipanovic, Petros Voulgaris
- Abstract summary: This paper studies almost-sure convergence rates of Stochastic Gradient Descent when its learning rate is made stochastic rather than deterministic.
In particular, the learning rate is equipped with multiplicative stochasticity, producing a stochastic learning rate scheme.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale optimization problems require algorithms both effective and
efficient. One such popular and proven algorithm is Stochastic Gradient Descent
which uses first-order gradient information to solve these problems. This paper
studies almost-sure convergence rates of the Stochastic Gradient Descent method
when instead of deterministic, its learning rate becomes stochastic. In
particular, its learning rate is equipped with a multiplicative stochasticity,
producing a stochastic learning rate scheme. Theoretical results show
accelerated almost-sure convergence rates of Stochastic Gradient Descent in a
nonconvex setting when using an appropriate stochastic learning rate, compared
to a deterministic-learning-rate scheme. The theoretical results are verified
empirically.
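To make the scheme concrete, here is a minimal sketch (in Python/NumPy) of stochastic gradient descent whose deterministic step size is multiplied by a random factor at every iteration. The 1/t decay, the mean-one lognormal multiplier, and the toy least-squares objective are illustrative assumptions, not the paper's exact setup:

    import numpy as np

    rng = np.random.default_rng(0)

    def grad_estimate(w, batch):
        # Stochastic gradient of a toy least-squares objective 0.5*||Xw - y||^2 / n.
        X, y = batch
        return X.T @ (X @ w - y) / len(y)

    def sgd_stochastic_lr(X, y, w0, n_steps=5000, eta0=1.0, batch_size=8, sigma=0.3):
        # SGD whose deterministic step size eta_t is multiplied by a random factor m_t.
        # Assumptions (illustrative only): eta_t = eta0 / (t + 1), and m_t is lognormal
        # with E[m_t] = 1, so the stochastic learning rate is unbiased around eta_t.
        w = w0.copy()
        n = len(y)
        for t in range(n_steps):
            idx = rng.integers(0, n, size=batch_size)
            g = grad_estimate(w, (X[idx], y[idx]))
            eta_t = eta0 / (t + 1)                                   # deterministic schedule
            m_t = rng.lognormal(mean=-0.5 * sigma**2, sigma=sigma)   # multiplicative noise, mean 1
            w -= eta_t * m_t * g                                     # stochastic-learning-rate update
        return w

    # Toy data; setting sigma = 0 recovers the deterministic-learning-rate baseline.
    X = rng.normal(size=(256, 10))
    w_true = rng.normal(size=10)
    y = X @ w_true + 0.1 * rng.normal(size=256)
    w_hat = sgd_stochastic_lr(X, y, w0=np.zeros(10))
    print("distance to w_true:", np.linalg.norm(w_hat - w_true))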
Related papers
- Using Stochastic Gradient Descent to Smooth Nonconvex Functions: Analysis of Implicit Graduated Optimization [0.6906005491572401]
We show that the noise in stochastic gradient descent (SGD) has the effect of smoothing the objective function.
We analyze a new graduated optimization algorithm that varies the degree of smoothing through the learning rate and batch size.
arXiv Detail & Related papers (2023-11-15T07:27:40Z)
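As a rough illustration of the graduated-optimization idea in the entry above -- the implicit smoothing induced by SGD noise grows with the learning rate and shrinks with the batch size -- the following hypothetical sketch runs SGD in stages that move from a large learning rate and small batches (heavy smoothing) to a small learning rate and large batches. The schedule and the toy nonconvex objective are assumptions, not the algorithm analyzed in the paper:

    import numpy as np

    rng = np.random.default_rng(1)

    def minibatch_grad(w, per_sample_grad, batch_size, data):
        # Smaller batches give noisier gradients, i.e. heavier implicit smoothing.
        idx = rng.integers(0, len(data), size=batch_size)
        return np.mean([per_sample_grad(w, x) for x in data[idx]], axis=0)

    def graduated_sgd(per_sample_grad, data, w0,
                      stages=((0.5, 4), (0.1, 16), (0.02, 64)), steps_per_stage=1000):
        # Hypothetical schedule: each stage fixes a (learning_rate, batch_size) pair,
        # moving from heavy implicit smoothing to almost none.
        w = np.array(w0, dtype=float)
        for lr, batch_size in stages:
            for _ in range(steps_per_stage):
                w -= lr * minibatch_grad(w, per_sample_grad, batch_size, data)
        return w

    # Toy nonconvex per-sample objective: 0.5*||w - x||^2 + sum(cos(3*w)).
    def toy_grad(w, x):
        return (w - x) - 3.0 * np.sin(3.0 * w)

    data = rng.normal(size=(512, 2))
    print("final iterate:", graduated_sgd(toy_grad, data, w0=[2.0, -2.0]))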
- Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when done right -- by which we mean using specific insights from the optimisation and kernel communities -- gradient descent is highly effective.
We introduce a stochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices.
Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
arXiv Detail & Related papers (2023-10-31T16:15:13Z)
- Backward error analysis and the qualitative behaviour of stochastic optimization algorithms: Application to stochastic coordinate descent [1.534667887016089]
We propose a class of differential equations that approximate the dynamics of general optimization methods more closely than the original gradient flow.
We study the stability of the modified equation in the case of coordinate descent.
arXiv Detail & Related papers (2023-09-05T09:39:56Z)
- Byzantine-Robust Decentralized Stochastic Optimization with Stochastic Gradient Noise-Independent Learning Error [25.15075119957447]
We study Byzantine-robust stochastic optimization over a decentralized network, where every agent periodically communicates with its neighbors to exchange local models, and then updates its own local model by stochastic gradient descent (SGD).
The performance of such a method is affected by an unknown number of Byzantine agents, which behave adversarially during the optimization process.
arXiv Detail & Related papers (2023-08-10T02:14:23Z)
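The Byzantine-robust entry above describes decentralized SGD in which some neighbors may send adversarial models. The sketch below is a generic illustration, not that paper's aggregation rule: it replaces plain neighbor averaging with a coordinate-wise trimmed mean, one standard way to bound the influence of a limited number of Byzantine neighbors. The ring network, quadratic objectives, and trimming level are assumptions:

    import numpy as np

    rng = np.random.default_rng(2)

    def trimmed_mean(models, trim):
        # Coordinate-wise trimmed mean: drop the `trim` largest and smallest values per coordinate.
        stacked = np.sort(np.stack(models), axis=0)
        return stacked[trim:len(models) - trim].mean(axis=0)

    def decentralized_robust_sgd(grads, neighbors, n_agents, dim, byzantine,
                                 n_rounds=200, lr=0.1, trim=1):
        # Decentralized SGD with robust neighbor aggregation; `grads[i](w)` is agent i's
        # stochastic gradient, and Byzantine agents simply broadcast large random vectors.
        w = {i: np.zeros(dim) for i in range(n_agents)}
        for _ in range(n_rounds):
            msgs = {i: [rng.normal(scale=10.0, size=dim) if j in byzantine else w[j]
                        for j in neighbors[i]] + [w[i]]
                    for i in range(n_agents)}
            for i in range(n_agents):
                if i in byzantine:
                    continue
                agg = trimmed_mean(msgs[i], trim)     # robust aggregation of local models
                w[i] = agg - lr * grads[i](agg)       # local SGD step from the aggregated model
        return w

    # Toy setup: ring network, quadratic objectives with different optima, one Byzantine agent.
    n_agents, dim = 6, 3
    targets = [rng.normal(size=dim) for _ in range(n_agents)]
    grads = [lambda w, t=t: (w - t) + 0.1 * rng.normal(size=dim) for t in targets]
    neighbors = {i: [(i - 1) % n_agents, (i + 1) % n_agents] for i in range(n_agents)}
    models = decentralized_robust_sgd(grads, neighbors, n_agents, dim, byzantine={0})
    print("agent 1 model:", models[1])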
- One-step corrected projected stochastic gradient descent for statistical estimation [49.1574468325115]
It is based on the projected stochastic gradient descent on the log-likelihood function, corrected by a single step of the Fisher scoring algorithm.
We show theoretically and by simulations that it is an interesting alternative to the usual stochastic gradient descent with averaging or the adaptive stochastic gradient descent.
arXiv Detail & Related papers (2023-06-09T13:43:07Z)
- Non-Parametric Learning of Stochastic Differential Equations with Non-asymptotic Fast Rates of Convergence [65.63201894457404]
We propose a novel non-parametric learning paradigm for the identification of drift and diffusion coefficients of non-linear stochastic differential equations.
The key idea essentially consists of fitting an RKHS-based approximation of the corresponding Fokker-Planck equation to the observations.
arXiv Detail & Related papers (2023-05-24T20:43:47Z)
- Random-reshuffled SARAH does not need a full gradient computations [61.85897464405715]
The StochAstic Recursive grAdient algoRitHm (SARAH) is a variance-reduced variant of the Stochastic Gradient Descent (SGD) algorithm.
In this paper, we remove the necessity of a full gradient computation.
The aggregated gradients serve as an estimate of a full gradient in the SARAH algorithm.
arXiv Detail & Related papers (2021-11-26T06:00:44Z)
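For the SARAH entry above, the sketch below shows the recursive gradient estimator with random reshuffling, where the per-epoch anchor is formed from the stochastic gradients aggregated during the previous epoch instead of a fresh full-gradient pass. The bootstrap rule, step size, and toy finite-sum problem are illustrative assumptions rather than the paper's exact variant:

    import numpy as np

    rng = np.random.default_rng(3)

    def sarah_random_reshuffle(grad_i, n, w0, n_epochs=20, lr=0.02):
        # grad_i(w, i): gradient of the i-th component function.
        w = np.array(w0, dtype=float)
        # Bootstrap anchor from one mini-batch average (an assumption, not the paper's rule).
        v = np.mean([grad_i(w, i) for i in rng.choice(n, size=min(n, 16), replace=False)], axis=0)
        for _ in range(n_epochs):
            w_prev, w = w, w - lr * v
            agg = []
            for i in rng.permutation(n):          # random reshuffling, no full-gradient pass
                g = grad_i(w, i)
                v = g - grad_i(w_prev, i) + v     # SARAH recursive gradient update
                agg.append(g)
                w_prev, w = w, w - lr * v
            v = np.mean(agg, axis=0)              # aggregated gradients estimate a full gradient
        return w

    # Toy finite sum: f_i(w) = 0.5 * (a_i . w - b_i)^2.
    n, dim = 64, 5
    A = rng.normal(size=(n, dim))
    w_true = rng.normal(size=dim)
    b = A @ w_true
    grad_i = lambda w, i: (A[i] @ w - b[i]) * A[i]
    print("error:", np.linalg.norm(sarah_random_reshuffle(grad_i, n, np.zeros(dim)) - w_true))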
- Stochastic Learning Rate Optimization in the Stochastic Approximation and Online Learning Settings [0.0]
In this work, multiplicative stochasticity is applied to the learning rate of stochastic optimization algorithms, giving rise to stochastic learning-rate schemes.
Theoretical convergence results of Stochastic Gradient Descent equipped with this novel learning rate scheme are presented.
arXiv Detail & Related papers (2021-10-20T18:10:03Z)
- Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box Optimization Framework [100.36569795440889]
This work studies zeroth-order (ZO) optimization, which does not require first-order gradient information.
We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of complexity and function query cost.
arXiv Detail & Related papers (2020-12-21T17:29:58Z)
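For the zeroth-order entry above, the sketch below shows a generic coordinate-wise finite-difference gradient estimator with importance sampling over coordinates, reweighted to remain unbiased. It is a simplified illustration under assumed sampling probabilities, not the paper's hybrid ZO method:

    import numpy as np

    rng = np.random.default_rng(4)

    def zo_coordinate_grad(f, x, probs, n_samples=8, mu=1e-4):
        # Zeroth-order gradient estimate using coordinate importance sampling.
        # Each finite-difference term is divided by its sampling probability so the
        # estimator stays unbiased in expectation.
        d = len(x)
        g = np.zeros(d)
        coords = rng.choice(d, size=n_samples, p=probs)
        for j in coords:
            e = np.zeros(d)
            e[j] = 1.0
            fd = (f(x + mu * e) - f(x - mu * e)) / (2.0 * mu)   # central finite difference
            g[j] += fd / (probs[j] * n_samples)                 # importance-sampling reweighting
        return g

    def zo_descent(f, x0, probs, n_steps=500, lr=0.02):
        # Descent driven only by function queries (no first-order information).
        x = x0.copy()
        for _ in range(n_steps):
            x -= lr * zo_coordinate_grad(f, x, probs)
        return x

    # Toy black-box objective with one dominant coordinate; sample that coordinate more often.
    f = lambda x: 10.0 * x[0] ** 2 + np.sum(x[1:] ** 2)
    d = 5
    probs = np.array([0.5] + [0.5 / (d - 1)] * (d - 1))
    x_star = zo_descent(f, x0=np.ones(d), probs=probs)
    print("final objective:", f(x_star))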
- On Learning Rates and Schrödinger Operators [105.32118775014015]
We present a general theoretical analysis of the effect of the learning rate.
We find that the learning rate tends to zero for a broad class of non-neural functions.
arXiv Detail & Related papers (2020-04-15T09:52:37Z)
- Stochastic gradient descent with random learning rate [0.0]
We propose to optimize neural networks with a uniformly-distributed random learning rate.
By comparing the random learning rate protocol with cyclic and constant protocols, we suggest that the random choice is generically the best strategy in the small learning rate regime.
We provide supporting evidence through experiments on both shallow, fully-connected and deep, convolutional neural networks for image classification on the MNIST and CIFAR10 datasets.
arXiv Detail & Related papers (2020-03-15T21:36:46Z)
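The last entry proposes drawing the learning rate from a uniform distribution at every update. Below is a minimal sketch of that protocol on a toy logistic-regression problem; the interval (0, 2*eta_base), chosen so the mean matches a constant-rate baseline, and the synthetic data are assumptions for illustration, not the paper's experimental setup:

    import numpy as np

    rng = np.random.default_rng(5)

    def logistic_grad(w, X, y):
        # Mini-batch gradient of the logistic loss (labels in {0, 1}).
        p = 1.0 / (1.0 + np.exp(-X @ w))
        return X.T @ (p - y) / len(y)

    def sgd_uniform_lr(X, y, w0, n_steps=3000, eta_base=0.1, batch_size=16):
        # SGD with a uniformly-distributed random learning rate: each step draws
        # eta_t ~ Uniform(0, 2 * eta_base), so the mean step size matches a
        # constant-learning-rate baseline with rate eta_base (an illustrative range).
        w = w0.copy()
        n = len(y)
        for _ in range(n_steps):
            idx = rng.integers(0, n, size=batch_size)
            eta_t = rng.uniform(0.0, 2.0 * eta_base)     # random learning rate for this step
            w -= eta_t * logistic_grad(w, X[idx], y[idx])
        return w

    # Toy binary classification data.
    X = rng.normal(size=(500, 10))
    w_true = rng.normal(size=10)
    y = (X @ w_true + 0.5 * rng.normal(size=500) > 0).astype(float)
    w_hat = sgd_uniform_lr(X, y, w0=np.zeros(10))
    print("train accuracy:", np.mean(((X @ w_hat) > 0) == (y > 0.5)))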
This list is automatically generated from the titles and abstracts of the papers in this site.