Acceleration of stochastic gradient descent with momentum by averaging: finite-sample rates and asymptotic normality
- URL: http://arxiv.org/abs/2305.17665v2
- Date: Thu, 1 Feb 2024 12:25:16 GMT
- Title: Acceleration of stochastic gradient descent with momentum by averaging: finite-sample rates and asymptotic normality
- Authors: Kejie Tang, Weidong Liu, Yichen Zhang and Xi Chen
- Abstract summary: Stochastic gradient descent with momentum (SGDM) has been widely used in many machine learning and statistical applications.
We analyze the finite-sample convergence rate of SGDM under the strongly convex settings.
We show that, with a large batch size, the mini-batch SGDM converges faster than the mini-batch SGD to a neighborhood of the optimal value.
- Score: 7.793751573538388
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic gradient descent with momentum (SGDM) has been widely used in many
machine learning and statistical applications. Despite the observed empirical
benefits of SGDM over traditional SGD, the theoretical understanding of the
role of momentum for different learning rates in the optimization process
remains widely open. We analyze the finite-sample convergence rate of SGDM
under the strongly convex settings and show that, with a large batch size, the
mini-batch SGDM converges faster than the mini-batch SGD to a neighborhood of
the optimal value. Additionally, our findings, supported by theoretical
analysis and numerical experiments, indicate that SGDM permits broader choices
of learning rates. Furthermore, we analyze the Polyak-averaging version of the
SGDM estimator, establish its asymptotic normality, and justify its asymptotic
equivalence to the averaged SGD. The asymptotic distribution of the averaged
SGDM enables uncertainty quantification of the algorithm output and statistical
inference of the model parameters.
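The abstract's two main ingredients, mini-batch SGDM and Polyak (iterate) averaging, can be sketched as follows. This is a minimal illustration on a toy strongly convex quadratic, not the authors' code; the names `sgdm_polyak` and `noisy_grad` and all parameter values are illustrative assumptions.

```python
import numpy as np

def sgdm_polyak(grad, x0, lr=0.1, beta=0.9, batch=32, steps=500, rng=None):
    """Mini-batch SGD with heavy-ball momentum and Polyak iterate averaging.

    grad(x, rng, batch) should return a noisy mini-batch gradient at x.
    Returns the last iterate and the running average of all iterates.
    """
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)      # momentum buffer
    avg = np.zeros_like(x)    # running Polyak average of iterates
    for t in range(1, steps + 1):
        g = grad(x, rng, batch)
        v = beta * v + g      # heavy-ball momentum update
        x = x - lr * v
        avg += (x - avg) / t  # online mean: avg_t = (x_1 + ... + x_t) / t
    return x, avg

# Toy strongly convex objective f(x) = 0.5 * x' A x, minimized at x = 0,
# with mini-batch gradient noise shrinking like 1/sqrt(batch).
A = np.diag([1.0, 10.0])

def noisy_grad(x, rng, batch):
    noise = rng.normal(scale=1.0 / np.sqrt(batch), size=x.shape)
    return A @ x + noise

x_last, x_avg = sgdm_polyak(noisy_grad, x0=np.ones(2))
```

In line with the paper's message, the averaged iterate `x_avg` smooths out the stationary gradient noise that the last iterate `x_last` retains.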
Related papers
- Does Worst-Performing Agent Lead the Pack? Analyzing Agent Dynamics in Unified Distributed SGD [7.434126318858966]
Distributed learning is essential to train machine learning algorithms across heterogeneous agents.
We conduct an analysis of Unified Distributed SGD (UD-SGD).
We assess how different sampling strategies, such as i.i.d. sampling, shuffling, and Markovian sampling, affect the convergence speed of UD-SGD.
arXiv Detail & Related papers (2024-09-26T03:12:20Z)
- Hitting the High-Dimensional Notes: An ODE for SGD learning dynamics on GLMs and multi-index models [10.781866671930857]
We analyze the dynamics of streaming gradient descent (SGD) in the high-dimensional limit.
We demonstrate a deterministic equivalent of SGD in the form of a system of ordinary differential equations.
In addition to the deterministic equivalent, we introduce an SDE with a simplified diffusion coefficient.
arXiv Detail & Related papers (2023-08-17T13:33:02Z)
- Weighted Averaged Stochastic Gradient Descent: Asymptotic Normality and Optimality [5.817158625734484]
Stochastic Gradient Descent (SGD) is one of the simplest and most popular algorithms in modern statistical and machine learning.
Various averaging schemes have been proposed to accelerate the convergence of SGD in different settings.
arXiv Detail & Related papers (2023-07-13T17:29:01Z)
- Convergence of mean-field Langevin dynamics: Time and space discretization, stochastic gradient, and variance reduction [49.66486092259376]
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift.
Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures.
We provide a framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and gradient approximation.
arXiv Detail & Related papers (2023-06-12T16:28:11Z)
- Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z)
- Low-Precision Stochastic Gradient Langevin Dynamics [70.69923368584588]
We provide the first study of low-precision Stochastic Gradient Langevin Dynamics (SGLD), showing that its costs can be significantly reduced without sacrificing performance.
We develop a new quantization function for SGLD that preserves the variance in each update step.
We demonstrate that low-precision SGLD achieves comparable performance to full-precision SGLD with only 8 bits on a variety of deep learning tasks.
arXiv Detail & Related papers (2022-06-20T17:25:41Z)
- On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z)
- On the Generalization of Stochastic Gradient Descent with Momentum [58.900860437254885]
We first show that there exists a convex loss function for which algorithmic stability fails to establish generalization guarantees.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, and show that it admits an upper-bound on the generalization error.
For the special case of strongly convex loss functions, we find a range of momentum values such that multiple epochs of standard SGDM, as a special form of SGDEM, also generalize.
arXiv Detail & Related papers (2021-02-26T18:58:29Z)
- Momentum via Primal Averaging: Theoretical Insights and Learning Rate Schedules for Non-Convex Optimization [10.660480034605241]
Momentum methods are now used pervasively within the machine learning community for training non-convex models such as deep neural networks.
In this work we develop a Lyapunov analysis of SGD with momentum, by utilizing an equivalent rewriting of the method known as the stochastic primal averaging (SGDSPA) form.
arXiv Detail & Related papers (2020-10-01T13:46:32Z)
- On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.