Acceleration of stochastic gradient descent with momentum by averaging: finite-sample rates and asymptotic normality
- URL: http://arxiv.org/abs/2305.17665v2
- Date: Thu, 1 Feb 2024 12:25:16 GMT
- Title: Acceleration of stochastic gradient descent with momentum by averaging: finite-sample rates and asymptotic normality
- Authors: Kejie Tang, Weidong Liu, Yichen Zhang and Xi Chen
- Abstract summary: Stochastic gradient descent with momentum (SGDM) has been widely used in many machine learning and statistical applications.
We analyze the finite-sample convergence rate of SGDM under the strongly convex settings.
We show that, with a large batch size, the mini-batch SGDM converges faster than the mini-batch SGD to a neighborhood of the optimal value.
- Score: 7.793751573538388
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic gradient descent with momentum (SGDM) has been widely used in many
machine learning and statistical applications. Despite the observed empirical
benefits of SGDM over traditional SGD, the theoretical understanding of the
role of momentum for different learning rates in the optimization process
remains widely open. We analyze the finite-sample convergence rate of SGDM
under the strongly convex settings and show that, with a large batch size, the
mini-batch SGDM converges faster than the mini-batch SGD to a neighborhood of
the optimal value. Additionally, our findings, supported by theoretical
analysis and numerical experiments, indicate that SGDM permits broader choices
of learning rates. Furthermore, we analyze the Polyak-averaging version of the
SGDM estimator, establish its asymptotic normality, and justify its asymptotic
equivalence to the averaged SGD. The asymptotic distribution of the averaged
SGDM enables uncertainty quantification of the algorithm output and statistical
inference of the model parameters.
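The abstract's two main ingredients, mini-batch SGDM and Polyak (iterate) averaging, can be sketched as follows. This is a minimal illustration on a toy strongly convex quadratic, not the authors' code; the names `sgdm_polyak` and `noisy_grad` and all parameter values are illustrative assumptions.

```python
import numpy as np

def sgdm_polyak(grad, x0, lr=0.1, beta=0.9, batch=32, steps=500, rng=None):
    """Mini-batch SGD with heavy-ball momentum and Polyak iterate averaging.

    grad(x, rng, batch) should return a noisy mini-batch gradient at x.
    Returns the last iterate and the running average of all iterates.
    """
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)      # momentum buffer
    avg = np.zeros_like(x)    # running Polyak average of iterates
    for t in range(1, steps + 1):
        g = grad(x, rng, batch)
        v = beta * v + g      # heavy-ball momentum update
        x = x - lr * v
        avg += (x - avg) / t  # online mean: avg_t = (x_1 + ... + x_t) / t
    return x, avg

# Toy strongly convex objective f(x) = 0.5 * x' A x, minimized at x = 0,
# with mini-batch gradient noise shrinking like 1/sqrt(batch).
A = np.diag([1.0, 10.0])

def noisy_grad(x, rng, batch):
    noise = rng.normal(scale=1.0 / np.sqrt(batch), size=x.shape)
    return A @ x + noise

x_last, x_avg = sgdm_polyak(noisy_grad, x0=np.ones(2))
```

In line with the paper's message, the averaged iterate `x_avg` smooths out the stationary gradient noise that the last iterate `x_last` retains.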
Related papers
- Does Worst-Performing Agent Lead the Pack? Analyzing Agent Dynamics in Unified Distributed SGD [7.434126318858966]
Distributed learning is essential to train machine learning algorithms across heterogeneous agents.
We conduct an analysis of Unified Distributed SGD (UD-SGD).
We assess how different sampling strategies, such as i.i.d. sampling, shuffling, and Markovian sampling, affect the convergence speed of UD-SGD.
arXiv Detail & Related papers (2024-09-26T03:12:20Z)
- Hitting the High-Dimensional Notes: An ODE for SGD learning dynamics on GLMs and multi-index models [10.781866671930857]
We analyze the dynamics of streaming gradient descent (SGD) in the high-dimensional limit.
We demonstrate a deterministic equivalent of SGD in the form of a system of ordinary differential equations.
In addition to the deterministic equivalent, we introduce an SDE with a simplified diffusion coefficient.
arXiv Detail & Related papers (2023-08-17T13:33:02Z)
- Weighted Averaged Stochastic Gradient Descent: Asymptotic Normality and Optimality [5.817158625734484]
Stochastic Gradient Descent (SGD) is one of the simplest and most popular algorithms in modern statistical and machine learning.
Various averaging schemes have been proposed to accelerate the convergence of SGD in different settings.
arXiv Detail & Related papers (2023-07-13T17:29:01Z)
- Convergence of mean-field Langevin dynamics: Time and space discretization, stochastic gradient, and variance reduction [49.66486092259376]
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift.
Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures.
We provide a framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and gradient approximation.
arXiv Detail & Related papers (2023-06-12T16:28:11Z)
- Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z)
- Low-Precision Stochastic Gradient Langevin Dynamics [70.69923368584588]
We provide the first study of low-precision Stochastic Gradient Langevin Dynamics (SGLD), showing that its costs can be significantly reduced without sacrificing performance.
We develop a new quantization function for SGLD that preserves the variance in each update step.
We demonstrate that low-precision SGLD achieves comparable performance to full-precision SGLD with only 8 bits on a variety of deep learning tasks.
arXiv Detail & Related papers (2022-06-20T17:25:41Z)
- On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z)
- On the Generalization of Stochastic Gradient Descent with Momentum [58.900860437254885]
We first show that there exists a convex loss function for which algorithmic stability fails to establish generalization guarantees.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, and show that it admits an upper-bound on the generalization error.
For the special case of strongly convex loss functions, we find a range of momentum values such that multiple epochs of standard SGDM, as a special form of SGDEM, also generalize.
arXiv Detail & Related papers (2021-02-26T18:58:29Z)
- Momentum via Primal Averaging: Theoretical Insights and Learning Rate Schedules for Non-Convex Optimization [10.660480034605241]
Momentum methods are now used pervasively within the machine learning community for training non-convex models such as deep neural networks.
In this work we develop a Lyapunov analysis of SGD with momentum, by utilizing an equivalent rewriting of the method known as the stochastic primal averaging (SGDSPA) form.
arXiv Detail & Related papers (2020-10-01T13:46:32Z)
- On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.