Sharper Convergence Guarantees for Asynchronous SGD for Distributed and
Federated Learning
- URL: http://arxiv.org/abs/2206.08307v1
- Date: Thu, 16 Jun 2022 17:10:57 GMT
- Title: Sharper Convergence Guarantees for Asynchronous SGD for Distributed and
Federated Learning
- Authors: Anastasia Koloskova, Sebastian U. Stich, Martin Jaggi
- Abstract summary: We study asynchronous SGD for distributed training over workers with varying computation and communication frequency.
In this work, we obtain a tighter convergence rate of $\mathcal{O}\!\left(\sigma^2\epsilon^{-2}+ \sqrt{\tau_{\max}\tau_{avg}}\epsilon^{-1}\right)$.
We also show that the heterogeneity term in the convergence rate is affected only by the average delay within each worker.
- Score: 77.22019100456595
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the asynchronous stochastic gradient descent algorithm for
distributed training over $n$ workers which have varying computation and
communication frequency over time. In this algorithm, workers compute
stochastic gradients in parallel at their own pace and return those to the
server without any synchronization. Existing convergence rates of this
algorithm for non-convex smooth objectives depend on the maximum gradient delay
$\tau_{\max}$ and show that an $\epsilon$-stationary point is reached after
$\mathcal{O}\!\left(\sigma^2\epsilon^{-2}+ \tau_{\max}\epsilon^{-1}\right)$
iterations, where $\sigma$ denotes the variance of stochastic gradients.
In this work (i) we obtain a tighter convergence rate of
$\mathcal{O}\!\left(\sigma^2\epsilon^{-2}+
\sqrt{\tau_{\max}\tau_{avg}}\epsilon^{-1}\right)$ without any change in the
algorithm where $\tau_{avg}$ is the average delay, which can be significantly
smaller than $\tau_{\max}$. We also provide (ii) a simple delay-adaptive
learning rate scheme, under which asynchronous SGD achieves a convergence rate
of $\mathcal{O}\!\left(\sigma^2\epsilon^{-2}+ \tau_{avg}\epsilon^{-1}\right)$,
and does not require any extra hyperparameter tuning nor extra communications.
Our result allows us to show, for the first time, that asynchronous SGD is always
faster than mini-batch SGD. In addition, (iii) we consider the case of
heterogeneous functions motivated by federated learning applications and
improve the convergence rate by proving a weaker dependence on the maximum
delay compared to prior works. In particular, we show that the heterogeneity
term in convergence rate is only affected by the average delay within each
worker.
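To make the setting concrete, the sketch below simulates the asynchronous server loop described in the abstract: each worker computes a stochastic gradient at a stale copy of the iterate, and the server applies it immediately on arrival. The delay-adaptive rule `lr = base_lr / max(1, delay)` is an illustrative assumption standing in for the scheme of item (ii), not the paper's exact step size; the quadratic objective, constants, and random worker model are likewise hypothetical. Note also why (i) helps: since $\tau_{avg} \le \tau_{\max}$, we always have $\sqrt{\tau_{\max}\tau_{avg}} \le \tau_{\max}$, with a large gap when a single straggler inflates $\tau_{\max}$.

```python
import numpy as np

# Illustrative simulation of asynchronous SGD with a delay-adaptive step size.
# The rule eta_t = base_lr / max(1, delay) is a hedged stand-in for the paper's
# delay-adaptive scheme; the exact rule and constants in the paper differ.

rng = np.random.default_rng(0)
n_workers, dim, T = 4, 10, 2000
base_lr, noise_std = 0.1, 0.5

x = rng.normal(size=dim)                  # shared server iterate
# Each worker stores the iterate it last read and the step it read it at.
snapshots = [(x.copy(), 0) for _ in range(n_workers)]

def grad(z):
    return 2 * z                          # gradient of the toy objective f(z) = ||z||^2

for t in range(1, T + 1):
    w = rng.integers(n_workers)           # a random worker finishes next (no synchronization)
    x_old, t_read = snapshots[w]
    delay = t - t_read                    # gradient staleness tau_t
    g = grad(x_old) + noise_std * rng.normal(size=dim)
    lr = base_lr / max(1, delay)          # delay-adaptive step size (illustrative)
    x -= lr * g                           # server applies the stale gradient on arrival
    snapshots[w] = (x.copy(), t)          # worker reads the fresh iterate and restarts

print(f"final ||x||^2 = {x @ x:.4f}")
```

Scaling the step size inversely with the observed staleness damps updates from stragglers while letting fast workers take near-full steps, which is the intuition behind trading the $\tau_{\max}$ dependence for $\tau_{avg}$.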
Related papers
- Faster Stochastic Optimization with Arbitrary Delays via Asynchronous Mini-Batching [25.95786289683455]
We show a method that automatically adapts to all quantiles of the delay distribution simultaneously, without any prior knowledge of the delays, achieving convergence rates of the form $O\!\left(\inf_q \tau_q^2/(qT)^2+\sigma/\sqrt{qT}\right)$ for non-convex and $O\!\left(\tau_q^2/(qT)^2+\sigma/\sqrt{qT}\right)$ for convex problems.
arXiv Detail & Related papers (2024-08-14T12:30:51Z) - Convergence Analysis of Decentralized ASGD [1.8710230264817358]
We present a novel convergence-rate analysis for decentralized asynchronous SGD (DASGD) that requires neither partial synchronization among nodes nor restrictive network topologies.
Our convergence proof holds for a fixed stepsize and any non-convex, homogeneous, $L$-smooth objective function.
arXiv Detail & Related papers (2023-09-07T14:50:31Z) - An Oblivious Stochastic Composite Optimization Algorithm for Eigenvalue
Optimization Problems [76.2042837251496]
We introduce two oblivious mirror descent algorithms based on a complementary composite setting.
Remarkably, both algorithms work without prior knowledge of the Lipschitz constant or smoothness of the objective function.
We show how to extend our framework to scale and demonstrate the efficiency and robustness of our methods on large scale semidefinite programs.
arXiv Detail & Related papers (2023-06-30T08:34:29Z) - ReSQueing Parallel and Private Stochastic Convex Optimization [59.53297063174519]
We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Stochastic Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density.
We develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings.
arXiv Detail & Related papers (2023-01-01T18:51:29Z) - DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization [0.0]
DADAO is the first decentralized, accelerated, asynchronous, primal, first-order algorithm to minimize a sum of $L$-smooth and $\mu$-strongly convex functions distributed over a given network of size $n$.
We show that our algorithm requires $\mathcal{O}\!\left(n\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\right)$ local gradients and only $\mathcal{O}\!\left(n\sqrt{\chi}\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\right)$ communications to reach precision $\epsilon$.
arXiv Detail & Related papers (2022-07-26T08:47:54Z) - Asynchronous Stochastic Optimization Robust to Arbitrary Delays [54.61797739710608]
We consider stochastic optimization with delayed gradients where, at each time step $t$, the algorithm makes an update using a stale gradient from step $t - d_t$ for an arbitrary delay $d_t$.
Our experiments demonstrate the efficacy and robustness of our algorithm in cases where the delay distribution is skewed or heavy-tailed.
arXiv Detail & Related papers (2021-06-22T15:50:45Z) - Accelerated Gradient Tracking over Time-varying Graphs for Decentralized Optimization [59.65871549878937]
We prove that the practical single-loop accelerated gradient tracking needs $O\!\left(\left(\frac{\gamma}{1-\sigma_{\gamma}}\right)^{2}\sqrt{\frac{L}{\epsilon}}\right)$ iterations.
Our convergence rates improve significantly over the ones of $O\!\left(\frac{1}{\epsilon^{5/7}}\right)$ and $O\!\left(\left(\frac{L}{\mu}\right)^{5/7}\frac{1}{(1-\sigma)^{1.5}}\log\frac{1}{\epsilon}\right)$.
arXiv Detail & Related papers (2021-04-06T15:34:14Z) - Differentially Quantized Gradient Methods [53.3186247068836]
We show that Differentially Quantized Gradient Descent (DQ-GD) attains a linear contraction factor of $\max\{\sigma_{\mathrm{GD}}, \rho_{n} 2^{-R}\}$.
No algorithm within a certain class can converge faster than $\max\{\sigma_{\mathrm{GD}}, 2^{-R}\}$.
arXiv Detail & Related papers (2020-02-06T20:40:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.