Sharper Convergence Guarantees for Asynchronous SGD for Distributed and
Federated Learning
- URL: http://arxiv.org/abs/2206.08307v1
- Date: Thu, 16 Jun 2022 17:10:57 GMT
- Title: Sharper Convergence Guarantees for Asynchronous SGD for Distributed and
Federated Learning
- Authors: Anastasia Koloskova, Sebastian U. Stich, Martin Jaggi
- Abstract summary: We study asynchronous SGD for distributed training over workers with varying computation and communication frequency.
In this work, we obtain a tighter convergence rate of $\mathcal{O}\!\left(\sigma^2\epsilon^{-2}+ \sqrt{\tau_{\max}\tau_{avg}}\epsilon^{-1}\right)$.
We also show that the heterogeneity term in the convergence rate is affected only by the average delay within each worker.
- Score: 77.22019100456595
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the asynchronous stochastic gradient descent algorithm for
distributed training over $n$ workers which have varying computation and
communication frequency over time. In this algorithm, workers compute
stochastic gradients in parallel at their own pace and return those to the
server without any synchronization. Existing convergence rates of this
algorithm for non-convex smooth objectives depend on the maximum gradient delay
$\tau_{\max}$ and show that an $\epsilon$-stationary point is reached after
$\mathcal{O}\!\left(\sigma^2\epsilon^{-2}+ \tau_{\max}\epsilon^{-1}\right)$
iterations, where $\sigma$ denotes the variance of stochastic gradients.
In this work (i) we obtain a tighter convergence rate of
$\mathcal{O}\!\left(\sigma^2\epsilon^{-2}+
\sqrt{\tau_{\max}\tau_{avg}}\epsilon^{-1}\right)$ without any change in the
algorithm where $\tau_{avg}$ is the average delay, which can be significantly
smaller than $\tau_{\max}$. We also provide (ii) a simple delay-adaptive
learning rate scheme, under which asynchronous SGD achieves a convergence rate
of $\mathcal{O}\!\left(\sigma^2\epsilon^{-2}+ \tau_{avg}\epsilon^{-1}\right)$,
and does not require any extra hyperparameter tuning nor extra communications.
Our result allows us to show, for the first time, that asynchronous SGD is always
faster than mini-batch SGD. In addition, (iii) we consider the case of
heterogeneous functions motivated by federated learning applications and
improve the convergence rate by proving a weaker dependence on the maximum
delay compared to prior works. In particular, we show that the heterogeneity
term in convergence rate is only affected by the average delay within each
worker.
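To make the setting concrete, the sketch below simulates the asynchronous server loop described in the abstract: each worker computes a stochastic gradient at a stale copy of the iterate, and the server applies it immediately on arrival. The delay-adaptive rule `lr = base_lr / max(1, delay)` is an illustrative assumption standing in for the scheme of item (ii), not the paper's exact step size; the quadratic objective, constants, and random worker model are likewise hypothetical. Note also why (i) helps: since $\tau_{avg} \le \tau_{\max}$, we always have $\sqrt{\tau_{\max}\tau_{avg}} \le \tau_{\max}$, with a large gap when a single straggler inflates $\tau_{\max}$.

```python
import numpy as np

# Illustrative simulation of asynchronous SGD with a delay-adaptive step size.
# The rule eta_t = base_lr / max(1, delay) is a hedged stand-in for the paper's
# delay-adaptive scheme; the exact rule and constants in the paper differ.

rng = np.random.default_rng(0)
n_workers, dim, T = 4, 10, 2000
base_lr, noise_std = 0.1, 0.5

x = rng.normal(size=dim)                  # shared server iterate
# Each worker stores the iterate it last read and the step it read it at.
snapshots = [(x.copy(), 0) for _ in range(n_workers)]

def grad(z):
    return 2 * z                          # gradient of the toy objective f(z) = ||z||^2

for t in range(1, T + 1):
    w = rng.integers(n_workers)           # a random worker finishes next (no synchronization)
    x_old, t_read = snapshots[w]
    delay = t - t_read                    # gradient staleness tau_t
    g = grad(x_old) + noise_std * rng.normal(size=dim)
    lr = base_lr / max(1, delay)          # delay-adaptive step size (illustrative)
    x -= lr * g                           # server applies the stale gradient on arrival
    snapshots[w] = (x.copy(), t)          # worker reads the fresh iterate and restarts

print(f"final ||x||^2 = {x @ x:.4f}")
```

Scaling the step size inversely with the observed staleness damps updates from stragglers while letting fast workers take near-full steps, which is the intuition behind trading the $\tau_{\max}$ dependence for $\tau_{avg}$.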
Related papers
- Faster Stochastic Optimization with Arbitrary Delays via Asynchronous Mini-Batching [25.95786289683455]
We show a method that automatically adapts to all quantiles of the delay distribution simultaneously, without any prior knowledge of the delays, achieving convergence rates of the form $O\!\left(\inf_q \tau_q^2/(qT)^2+\sigma/\sqrt{qT}\right)$ for non-convex and $O\!\left(\tau_q^2/(qT)^2+\sigma/\sqrt{qT}\right)$ for convex problems.
arXiv Detail & Related papers (2024-08-14T12:30:51Z) - Convergence Analysis of Decentralized ASGD [1.8710230264817358]
We present a novel convergence-rate analysis for decentralized asynchronous SGD (DASGD) that requires neither partial synchronization among nodes nor restrictive network topologies.
Our convergence proof holds for a fixed stepsize and any non-convex, homogeneous, $L$-smooth objective function.
arXiv Detail & Related papers (2023-09-07T14:50:31Z) - An Oblivious Stochastic Composite Optimization Algorithm for Eigenvalue
Optimization Problems [76.2042837251496]
We introduce two oblivious mirror descent algorithms based on a complementary composite setting.
Remarkably, both algorithms work without prior knowledge of the Lipschitz constant or smoothness of the objective function.
We show how to extend our framework to scale and demonstrate the efficiency and robustness of our methods on large scale semidefinite programs.
arXiv Detail & Related papers (2023-06-30T08:34:29Z) - ReSQueing Parallel and Private Stochastic Convex Optimization [59.53297063174519]
We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Stochastic Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density.
We develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings.
arXiv Detail & Related papers (2023-01-01T18:51:29Z) - DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization [0.0]
DADAO is the first decentralized, accelerated, asynchronous, primal, first-order algorithm to minimize a sum of $L$-smooth and $\mu$-strongly convex functions distributed over a given network of size $n$.
We show that our algorithm requires $\mathcal{O}\!\left(n\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\right)$ local gradients and only $\mathcal{O}\!\left(n\sqrt{\chi}\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\right)$ communications to reach precision $\epsilon$.
arXiv Detail & Related papers (2022-07-26T08:47:54Z) - Asynchronous Stochastic Optimization Robust to Arbitrary Delays [54.61797739710608]
We consider stochastic optimization with delayed gradients where, at each time step $t$, the algorithm makes an update using a stale gradient from step $t - d_t$ for an arbitrary delay $d_t$.
Our experiments demonstrate the efficacy and robustness of our algorithm in cases where the delay distribution is skewed or heavy-tailed.
arXiv Detail & Related papers (2021-06-22T15:50:45Z) - Accelerated Gradient Tracking over Time-varying Graphs for Decentralized Optimization [59.65871549878937]
We prove that the practical single-loop accelerated gradient tracking needs $O\!\left(\left(\frac{\gamma}{1-\sigma_{\gamma}}\right)^{2}\sqrt{\frac{L}{\epsilon}}\right)$ iterations.
Our convergence rates improve significantly over the ones of $O\!\left(\frac{1}{\epsilon^{5/7}}\right)$ and $O\!\left(\left(\frac{L}{\mu}\right)^{5/7}\frac{1}{(1-\sigma)^{1.5}}\log\frac{1}{\epsilon}\right)$.
arXiv Detail & Related papers (2021-04-06T15:34:14Z) - Differentially Quantized Gradient Methods [53.3186247068836]
We show that Differentially Quantized Gradient Descent (DQ-GD) attains a linear contraction factor of $\max\{\sigma_{\mathrm{GD}}, \rho_{n} 2^{-R}\}$.
No algorithm within a certain class can converge faster than $\max\{\sigma_{\mathrm{GD}}, 2^{-R}\}$.
arXiv Detail & Related papers (2020-02-06T20:40:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.