Communication-efficient SGD: From Local SGD to One-Shot Averaging
- URL: http://arxiv.org/abs/2106.04759v1
- Date: Wed, 9 Jun 2021 01:10:34 GMT
- Title: Communication-efficient SGD: From Local SGD to One-Shot Averaging
- Authors: Artin Spiridonoff, Alex Olshevsky and Ioannis Ch. Paschalidis
- Abstract summary: We consider speeding up stochastic gradient descent (SGD) by parallelizing it across multiple workers.
We suggest a Local SGD scheme that communicates less overall by communicating less frequently as the number of iterations grows.
- Score: 16.00658606157781
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider speeding up stochastic gradient descent (SGD) by parallelizing it
across multiple workers. We assume the same data set is shared among $N$
workers, who can take SGD steps and coordinate with a central server. While it
is possible to obtain a linear reduction in the variance by averaging all the
stochastic gradients at every step, this requires a lot of communication
between the workers and the server, which can dramatically reduce the gains
from parallelism. The Local SGD method, proposed and analyzed in the earlier
literature, suggests machines should make many local steps between such
communications. While the initial analysis of Local SGD showed it needs $\Omega
( \sqrt{T} )$ communications for $T$ local gradient steps in order for the
error to scale proportionately to $1/(NT)$, this has been successively improved
in a string of papers, with the state-of-the-art requiring $\Omega \left( N
\left( \mbox{ polynomial in log } (T) \right) \right)$ communications. In this
paper, we suggest a Local SGD scheme that communicates less overall by
communicating less frequently as the number of iterations grows. Our analysis
shows that this can achieve an error that scales as $1/(NT)$ with a number of
communications that is completely independent of $T$. In particular, we show
that $\Omega(N)$ communications are sufficient. Empirical evidence suggests
this bound is close to tight as we further show that $\sqrt{N}$ or $N^{3/4}$
communications fail to achieve linear speed-up in simulations. Moreover, we
show that under mild assumptions, the main of which is twice differentiability
on any neighborhood of the optimal solution, one-shot averaging which only uses
a single round of communication can also achieve the optimal convergence rate
asymptotically.
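The toy sketch below (Python/NumPy) illustrates the schemes discussed in the abstract: $N$ workers run local SGD steps on a shared objective and average their iterates only at a sparse, increasingly spaced set of communication times; passing a single final round recovers one-shot averaging. The quadratic objective, the $1/t$ step size, and the quadratically spaced schedule with roughly $N$ rounds are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def local_sgd(N=8, T=20_000, d=10, rounds=None, noise=1.0, seed=0):
    """Illustrative Local SGD on f(x) = 0.5 * ||x - x_star||^2 with noisy gradients.

    `rounds` lists the iteration indices at which the N workers average their
    iterates; rounds=[T] gives one-shot averaging.  The objective, step size,
    and schedule are toy assumptions for illustration only.
    """
    rng = np.random.default_rng(seed)
    x_star = rng.normal(size=d)                   # common minimizer shared by all workers
    x = np.tile(rng.normal(size=d), (N, 1))       # N local iterates, identical start
    if rounds is None:
        # Growing-gap schedule with ~N communications in total: the i-th
        # averaging step happens around iteration T * (i / N)^2, so gaps widen.
        rounds = sorted({int(np.ceil(T * (i / N) ** 2)) for i in range(1, N + 1)})
    comm = set(rounds)

    for t in range(1, T + 1):
        eta = 1.0 / (t + 10)                      # decaying step size ~ c / t
        grads = (x - x_star) + noise * rng.normal(size=x.shape)  # stochastic gradients
        x = x - eta * grads                       # one local SGD step per worker
        if t in comm:
            x[:] = x.mean(axis=0)                 # synchronize: average all iterates

    x_bar = x.mean(axis=0)                        # final average across workers
    return np.linalg.norm(x_bar - x_star) ** 2, len(rounds)

if __name__ == "__main__":
    for label, rounds in [("~N comms", None), ("one-shot", [20_000])]:
        err, R = local_sgd(rounds=rounds)
        print(f"{label}: {R} communication rounds, final error {err:.2e}")
```

Both variants end with a plain average of the worker iterates, so the one-shot scheme is simply the schedule with a single entry at the last iteration.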
Related papers
- The Effectiveness of Local Updates for Decentralized Learning under Data Heterogeneity [2.845817138242963]
We revisit two fundamental decentralized optimization methods, Decentralized Gradient Tracking (DGT) and Decentralized Gradient Descent (DGD)
We demonstrate that incorporating $K > 1$ local update steps can reduce communication complexity.
arXiv Detail & Related papers (2024-03-23T00:01:34Z) - A Quadratic Synchronization Rule for Distributed Deep Learning [66.68264684667562]
This work proposes a theory-grounded method for determining the synchronization period $H$, named the Quadratic Synchronization Rule (QSR)
Experiments on ResNet and ViT show that local gradient methods with QSR consistently improve the test accuracy over other synchronization strategies.
arXiv Detail & Related papers (2023-10-22T21:38:57Z) - Cooperative Multi-Agent Reinforcement Learning: Asynchronous
Communication and Linear Function Approximation [77.09836892653176]
We study multi-agent reinforcement learning in the setting of episodic Markov decision processes.
We propose a provably efficient value-based algorithm that enables asynchronous communication.
We show that a minimal $\Omega(dM)$ communication complexity is required to improve the performance through collaboration.
arXiv Detail & Related papers (2023-05-10T20:29:29Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O\big(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T}\big)$ with a communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization [0.0]
DADAO is the first decentralized, accelerated, asynchronous, primal, first-order algorithm to minimize a sum of $L$-smooth and $\mu$-strongly convex functions distributed over a given network of size $n$.
We show that our algorithm requires $\mathcal{O}(n\sqrt{\chi}\sqrt{\frac{L}{\mu}}\log(\frac{1}{\epsilon}))$ local gradient computations and only $\mathcal{O}(n\sqrt{\chi}\sqrt{\frac{L}{\mu}}\log(\frac{1}{\epsilon}))$ communications.
arXiv Detail & Related papers (2022-07-26T08:47:54Z) - Sharper Convergence Guarantees for Asynchronous SGD for Distributed and
Federated Learning [77.22019100456595]
We study a training algorithm for distributed computation in which workers communicate with varying frequency.
In this work, we obtain a tighter convergence rate of $\mathcal{O}\!\left(\sigma^2\epsilon^{-2} + \tau_{avg}\epsilon^{-1}\right)$.
We also show that the heterogeneity term in the rate is affected by the average delay within each worker.
arXiv Detail & Related papers (2022-06-16T17:10:57Z) - Distributed Saddle-Point Problems Under Similarity [173.19083235638104]
We show that a given suboptimality $\epsilon > 0$ is achieved over master/workers networks in $\Omega\big(\frac{\delta}{\mu}\log\frac{1}{\epsilon}\big)$ rounds of communication.
We then propose algorithms matching the lower bounds for either type of network (up to log factors)
We assess effectiveness of the proposed algorithms on a robust logistic regression problem.
arXiv Detail & Related papers (2021-07-22T14:25:16Z) - Accelerated Gradient Tracking over Time-varying Graphs for Decentralized Optimization [59.65871549878937]
We prove that the practical single loop accelerated gradient tracking needs $O\big(\big(\frac{\gamma}{1-\sigma_{\gamma}}\big)^2\sqrt{\frac{L}{\epsilon}}\big)$ iterations.
Our convergence rates improve significantly over the ones of $O\big(\frac{1}{\epsilon^{5/7}}\big)$ and $O\big(\big(\frac{L}{\mu}\big)^{5/7}\frac{1}{(1-\sigma)^{1.5}}\log\frac{1}{\epsilon}\big)$.
arXiv Detail & Related papers (2021-04-06T15:34:14Z) - Sparse Normal Means Estimation with Sublinear Communication [13.257075075199781]
We consider the problem of normal means estimation in a distributed setting with communication constraints.
We show that once the signal-to-noise ratio (SNR) is slightly higher, the support of $\mu$ can be correctly recovered with much less communication.
arXiv Detail & Related papers (2021-02-05T08:52:25Z) - Local SGD With a Communication Overhead Depending Only on the Number of
Workers [17.886554223172517]
We consider speeding up stochastic gradient descent (SGD) by parallelizing it across multiple workers.
We assume the same data set is shared among $n$ workers, who can take SGD steps and coordinate with a central server.
The Local SGD method, proposed and analyzed in the earlier literature, suggests machines should make many local steps between such communications.
arXiv Detail & Related papers (2020-06-03T23:33:02Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.