Communication-efficient SGD: From Local SGD to One-Shot Averaging
- URL: http://arxiv.org/abs/2106.04759v1
- Date: Wed, 9 Jun 2021 01:10:34 GMT
- Title: Communication-efficient SGD: From Local SGD to One-Shot Averaging
- Authors: Artin Spiridonoff, Alex Olshevsky and Ioannis Ch. Paschalidis
- Abstract summary: We consider speeding up stochastic gradient descent (SGD) by parallelizing it across multiple workers.
We suggest a Local SGD scheme that communicates less overall by communicating less frequently as the number of iterations grows.
- Score: 16.00658606157781
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider speeding up stochastic gradient descent (SGD) by parallelizing it
across multiple workers. We assume the same data set is shared among $N$
workers, who can take SGD steps and coordinate with a central server. While it
is possible to obtain a linear reduction in the variance by averaging all the
stochastic gradients at every step, this requires a lot of communication
between the workers and the server, which can dramatically reduce the gains
from parallelism. The Local SGD method, proposed and analyzed in the earlier
literature, suggests machines should make many local steps between such
communications. While the initial analysis of Local SGD showed it needs $\Omega
( \sqrt{T} )$ communications for $T$ local gradient steps in order for the
error to scale proportionately to $1/(NT)$, this has been successively improved
in a string of papers, with the state-of-the-art requiring $\Omega \left( N
\left( \mbox{ polynomial in log } (T) \right) \right)$ communications. In this
paper, we suggest a Local SGD scheme that communicates less overall by
communicating less frequently as the number of iterations grows. Our analysis
shows that this can achieve an error that scales as $1/(NT)$ with a number of
communications that is completely independent of $T$. In particular, we show
that $\Omega(N)$ communications are sufficient. Empirical evidence suggests
this bound is close to tight as we further show that $\sqrt{N}$ or $N^{3/4}$
communications fail to achieve linear speed-up in simulations. Moreover, we
show that under mild assumptions, the main of which is twice differentiability
on any neighborhood of the optimal solution, one-shot averaging which only uses
a single round of communication can also achieve the optimal convergence rate
asymptotically.
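The toy sketch below (Python/NumPy) illustrates the schemes discussed in the abstract: $N$ workers run local SGD steps on a shared objective and average their iterates only at a sparse, increasingly spaced set of communication times; passing a single final round recovers one-shot averaging. The quadratic objective, the $1/t$ step size, and the quadratically spaced schedule with roughly $N$ rounds are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def local_sgd(N=8, T=20_000, d=10, rounds=None, noise=1.0, seed=0):
    """Illustrative Local SGD on f(x) = 0.5 * ||x - x_star||^2 with noisy gradients.

    `rounds` lists the iteration indices at which the N workers average their
    iterates; rounds=[T] gives one-shot averaging.  The objective, step size,
    and schedule are toy assumptions for illustration only.
    """
    rng = np.random.default_rng(seed)
    x_star = rng.normal(size=d)                   # common minimizer shared by all workers
    x = np.tile(rng.normal(size=d), (N, 1))       # N local iterates, identical start
    if rounds is None:
        # Growing-gap schedule with ~N communications in total: the i-th
        # averaging step happens around iteration T * (i / N)^2, so gaps widen.
        rounds = sorted({int(np.ceil(T * (i / N) ** 2)) for i in range(1, N + 1)})
    comm = set(rounds)

    for t in range(1, T + 1):
        eta = 1.0 / (t + 10)                      # decaying step size ~ c / t
        grads = (x - x_star) + noise * rng.normal(size=x.shape)  # stochastic gradients
        x = x - eta * grads                       # one local SGD step per worker
        if t in comm:
            x[:] = x.mean(axis=0)                 # synchronize: average all iterates

    x_bar = x.mean(axis=0)                        # final average across workers
    return np.linalg.norm(x_bar - x_star) ** 2, len(rounds)

if __name__ == "__main__":
    for label, rounds in [("~N comms", None), ("one-shot", [20_000])]:
        err, R = local_sgd(rounds=rounds)
        print(f"{label}: {R} communication rounds, final error {err:.2e}")
```

Both variants end with a plain average of the worker iterates, so the one-shot scheme is simply the schedule with a single entry at the last iteration.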
Related papers
- The Effectiveness of Local Updates for Decentralized Learning under Data Heterogeneity [2.845817138242963]
We revisit two fundamental decentralized optimization methods, Decentralized Gradient Tracking (DGT) and Decentralized Gradient Descent (DGD)
We demonstrate that incorporating $K > 1$ local update steps can reduce communication complexity.
arXiv Detail & Related papers (2024-03-23T00:01:34Z) - A Quadratic Synchronization Rule for Distributed Deep Learning [66.68264684667562]
This work proposes a theory-grounded method for determining the synchronization period $H$, named the Quadratic Synchronization Rule (QSR)
Experiments on ResNet and ViT show that local gradient methods with QSR consistently improve the test accuracy over other synchronization strategies.
arXiv Detail & Related papers (2023-10-22T21:38:57Z) - Cooperative Multi-Agent Reinforcement Learning: Asynchronous
Communication and Linear Function Approximation [77.09836892653176]
We study multi-agent reinforcement learning in the setting of episodic Markov decision processes.
We propose a provably efficient value-based algorithm that enables asynchronous communication.
We show that a minimal $\Omega(dM)$ communication complexity is required to improve the performance through collaboration.
arXiv Detail & Related papers (2023-05-10T20:29:29Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O\big(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T}\big)$ with a communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization [0.0]
DADAO is the first decentralized, accelerated, asynchronous, primal, first-order algorithm to minimize a sum of $L$-smooth and $\mu$-strongly convex functions distributed over a given network of size $n$.
We show that our algorithm requires $\mathcal{O}(n\sqrt{\chi}\sqrt{\frac{L}{\mu}}\log(\frac{1}{\epsilon}))$ local gradient computations and only $\mathcal{O}(n\sqrt{\chi}\sqrt{\frac{L}{\mu}}\log(\frac{1}{\epsilon}))$ communications.
arXiv Detail & Related papers (2022-07-26T08:47:54Z) - Sharper Convergence Guarantees for Asynchronous SGD for Distributed and
Federated Learning [77.22019100456595]
We study a training algorithm for distributed computation in which workers communicate with varying frequency.
In this work, we obtain a tighter convergence rate of $\mathcal{O}\!\left(\sigma^2\epsilon^{-2} + \tau_{avg}\epsilon^{-1}\right)$.
We also show that the heterogeneity term in the rate is affected by the average delay within each worker.
arXiv Detail & Related papers (2022-06-16T17:10:57Z) - Distributed Saddle-Point Problems Under Similarity [173.19083235638104]
We show that a given suboptimality $\epsilon > 0$ is achieved over master/workers networks in $\Omega\big(\frac{\delta}{\mu}\log\frac{1}{\epsilon}\big)$ rounds of communication.
We then propose algorithms matching the lower bounds for either type of network (up to log factors)
We assess effectiveness of the proposed algorithms on a robust logistic regression problem.
arXiv Detail & Related papers (2021-07-22T14:25:16Z) - Accelerated Gradient Tracking over Time-varying Graphs for Decentralized Optimization [59.65871549878937]
We prove that the practical single loop accelerated gradient tracking needs $O\big(\big(\frac{\gamma}{1-\sigma_{\gamma}}\big)^2\sqrt{\frac{L}{\epsilon}}\big)$ iterations.
Our convergence rates improve significantly over the ones of $O\big(\frac{1}{\epsilon^{5/7}}\big)$ and $O\big(\big(\frac{L}{\mu}\big)^{5/7}\frac{1}{(1-\sigma)^{1.5}}\log\frac{1}{\epsilon}\big)$.
arXiv Detail & Related papers (2021-04-06T15:34:14Z) - Sparse Normal Means Estimation with Sublinear Communication [13.257075075199781]
We consider the problem of normal means estimation in a distributed setting with communication constraints.
We show that once the signal-to-noise ratio (SNR) is slightly higher, the support of $\mu$ can be correctly recovered with much less communication.
arXiv Detail & Related papers (2021-02-05T08:52:25Z) - Local SGD With a Communication Overhead Depending Only on the Number of
Workers [17.886554223172517]
We consider speeding up stochastic gradient descent (SGD) by parallelizing it across multiple workers.
We assume the same data set is shared among $n$ workers, who can take SGD steps and coordinate with a central server.
The Local SGD method, proposed and analyzed in the earlier literature, suggests machines should make many local steps between such communications.
arXiv Detail & Related papers (2020-06-03T23:33:02Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.