Variance Reduced Local SGD with Lower Communication Complexity
- URL: http://arxiv.org/abs/1912.12844v1
- Date: Mon, 30 Dec 2019 08:15:21 GMT
- Title: Variance Reduced Local SGD with Lower Communication Complexity
- Authors: Xianfeng Liang, Shuheng Shen, Jingchang Liu, Zhen Pan, Enhong Chen,
Yifei Cheng
- Abstract summary: We propose Variance Reduced Local SGD to further reduce the communication complexity.
VRL-SGD achieves a \emph{linear iteration speedup} with a lower communication complexity $O(T^{\frac{1}{2}} N^{\frac{3}{2}})$ even if workers access non-identical datasets.
- Score: 52.44473777232414
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To accelerate the training of machine learning models, distributed stochastic
gradient descent (SGD) and its variants have been widely adopted, which apply
multiple workers in parallel to speed up training. Among them, Local SGD has
gained much attention due to its lower communication cost. Nevertheless, when
the data distribution on workers is non-identical, Local SGD requires
$O(T^{\frac{3}{4}} N^{\frac{3}{4}})$ communications to maintain its
\emph{linear iteration speedup} property, where $T$ is the total number of
iterations and $N$ is the number of workers. In this paper, we propose Variance
Reduced Local SGD (VRL-SGD) to further reduce the communication complexity.
Benefiting from eliminating the dependency on the gradient variance among
workers, we theoretically prove that VRL-SGD achieves a \emph{linear iteration
speedup} with a lower communication complexity $O(T^{\frac{1}{2}}
N^{\frac{3}{2}})$ even if workers access non-identical datasets. We conduct
experiments on three machine learning tasks, and the experimental results
demonstrate that VRL-SGD performs impressively better than Local SGD when the
data among workers are quite diverse.
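The update pattern behind VRL-SGD can be pictured with a small simulation. The sketch below is a minimal, single-process rendering of Local SGD with a per-worker variance-reduction correction on toy quadratic objectives; the correction rule mirrors the high-level idea in the abstract (each worker offsets the gap between its local gradient and the global average), and all constants, objectives, and the exact update are illustrative assumptions rather than the paper's precise algorithm.

```python
import numpy as np

# Minimal sketch, not the paper's exact algorithm: Local SGD where each worker applies a
# variance-reduction correction so that heterogeneous (non-identical) local gradients do
# not pull the iterates apart between communications. All objectives are toy quadratics.
rng = np.random.default_rng(0)
N, d, K, rounds, lr = 4, 10, 5, 100, 0.02        # workers, dimension, local steps, sync rounds, step size
A = [rng.standard_normal((d, d)) for _ in range(N)]
A = [a.T @ a / d + np.eye(d) for a in A]          # per-worker curvature (positive definite)
b = [rng.standard_normal(d) for _ in range(N)]    # per-worker optima differ -> non-identical data

def grad(i, x):
    return A[i] @ x - b[i]                        # gradient of worker i's local objective

x_bar = np.zeros(d)                               # globally averaged model
c = [np.zeros(d) for _ in range(N)]               # per-worker correction (variance-reduction term)

for _ in range(rounds):
    ends = []
    for i in range(N):
        x = x_bar.copy()
        for _ in range(K):                        # K local steps between communications
            x -= lr * (grad(i, x) - c[i])         # corrected local gradient step
        ends.append(x)
    new_bar = np.mean(ends, axis=0)               # one communication: average the local models
    # Refresh each correction so it tracks (local gradient) - (average gradient).
    c = [c[i] + (new_bar - ends[i]) / (K * lr) for i in range(N)]
    x_bar = new_bar

print("distance to the optimum of the averaged objective:",
      np.linalg.norm(x_bar - np.linalg.solve(sum(A), sum(b))))
```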
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs)
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Sparse-SignSGD with Majority Vote for Communication-Efficient Distributed Learning [20.22227794319504]
${\sf S}^3$GD-MV is a communication-efficient distributed optimization algorithm.
We show that it converges at the same rate as signSGD while significantly reducing communication costs.
These findings highlight the potential of ${\sf S}^3$GD-MV as a promising solution for communication-efficient distributed optimization in deep learning.
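As a rough illustration of the mechanism summarized above, the sketch below combines top-k sparsification, sign quantization, and a coordinate-wise majority vote on a toy quadratic. The sparsifier, the vote rule, and all constants are assumptions made for illustration, not the paper's exact specification.

```python
import numpy as np

# Hedged sketch of the sparse-sign-with-majority-vote idea: each worker transmits only the
# signs of its k largest-magnitude gradient coordinates, and the server moves each
# coordinate in the direction that wins the vote. Toy objective: 0.5 * ||x - target||^2.
rng = np.random.default_rng(1)
N, d, k, lr, steps = 8, 100, 10, 0.01, 500
x = rng.standard_normal(d)
target = np.zeros(d)

for _ in range(steps):
    votes = np.zeros(d)
    for _ in range(N):
        g = (x - target) + 0.5 * rng.standard_normal(d)   # noisy local gradient
        idx = np.argsort(np.abs(g))[-k:]                  # keep only the top-k coordinates
        signs = np.zeros(d)
        signs[idx] = np.sign(g[idx])                      # k signs are all that gets communicated
        votes += signs
    x -= lr * np.sign(votes)                              # majority vote picks the update direction

print("distance to optimum after voting updates:", np.linalg.norm(x - target))
```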
arXiv Detail & Related papers (2023-02-15T05:36:41Z)
- Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (\emph{i.e.}, SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with the communication cost of $O(k \log(d))$ at each iteration.
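The flavor of this approach can be sketched as workers compressing their gradients before communication and the server running an AMSGrad-style adaptive update. In the sketch below, top-k compression is used purely as a simple stand-in for the paper's sketching operator, and the objective and constants are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch only: a distributed AMSGrad step driven by compressed gradients.
# Top-k compression stands in for the paper's sketch operator (which targets O(k log d)
# communication); the toy objective is 0.5 * ||x||^2 with additive gradient noise.
rng = np.random.default_rng(2)
N, d, k, lr, beta1, beta2, eps, steps = 4, 50, 5, 0.05, 0.9, 0.99, 1e-8, 300
x = rng.standard_normal(d)
m, v, v_hat = np.zeros(d), np.zeros(d), np.zeros(d)

def compress(g, k):
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]                                   # keep only the k largest entries
    return out

for _ in range(steps):
    # Each worker sends a compressed version of its noisy local gradient.
    g = np.mean([compress(x + 0.1 * rng.standard_normal(d), k) for _ in range(N)], axis=0)
    m = beta1 * m + (1 - beta1) * g                     # first moment
    v = beta2 * v + (1 - beta2) * g**2                  # second moment
    v_hat = np.maximum(v_hat, v)                        # AMSGrad: non-decreasing second moment
    x -= lr * m / (np.sqrt(v_hat) + eps)

print("||x|| after training:", np.linalg.norm(x))
```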
arXiv Detail & Related papers (2022-10-14T01:42:05Z)
- Offline Reinforcement Learning at Multiple Frequencies [62.08749079914275]
We study how well offline reinforcement learning algorithms can accommodate data with a mixture of frequencies during training.
We present a simple yet effective solution that enforces consistency in the rate of $Q$-value updates to stabilize learning.
arXiv Detail & Related papers (2022-07-26T17:54:49Z)
- Trade-offs of Local SGD at Scale: An Empirical Study [24.961068070560344]
We study a technique known as local SGD to reduce communication overhead.
We find that performing local SGD comes at a price: lower communication costs (and thereby faster training) are accompanied by lower accuracy.
We also show that incorporating the slow momentum framework consistently improves accuracy without requiring additional communication.
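The slow momentum idea mentioned above can be pictured as an outer momentum step applied to the movement produced by each Local SGD round. The formulation and constants below are one common way to write it and are assumptions made for illustration, not the exact scheme evaluated in the paper.

```python
import numpy as np

# Hedged sketch of a slow-momentum outer loop on top of Local SGD: the displacement of the
# averaged model in each communication round is treated as a pseudo-gradient and pushed
# through a momentum update. Constants and the toy quadratic objectives are illustrative.
rng = np.random.default_rng(3)
N, d, K, rounds = 4, 20, 10, 60
lr, slow_lr, slow_beta = 0.05, 1.0, 0.7
targets = [rng.standard_normal(d) for _ in range(N)]   # heterogeneous worker optima
x = np.zeros(d)                                        # globally synchronized model
u = np.zeros(d)                                        # slow momentum buffer

for _ in range(rounds):
    ends = []
    for i in range(N):
        y = x.copy()
        for _ in range(K):                             # inner Local SGD steps on worker i
            y -= lr * ((y - targets[i]) + 0.1 * rng.standard_normal(d))
        ends.append(y)
    x_avg = np.mean(ends, axis=0)                      # one communication per round
    u = slow_beta * u + (x - x_avg) / lr               # round displacement as a pseudo-gradient
    x = x - slow_lr * lr * u                           # outer (slow) momentum step

print("distance to the average optimum:", np.linalg.norm(x - np.mean(targets, axis=0)))
```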
arXiv Detail & Related papers (2021-10-15T15:00:42Z)
- Communication-efficient SGD: From Local SGD to One-Shot Averaging [16.00658606157781]
We consider speeding up gradient descent (SGD) by parallelizing it across multiple workers.
We suggest a Local SGD scheme that communicates less overall by communicating less frequently as the number of iterations grows.
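One way to picture "communicating less frequently as the number of iterations grows" is a schedule whose gap between averaging steps increases over time. The linear-growth rule below is purely an illustrative assumption, not the schedule analyzed in the paper; it yields on the order of sqrt(T) communications for T iterations.

```python
# Illustrative schedule only (not the paper's exact rule): let the i-th averaging step come
# after i more local iterations, so T total iterations need roughly sqrt(2*T) communications.
T = 10_000
iterations, gap, communications = 0, 1, 0
while iterations < T:
    iterations += gap      # run `gap` local SGD steps on every worker ...
    communications += 1    # ... then average the models once
    gap += 1               # and wait longer before the next averaging step
print("iterations:", iterations, "communications:", communications)  # ~141 for T = 10,000
```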
arXiv Detail & Related papers (2021-06-09T01:10:34Z)
- Why Does Multi-Epoch Training Help? [62.946840431501855]
Empirically, it has been observed that taking more than one pass over the training data (multi-pass SGD) achieves a much better excess risk bound than SGD taking only one pass over the training data (one-pass SGD).
In this paper, we provide some theoretical evidence explaining why multiple passes over the training data can help improve performance under certain circumstances.
arXiv Detail & Related papers (2021-05-13T00:52:25Z)
- Gradient Coding with Dynamic Clustering for Straggler-Tolerant Distributed Learning [55.052517095437]
Gradient descent (GD) is widely employed to parallelize the learning task by distributing the dataset across multiple workers.
A significant performance bottleneck for the per-iteration completion time in distributed synchronous GD is \emph{straggling} workers.
Coded distributed techniques have been introduced recently to mitigate stragglers and to speed up GD iterations by assigning redundant computations to workers.
We propose a novel dynamic GC scheme, which assigns redundant data to workers to acquire the flexibility to choose from among a set of possible codes depending on the past straggling behavior.
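As background for the dynamic scheme proposed above, the sketch below shows the simplest static gradient-coding assignment (fractional repetition): partitions are replicated within worker groups so that any single responder per group suffices. The dynamic-clustering part, which adapts the grouping to past straggling behavior, is not implemented here, and all sizes and gradients are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of static fractional-repetition gradient coding, the kind of redundant
# assignment the dynamic-clustering scheme builds on: workers in a group hold identical
# partitions, so the server needs only the first reply from each group and can ignore
# that group's stragglers.
rng = np.random.default_rng(5)
d, n_groups, group_size, parts_per_group = 10, 3, 2, 2
part_grads = [rng.standard_normal(d) for _ in range(n_groups * parts_per_group)]
full_grad = np.sum(part_grads, axis=0)                 # what one GD iteration needs

# Partition assignment: group g redundantly holds partitions [g*p, ..., (g+1)*p - 1].
group_parts = [range(g * parts_per_group, (g + 1) * parts_per_group) for g in range(n_groups)]
# Every worker in a group computes the same partial gradient over its group's partitions.
worker_partial = [[np.sum([part_grads[p] for p in group_parts[g]], axis=0)
                   for _ in range(group_size)] for g in range(n_groups)]

# One iteration with random stragglers: the server keeps only the first reply per group.
recovered = np.zeros(d)
for g in range(n_groups):
    first = int(np.argmin(rng.exponential(size=group_size)))   # fastest worker in group g
    recovered += worker_partial[g][first]

print("full gradient recovered despite stragglers:", np.allclose(recovered, full_grad))
```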
arXiv Detail & Related papers (2021-03-01T18:51:29Z)
- STL-SGD: Speeding Up Local SGD with Stagewise Communication Period [19.691927007250417]
Local stochastic gradient descent (Local SGD) has attracted significant attention due to its low communication complexity.
STL-SGD can keep the same convergence rate and linear speedup as mini-batch SGD.
Experiments on both convex and non-convex problems demonstrate the superior performance of STL-SGD.
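The stagewise communication period in the title can be pictured with a simple schedule: within a stage the period is fixed, and between stages the step size shrinks while the period grows. The halving/doubling rule and stage lengths below are illustrative assumptions, not the schedule analyzed in the paper.

```python
# Illustrative stagewise schedule (assumed constants, not the paper's exact rule): the
# step size is cut and the communication period is enlarged from one stage to the next,
# so later stages spend far fewer communications per iteration.
lr, period, stages, iters_per_stage = 0.1, 1, 5, 1000
total_comms = 0
for s in range(stages):
    comms = iters_per_stage // period
    total_comms += comms
    print(f"stage {s}: lr={lr:.4f}, local steps between averaging={period}, communications={comms}")
    lr /= 2        # decay the step size stagewise
    period *= 2    # and communicate less often as the step size shrinks
print("total iterations:", stages * iters_per_stage, "total communications:", total_comms)
```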
arXiv Detail & Related papers (2020-06-11T12:48:17Z)
- DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging [4.652668321425679]
The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagations to wait for gradients to be aggregated across workers.
DaSGD parallelizes SGD and forward/back propagations to hide 100% of the communication overhead.
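A hedged sketch of the delayed-averaging idea: each step launches a non-blocking model average whose result is applied a few steps later, so communication can overlap with local forward/backward computation. The correction rule, the delay handling, and the constants below are illustrative assumptions rather than DaSGD's exact update.

```python
import numpy as np

# Illustrative simulation of delayed averaging (assumed update rule, toy objectives): each
# iteration launches an average of the current worker models, and the result is consumed
# `tau` iterations later by swapping each worker's stale local component for the stale
# average. This lets averaging traffic overlap with ongoing local computation.
rng = np.random.default_rng(6)
N, d, tau, lr, steps = 4, 20, 2, 0.05, 200
targets = [rng.standard_normal(d) for _ in range(N)]   # heterogeneous worker optima
x = [np.zeros(d) for _ in range(N)]
in_flight = []                                         # pending (stale average, stale local copies)

for t in range(steps):
    # Start this step's (non-blocking) averaging.
    in_flight.append((np.mean(x, axis=0), [xi.copy() for xi in x]))
    # Local computation continues without waiting for the communication to finish.
    for i in range(N):
        g = (x[i] - targets[i]) + 0.1 * rng.standard_normal(d)
        x[i] = x[i] - lr * g
    # Consume the averaging result launched `tau` steps ago, once it has "arrived".
    if len(in_flight) > tau:
        avg_old, x_old = in_flight.pop(0)
        for i in range(N):
            x[i] = x[i] + (avg_old - x_old[i])         # delayed correction toward the average

print("max worker disagreement:", max(np.linalg.norm(xi - x[0]) for xi in x))
print("distance to the average optimum:", np.linalg.norm(np.mean(x, axis=0) - np.mean(targets, axis=0)))
```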
arXiv Detail & Related papers (2020-05-31T05:43:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.