Variance Reduced Local SGD with Lower Communication Complexity
- URL: http://arxiv.org/abs/1912.12844v1
- Date: Mon, 30 Dec 2019 08:15:21 GMT
- Title: Variance Reduced Local SGD with Lower Communication Complexity
- Authors: Xianfeng Liang, Shuheng Shen, Jingchang Liu, Zhen Pan, Enhong Chen,
Yifei Cheng
- Abstract summary: We propose Variance Reduced Local SGD to further reduce the communication complexity.
VRL-SGD achieves a \emph{linear iteration speedup} with a lower communication complexity $O(T^{\frac{1}{2}} N^{\frac{3}{2}})$ even if workers access non-identical datasets.
- Score: 52.44473777232414
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To accelerate the training of machine learning models, distributed stochastic
gradient descent (SGD) and its variants have been widely adopted, which apply
multiple workers in parallel to speed up training. Among them, Local SGD has
gained much attention due to its lower communication cost. Nevertheless, when
the data distribution on workers is non-identical, Local SGD requires
$O(T^{\frac{3}{4}} N^{\frac{3}{4}})$ communications to maintain its
\emph{linear iteration speedup} property, where $T$ is the total number of
iterations and $N$ is the number of workers. In this paper, we propose Variance
Reduced Local SGD (VRL-SGD) to further reduce the communication complexity.
Benefiting from eliminating the dependency on the gradient variance among
workers, we theoretically prove that VRL-SGD achieves a \emph{linear iteration
speedup} with a lower communication complexity $O(T^{\frac{1}{2}}
N^{\frac{3}{2}})$ even if workers access non-identical datasets. We conduct
experiments on three machine learning tasks, and the experimental results
demonstrate that VRL-SGD performs impressively better than Local SGD when the
data among workers are quite diverse.
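The update pattern behind VRL-SGD can be pictured with a small simulation. The sketch below is a minimal, single-process rendering of Local SGD with a per-worker variance-reduction correction on toy quadratic objectives; the correction rule mirrors the high-level idea in the abstract (each worker offsets the gap between its local gradient and the global average), and all constants, objectives, and the exact update are illustrative assumptions rather than the paper's precise algorithm.

```python
import numpy as np

# Minimal sketch, not the paper's exact algorithm: Local SGD where each worker applies a
# variance-reduction correction so that heterogeneous (non-identical) local gradients do
# not pull the iterates apart between communications. All objectives are toy quadratics.
rng = np.random.default_rng(0)
N, d, K, rounds, lr = 4, 10, 5, 100, 0.02        # workers, dimension, local steps, sync rounds, step size
A = [rng.standard_normal((d, d)) for _ in range(N)]
A = [a.T @ a / d + np.eye(d) for a in A]          # per-worker curvature (positive definite)
b = [rng.standard_normal(d) for _ in range(N)]    # per-worker optima differ -> non-identical data

def grad(i, x):
    return A[i] @ x - b[i]                        # gradient of worker i's local objective

x_bar = np.zeros(d)                               # globally averaged model
c = [np.zeros(d) for _ in range(N)]               # per-worker correction (variance-reduction term)

for _ in range(rounds):
    ends = []
    for i in range(N):
        x = x_bar.copy()
        for _ in range(K):                        # K local steps between communications
            x -= lr * (grad(i, x) - c[i])         # corrected local gradient step
        ends.append(x)
    new_bar = np.mean(ends, axis=0)               # one communication: average the local models
    # Refresh each correction so it tracks (local gradient) - (average gradient).
    c = [c[i] + (new_bar - ends[i]) / (K * lr) for i in range(N)]
    x_bar = new_bar

print("distance to the optimum of the averaged objective:",
      np.linalg.norm(x_bar - np.linalg.solve(sum(A), sum(b))))
```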
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs)
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Sparse-SignSGD with Majority Vote for Communication-Efficient Distributed Learning [20.22227794319504]
${\sf S}^3$GD-MV is a communication-efficient distributed optimization algorithm.
We show that it converges at the same rate as signSGD while significantly reducing communication costs.
These findings highlight the potential of ${\sf S}^3$GD-MV as a promising solution for communication-efficient distributed optimization in deep learning.
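As a rough illustration of the mechanism summarized above, the sketch below combines top-k sparsification, sign quantization, and a coordinate-wise majority vote on a toy quadratic. The sparsifier, the vote rule, and all constants are assumptions made for illustration, not the paper's exact specification.

```python
import numpy as np

# Hedged sketch of the sparse-sign-with-majority-vote idea: each worker transmits only the
# signs of its k largest-magnitude gradient coordinates, and the server moves each
# coordinate in the direction that wins the vote. Toy objective: 0.5 * ||x - target||^2.
rng = np.random.default_rng(1)
N, d, k, lr, steps = 8, 100, 10, 0.01, 500
x = rng.standard_normal(d)
target = np.zeros(d)

for _ in range(steps):
    votes = np.zeros(d)
    for _ in range(N):
        g = (x - target) + 0.5 * rng.standard_normal(d)   # noisy local gradient
        idx = np.argsort(np.abs(g))[-k:]                  # keep only the top-k coordinates
        signs = np.zeros(d)
        signs[idx] = np.sign(g[idx])                      # k signs are all that gets communicated
        votes += signs
    x -= lr * np.sign(votes)                              # majority vote picks the update direction

print("distance to optimum after voting updates:", np.linalg.norm(x - target))
```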
arXiv Detail & Related papers (2023-02-15T05:36:41Z)
- Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (\emph{i.e.}, SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with the communication cost of $O(k \log(d))$ at each iteration.
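The flavor of this approach can be sketched as workers compressing their gradients before communication and the server running an AMSGrad-style adaptive update. In the sketch below, top-k compression is used purely as a simple stand-in for the paper's sketching operator, and the objective and constants are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch only: a distributed AMSGrad step driven by compressed gradients.
# Top-k compression stands in for the paper's sketch operator (which targets O(k log d)
# communication); the toy objective is 0.5 * ||x||^2 with additive gradient noise.
rng = np.random.default_rng(2)
N, d, k, lr, beta1, beta2, eps, steps = 4, 50, 5, 0.05, 0.9, 0.99, 1e-8, 300
x = rng.standard_normal(d)
m, v, v_hat = np.zeros(d), np.zeros(d), np.zeros(d)

def compress(g, k):
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]                                   # keep only the k largest entries
    return out

for _ in range(steps):
    # Each worker sends a compressed version of its noisy local gradient.
    g = np.mean([compress(x + 0.1 * rng.standard_normal(d), k) for _ in range(N)], axis=0)
    m = beta1 * m + (1 - beta1) * g                     # first moment
    v = beta2 * v + (1 - beta2) * g**2                  # second moment
    v_hat = np.maximum(v_hat, v)                        # AMSGrad: non-decreasing second moment
    x -= lr * m / (np.sqrt(v_hat) + eps)

print("||x|| after training:", np.linalg.norm(x))
```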
arXiv Detail & Related papers (2022-10-14T01:42:05Z)
- Offline Reinforcement Learning at Multiple Frequencies [62.08749079914275]
We study how well offline reinforcement learning algorithms can accommodate data with a mixture of frequencies during training.
We present a simple yet effective solution that enforces consistency in the rate of $Q$-value updates to stabilize learning.
arXiv Detail & Related papers (2022-07-26T17:54:49Z)
- Trade-offs of Local SGD at Scale: An Empirical Study [24.961068070560344]
We study a technique known as local SGD to reduce communication overhead.
We find that performing local SGD comes at a price: lower communication costs (and thereby faster training) are accompanied by lower accuracy.
We also show that incorporating the slow momentum framework consistently improves accuracy without requiring additional communication.
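The slow momentum idea mentioned above can be pictured as an outer momentum step applied to the movement produced by each Local SGD round. The formulation and constants below are one common way to write it and are assumptions made for illustration, not the exact scheme evaluated in the paper.

```python
import numpy as np

# Hedged sketch of a slow-momentum outer loop on top of Local SGD: the displacement of the
# averaged model in each communication round is treated as a pseudo-gradient and pushed
# through a momentum update. Constants and the toy quadratic objectives are illustrative.
rng = np.random.default_rng(3)
N, d, K, rounds = 4, 20, 10, 60
lr, slow_lr, slow_beta = 0.05, 1.0, 0.7
targets = [rng.standard_normal(d) for _ in range(N)]   # heterogeneous worker optima
x = np.zeros(d)                                        # globally synchronized model
u = np.zeros(d)                                        # slow momentum buffer

for _ in range(rounds):
    ends = []
    for i in range(N):
        y = x.copy()
        for _ in range(K):                             # inner Local SGD steps on worker i
            y -= lr * ((y - targets[i]) + 0.1 * rng.standard_normal(d))
        ends.append(y)
    x_avg = np.mean(ends, axis=0)                      # one communication per round
    u = slow_beta * u + (x - x_avg) / lr               # round displacement as a pseudo-gradient
    x = x - slow_lr * lr * u                           # outer (slow) momentum step

print("distance to the average optimum:", np.linalg.norm(x - np.mean(targets, axis=0)))
```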
arXiv Detail & Related papers (2021-10-15T15:00:42Z)
- Communication-efficient SGD: From Local SGD to One-Shot Averaging [16.00658606157781]
We consider speeding up gradient descent (SGD) by parallelizing it across multiple workers.
We suggest a Local SGD scheme that communicates less overall by communicating less frequently as the number of iterations grows.
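One way to picture "communicating less frequently as the number of iterations grows" is a schedule whose gap between averaging steps increases over time. The linear-growth rule below is purely an illustrative assumption, not the schedule analyzed in the paper; it yields on the order of sqrt(T) communications for T iterations.

```python
# Illustrative schedule only (not the paper's exact rule): let the i-th averaging step come
# after i more local iterations, so T total iterations need roughly sqrt(2*T) communications.
T = 10_000
iterations, gap, communications = 0, 1, 0
while iterations < T:
    iterations += gap      # run `gap` local SGD steps on every worker ...
    communications += 1    # ... then average the models once
    gap += 1               # and wait longer before the next averaging step
print("iterations:", iterations, "communications:", communications)  # ~141 for T = 10,000
```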
arXiv Detail & Related papers (2021-06-09T01:10:34Z)
- Why Does Multi-Epoch Training Help? [62.946840431501855]
Empirically, it has been observed that taking more than one pass over the training data (multi-pass SGD) achieves a much better excess risk bound than SGD taking only one pass over the training data (one-pass SGD).
In this paper, we provide some theoretical evidence explaining why multiple passes over the training data can help improve performance under certain circumstances.
arXiv Detail & Related papers (2021-05-13T00:52:25Z)
- Gradient Coding with Dynamic Clustering for Straggler-Tolerant Distributed Learning [55.052517095437]
Gradient descent (GD) is widely employed to parallelize the learning task by distributing the dataset across multiple workers.
A significant performance bottleneck for the per-iteration completion time in distributed synchronous GD is \emph{straggling} workers.
Coded distributed techniques have been introduced recently to mitigate stragglers and to speed up GD iterations by assigning redundant computations to workers.
We propose a novel dynamic GC scheme, which assigns redundant data to workers to acquire the flexibility to choose from among a set of possible codes depending on the past straggling behavior.
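As background for the dynamic scheme proposed above, the sketch below shows the simplest static gradient-coding assignment (fractional repetition): partitions are replicated within worker groups so that any single responder per group suffices. The dynamic-clustering part, which adapts the grouping to past straggling behavior, is not implemented here, and all sizes and gradients are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of static fractional-repetition gradient coding, the kind of redundant
# assignment the dynamic-clustering scheme builds on: workers in a group hold identical
# partitions, so the server needs only the first reply from each group and can ignore
# that group's stragglers.
rng = np.random.default_rng(5)
d, n_groups, group_size, parts_per_group = 10, 3, 2, 2
part_grads = [rng.standard_normal(d) for _ in range(n_groups * parts_per_group)]
full_grad = np.sum(part_grads, axis=0)                 # what one GD iteration needs

# Partition assignment: group g redundantly holds partitions [g*p, ..., (g+1)*p - 1].
group_parts = [range(g * parts_per_group, (g + 1) * parts_per_group) for g in range(n_groups)]
# Every worker in a group computes the same partial gradient over its group's partitions.
worker_partial = [[np.sum([part_grads[p] for p in group_parts[g]], axis=0)
                   for _ in range(group_size)] for g in range(n_groups)]

# One iteration with random stragglers: the server keeps only the first reply per group.
recovered = np.zeros(d)
for g in range(n_groups):
    first = int(np.argmin(rng.exponential(size=group_size)))   # fastest worker in group g
    recovered += worker_partial[g][first]

print("full gradient recovered despite stragglers:", np.allclose(recovered, full_grad))
```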
arXiv Detail & Related papers (2021-03-01T18:51:29Z)
- STL-SGD: Speeding Up Local SGD with Stagewise Communication Period [19.691927007250417]
Local stochastic gradient descent (Local SGD) has attracted significant attention due to its low communication complexity.
STL-SGD can keep the same convergence rate and linear speedup as mini-batch SGD.
Experiments on both convex and non-convex problems demonstrate the superior performance of STL-SGD.
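The stagewise communication period in the title can be pictured with a simple schedule: within a stage the period is fixed, and between stages the step size shrinks while the period grows. The halving/doubling rule and stage lengths below are illustrative assumptions, not the schedule analyzed in the paper.

```python
# Illustrative stagewise schedule (assumed constants, not the paper's exact rule): the
# step size is cut and the communication period is enlarged from one stage to the next,
# so later stages spend far fewer communications per iteration.
lr, period, stages, iters_per_stage = 0.1, 1, 5, 1000
total_comms = 0
for s in range(stages):
    comms = iters_per_stage // period
    total_comms += comms
    print(f"stage {s}: lr={lr:.4f}, local steps between averaging={period}, communications={comms}")
    lr /= 2        # decay the step size stagewise
    period *= 2    # and communicate less often as the step size shrinks
print("total iterations:", stages * iters_per_stage, "total communications:", total_comms)
```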
arXiv Detail & Related papers (2020-06-11T12:48:17Z)
- DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging [4.652668321425679]
The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagations to wait for gradients to be aggregated across workers.
DaSGD parallelizes SGD and forward/back propagations to hide 100% of the communication overhead.
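A hedged sketch of the delayed-averaging idea: each step launches a non-blocking model average whose result is applied a few steps later, so communication can overlap with local forward/backward computation. The correction rule, the delay handling, and the constants below are illustrative assumptions rather than DaSGD's exact update.

```python
import numpy as np

# Illustrative simulation of delayed averaging (assumed update rule, toy objectives): each
# iteration launches an average of the current worker models, and the result is consumed
# `tau` iterations later by swapping each worker's stale local component for the stale
# average. This lets averaging traffic overlap with ongoing local computation.
rng = np.random.default_rng(6)
N, d, tau, lr, steps = 4, 20, 2, 0.05, 200
targets = [rng.standard_normal(d) for _ in range(N)]   # heterogeneous worker optima
x = [np.zeros(d) for _ in range(N)]
in_flight = []                                         # pending (stale average, stale local copies)

for t in range(steps):
    # Start this step's (non-blocking) averaging.
    in_flight.append((np.mean(x, axis=0), [xi.copy() for xi in x]))
    # Local computation continues without waiting for the communication to finish.
    for i in range(N):
        g = (x[i] - targets[i]) + 0.1 * rng.standard_normal(d)
        x[i] = x[i] - lr * g
    # Consume the averaging result launched `tau` steps ago, once it has "arrived".
    if len(in_flight) > tau:
        avg_old, x_old = in_flight.pop(0)
        for i in range(N):
            x[i] = x[i] + (avg_old - x_old[i])         # delayed correction toward the average

print("max worker disagreement:", max(np.linalg.norm(xi - x[0]) for xi in x))
print("distance to the average optimum:", np.linalg.norm(np.mean(x, axis=0) - np.mean(targets, axis=0)))
```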
arXiv Detail & Related papers (2020-05-31T05:43:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.