Adaptive Periodic Averaging: A Practical Approach to Reducing
Communication in Distributed Learning
- URL: http://arxiv.org/abs/2007.06134v2
- Date: Tue, 19 Jan 2021 15:45:04 GMT
- Title: Adaptive Periodic Averaging: A Practical Approach to Reducing
Communication in Distributed Learning
- Authors: Peng Jiang, Gagan Agrawal
- Abstract summary: We show that the optimal averaging period in terms of convergence and communication cost is not a constant, but instead varies over the course of the execution.
We propose a practical algorithm, named ADaptive Periodic parameter averaging SGD (ADPSGD), to achieve a smaller overall variance of model parameters.
- Score: 6.370766463380455
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Stochastic Gradient Descent (SGD) is the key learning algorithm for many
machine learning tasks. Because of its computational costs, there is a growing
interest in accelerating SGD on HPC resources like GPU clusters. However, the
performance of parallel SGD is still bottlenecked by the high communication
costs even with a fast connection among the machines. A simple approach to
alleviating this problem, used in many existing efforts, is to perform
communication every few iterations, using a constant averaging period. In this
paper, we show that the optimal averaging period in terms of convergence and
communication cost is not a constant, but instead varies over the course of the
execution. Specifically, we observe that reducing the variance of model
parameters among the computing nodes is critical to the convergence of periodic
parameter averaging SGD. Given a fixed communication budget, we show that it is
more beneficial to synchronize more frequently in early iterations to reduce
the initial large variance and synchronize less frequently in the later phase
of the training process. We propose a practical algorithm, named ADaptive
Periodic parameter averaging SGD (ADPSGD), to achieve a smaller overall
variance of model parameters, and thus better convergence compared with the
Constant Periodic parameter averaging SGD (CPSGD). We evaluate our method with
several image classification benchmarks and show that our ADPSGD indeed
achieves smaller training losses and higher test accuracies with smaller
communication compared with CPSGD. Compared with gradient-quantization SGD, we
show that our algorithm achieves faster convergence with only half of the
communication. Compared with full-communication SGD, our ADPSGD achieves 1.14x
to 1.27x speedups with a 100Gbps connection among computing nodes, and the
speedups increase to 1.46x ~ 1.95x with a 10Gbps connection.
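To make the adaptive-period idea concrete, the sketch below implements periodic parameter averaging with an averaging period that grows over training. The schedule, model, and data are illustrative assumptions only; the paper's ADPSGD chooses its period from the variance of the model parameters across workers, which is not reproduced here.
```python
# A minimal sketch of adaptive periodic parameter averaging (not the paper's
# exact ADPSGD rule). Workers run local SGD on a toy least-squares problem and
# average their parameters periodically; the averaging period starts at 1 and
# grows, so synchronization is frequent early and sparse late.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression shards, one per worker (placeholder problem).
num_workers, dim, shard_size = 4, 10, 256
w_true = rng.normal(size=dim)
shards = []
for _ in range(num_workers):
    X = rng.normal(size=(shard_size, dim))
    y = X @ w_true + 0.1 * rng.normal(size=shard_size)
    shards.append((X, y))

def local_sgd_step(w, X, y, lr=0.05, batch=32):
    """One minibatch SGD step on a worker's local shard."""
    idx = rng.integers(0, len(y), size=batch)
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch
    return w - lr * grad

def averaging_period(t, growth=1.5, stage=100, p_max=32):
    """Hypothetical schedule: period 1 at first, growing geometrically."""
    return min(p_max, int(growth ** (t // stage)))

workers = [np.zeros(dim) for _ in range(num_workers)]
since_sync, syncs = 0, 0
for t in range(1000):
    workers = [local_sgd_step(w, X, y) for w, (X, y) in zip(workers, shards)]
    since_sync += 1
    if since_sync >= averaging_period(t):
        avg = np.mean(workers, axis=0)           # all-reduce style average
        workers = [avg.copy() for _ in workers]  # everyone restarts from it
        since_sync, syncs = 0, syncs + 1

w_final = np.mean(workers, axis=0)
loss = np.mean([np.mean((X @ w_final - y) ** 2) for X, y in shards])
print(f"final loss {loss:.4f} using {syncs} synchronizations")
```
Under a fixed communication budget (the total number of synchronizations), front-loading them this way targets the large initial parameter variance that the abstract identifies as critical to convergence, whereas CPSGD would spend the same budget uniformly over the run.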
Related papers
- Distributed Stochastic Gradient Descent with Staleness: A Stochastic Delay Differential Equation Based Framework [56.82432591933544]
Distributed stochastic gradient descent (SGD) has attracted considerable recent attention due to its potential for scaling computational resources, reducing training time, and helping protect user privacy in machine learning.
This paper characterizes the run time and staleness of distributed SGD using stochastic delay differential equations (SDDEs) and an approximation of gradient arrivals.
Interestingly, it is shown that increasing the number of activated workers does not necessarily accelerate distributed SGD because of staleness.
arXiv Detail & Related papers (2024-06-17T02:56:55Z) - DASA: Delay-Adaptive Multi-Agent Stochastic Approximation [64.32538247395627]
We consider a setting in which $N$ agents aim to speed up a common Stochastic Approximation problem by acting in parallel and communicating with a central server.
To mitigate the effect of delays and stragglers, we propose DASA, a Delay-Adaptive algorithm for multi-agent Stochastic Approximation.
arXiv Detail & Related papers (2024-03-25T22:49:56Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O\big(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T}\big)$ with a communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - DR-DSGD: A Distributionally Robust Decentralized Learning Algorithm over
Graphs [54.08445874064361]
We propose to solve a regularized distributionally robust learning problem in the decentralized setting.
By adding a Kullback-Leibler regularization function to the robust min-max optimization problem, the learning problem can be reduced to a modified robust problem (a generic form of this objective is sketched after the related-papers list).
We show that our proposed algorithm can improve the worst-distribution test accuracy by up to 10%.
arXiv Detail & Related papers (2022-08-29T18:01:42Z) - Avoiding Communication in Logistic Regression [1.7780157772002312]
Stochastic gradient descent (SGD) is one of the most widely used optimization methods for solving various machine learning problems.
In a parallel setting, SGD requires interprocess communication at every iteration.
We introduce a new communication-avoiding technique for solving the logistic regression problem using SGD.
arXiv Detail & Related papers (2020-11-16T21:14:39Z) - O(1) Communication for Distributed SGD through Two-Level Gradient
Averaging [0.0]
We introduce a strategy called two-level gradient averaging (A2SGD) to consolidate all gradients down to merely two local averages per worker.
Our theoretical analysis shows that A2SGD converges similarly to the default distributed SGD algorithm.
arXiv Detail & Related papers (2020-06-12T18:20:52Z) - DaSGD: Squeezing SGD Parallelization Performance in Distributed Training
Using Delayed Averaging [4.652668321425679]
The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagations while waiting for gradient communication.
DaSGD parallelizes SGD and forward/back propagations to hide 100% of the communication overhead.
arXiv Detail & Related papers (2020-05-31T05:43:50Z) - OD-SGD: One-step Delay Stochastic Gradient Descent for Distributed
Training [5.888925582071453]
We propose a novel technique named One-step Delay SGD (OD-SGD) that combines the strengths of synchronous and asynchronous SGD in the training process.
We evaluate our proposed algorithm on MNIST, CIFAR-10 and ImageNet datasets.
arXiv Detail & Related papers (2020-05-14T05:33:36Z) - Breaking (Global) Barriers in Parallel Stochastic Optimization with
Wait-Avoiding Group Averaging [34.55741812648229]
We present WAGMA-SGD, a wait-avoiding group model averaging SGD algorithm that reduces global communication via subgroup weight exchange.
We train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale.
Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput.
arXiv Detail & Related papers (2020-04-30T22:11:53Z) - Detached Error Feedback for Distributed SGD with Random Sparsification [98.98236187442258]
Communication bottleneck has been a critical problem in large-scale deep learning.
We propose a new detached error feedback (DEF) algorithm, which shows better convergence than error feedback for non-convex distributed learning problems.
We also propose DEFA to accelerate the generalization of DEF, which enjoys better generalization bounds than DEF.
arXiv Detail & Related papers (2020-04-11T03:50:59Z) - A Unified Theory of Decentralized SGD with Changing Topology and Local
Updates [70.9701218475002]
We introduce a unified convergence analysis of decentralized communication methods.
We derive universal convergence rates for several applications.
Our proofs rely on weak assumptions.
arXiv Detail & Related papers (2020-03-23T17:49:15Z)
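As context for the DR-DSGD entry above, a KL-regularized distributionally robust objective of the kind it describes is commonly written as follows; this is a standard formulation given only for illustration, and the notation is an assumption rather than something taken from that paper.
```latex
\min_{\theta}\ \max_{p \in \Delta_N}\ \sum_{i=1}^{N} p_i f_i(\theta)
  \;-\; \lambda\, \mathrm{KL}\!\left( p \,\middle\|\, \tfrac{1}{N}\mathbf{1} \right)
```
Here $f_i$ is the local loss at node $i$, $p$ ranges over distributions on the $N$ nodes, and $\lambda$ controls the strength of the Kullback-Leibler penalty; a larger $\lambda$ pulls the inner maximization back toward the uniform average, recovering the standard average-loss objective in the limit.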