LASG: Lazily Aggregated Stochastic Gradients for Communication-Efficient
Distributed Learning
- URL: http://arxiv.org/abs/2002.11360v1
- Date: Wed, 26 Feb 2020 08:58:54 GMT
- Title: LASG: Lazily Aggregated Stochastic Gradients for Communication-Efficient
Distributed Learning
- Authors: Tianyi Chen, Yuejiao Sun, Wotao Yin
- Abstract summary: This paper targets solving distributed machine learning problems such as federated learning in a communication-efficient fashion.
A class of new stochastic gradient descent (SGD) approaches has been developed, which can be viewed as a stochastic generalization of the recently developed lazily aggregated gradient (LAG) method.
The key components of LASG are a set of new rules tailored for stochastic gradients that can be implemented either to save download, upload, or both.
- Score: 47.93365664380274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper targets solving distributed machine learning problems such as
federated learning in a communication-efficient fashion. A class of new
stochastic gradient descent (SGD) approaches has been developed, which can be
viewed as the stochastic generalization of the recently developed lazily
aggregated gradient (LAG) method --- justifying the name LASG. LAG adaptively
predicts the contribution of each round of communication and chooses only the
significant ones to perform. It saves communication while also maintaining the
rate of convergence. However, LAG only works with deterministic gradients, and
applying it to stochastic gradients yields poor performance. The key components
of LASG are a set of new rules tailored for stochastic gradients that can be
implemented either to save download, upload, or both. The new algorithms
adaptively choose between fresh and stale stochastic gradients and have
convergence rates comparable to the original SGD. LASG achieves impressive
empirical performance --- it typically saves total communication by an order of
magnitude.
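As a rough illustration of the lazy-communication idea in the abstract, the sketch below simulates workers that upload a fresh stochastic gradient only when it differs enough from their last uploaded one, measured against recent iterate movement. The skipping constant `c_lazy`, memory length `D`, and toy quadratic losses are illustrative assumptions, not the paper's exact LASG-WK/LASG-PS rules.

```python
# Hypothetical sketch of a lazy-upload rule in the spirit of LASG; the exact
# LASG-WK/LASG-PS conditions in the paper differ. `c_lazy`, `D`, and the toy
# losses are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
M, dim, D = 5, 10, 10            # workers, parameter dimension, memory length
c_lazy, lr = 0.5, 0.1

# toy local losses f_m(theta) = 0.5 * ||A_m theta - b_m||^2 / batch
A = [rng.normal(size=(20, dim)) for _ in range(M)]
b = [rng.normal(size=20) for _ in range(M)]

def stoch_grad(m, theta, batch=4):
    idx = rng.choice(20, size=batch, replace=False)
    return A[m][idx].T @ (A[m][idx] @ theta - b[m][idx]) / batch

theta = np.zeros(dim)
stale = [stoch_grad(m, theta) for m in range(M)]   # last uploaded gradients
moves = []                                         # recent squared iterate changes
uploads = 0

for k in range(200):
    window = moves[-D:]
    thresh = c_lazy * sum(window) / max(len(window), 1)
    agg = np.zeros(dim)
    for m in range(M):
        g = stoch_grad(m, theta)
        # upload only if the stochastic gradient changed enough
        if np.sum((g - stale[m]) ** 2) >= thresh:
            stale[m] = g            # fresh gradient is communicated
            uploads += 1
        agg += stale[m]             # otherwise the server reuses the stale one
    new_theta = theta - lr * agg / M
    moves.append(np.sum((new_theta - theta) ** 2))
    theta = new_theta

print(f"uploads: {uploads} of {200 * M} possible")
```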
Related papers
- Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}(\ln(T) / T^{1 - \frac{1}{\alpha}})$.
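A minimal sketch of the ingredient being combined here: a server-side AdaGrad step applied to an averaged client gradient, with additive noise crudely standing in for over-the-air aggregation distortion. The noise model, step size, and toy objectives are assumptions for illustration, not the paper's channel model or algorithm.

```python
# Server-side AdaGrad on a noisy average of client gradients; the additive
# `air_noise` is a crude stand-in for over-the-air aggregation distortion
# and is an illustrative assumption only.
import numpy as np

rng = np.random.default_rng(1)
dim, clients, eta, eps = 10, 8, 0.5, 1e-8
targets = [rng.normal(size=dim) for _ in range(clients)]   # toy client optima

theta = np.zeros(dim)
accum = np.zeros(dim)                   # AdaGrad's per-coordinate accumulator
for t in range(300):
    grads = [theta - tgt for tgt in targets]               # toy quadratic grads
    air_noise = 0.01 * rng.normal(size=dim)                # channel distortion
    g = np.mean(grads, axis=0) + air_noise                 # aggregated "over the air"
    accum += g ** 2
    theta -= eta * g / (np.sqrt(accum) + eps)

print("distance to average optimum:", np.linalg.norm(theta - np.mean(targets, axis=0)))
```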
arXiv Detail & Related papers (2024-03-11T09:10:37Z)
- Adaptive Top-K in SGD for Communication-Efficient Distributed Learning [14.867068493072885]
This paper proposes a novel adaptive Top-K in SGD framework that enables an adaptive degree of sparsification for each gradient descent step to optimize the convergence performance.
Numerical results on the MNIST and CIFAR-10 datasets demonstrate that the proposed adaptive Top-K algorithm in SGD achieves a significantly better convergence rate compared to state-of-the-art methods.
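The sketch below shows plain Top-K gradient sparsification with a step-dependent K; the shrinking schedule is a placeholder assumption, not the adaptive rule derived in the paper.

```python
# Top-K gradient sparsification with a step-dependent K; the K schedule here
# is a placeholder assumption, not the paper's adaptive rule.
import numpy as np

def top_k_sparsify(grad: np.ndarray, k: int) -> np.ndarray:
    """Zero out all but the k largest-magnitude entries of grad."""
    if k >= grad.size:
        return grad.copy()
    keep = np.argpartition(np.abs(grad), -k)[-k:]
    sparse = np.zeros_like(grad)
    sparse[keep] = grad[keep]
    return sparse

rng = np.random.default_rng(2)
grad = rng.normal(size=1000)
for step in (0, 25, 50, 75):
    k = max(10, 200 - 2 * step)          # assumed schedule: send less over time
    kept = np.count_nonzero(top_k_sparsify(grad, k))
    print(f"step {step:2d}: kept {kept} of {grad.size} coordinates")
```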
arXiv Detail & Related papers (2022-10-24T18:33:35Z)
- Implicit Gradient Alignment in Distributed and Federated Learning [39.61762498388211]
A major obstacle to achieving global convergence in distributed and federated learning is misalignment of gradients across clients.
We propose a novel GradAlign algorithm that induces the same implicit regularization while allowing the use of arbitrarily large batches in each update.
arXiv Detail & Related papers (2021-06-25T22:01:35Z)
- Cogradient Descent for Dependable Learning [64.02052988844301]
We propose a dependable learning method based on the Cogradient Descent (CoGD) algorithm to address the bilinear optimization problem.
CoGD is introduced to solve bilinear problems when one variable is subject to a sparsity constraint.
It can also be used to decompose the association of features and weights, which further generalizes our method to better train convolutional neural networks (CNNs).
arXiv Detail & Related papers (2021-06-20T04:28:20Z)
- CADA: Communication-Adaptive Distributed Adam [31.02472517086767]
Stochastic gradient descent (SGD) has taken the stage as the primary workhorse for large-scale machine learning.
This paper proposes an adaptive gradient descent method for distributed machine learning, which can be viewed as the communication-adaptive counterpart of the celebrated Adam method.
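As a loose illustration (not CADA's actual skipping condition or analysis), the sketch below runs a server-side Adam step on an aggregate in which workers only re-upload a gradient when it has changed noticeably; the threshold and toy gradients are assumptions.

```python
# Server-side Adam consuming an aggregate of possibly stale worker gradients;
# the skip threshold and toy gradients are illustrative assumptions, not
# CADA's actual condition.
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    theta = theta - lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
    return theta, m, v

rng = np.random.default_rng(3)
dim, workers, skips = 5, 4, 0
theta, m, v = np.ones(dim), np.zeros(dim), np.zeros(dim)
stale = [np.zeros(dim) for _ in range(workers)]            # last uploaded gradients
for t in range(1, 201):
    for w in range(workers):
        fresh = theta + 0.05 * rng.normal(size=dim)        # toy noisy gradient of 0.5*||theta||^2
        if np.linalg.norm(fresh - stale[w]) >= 0.1:        # assumed reuse rule
            stale[w] = fresh                               # communicate fresh gradient
        else:
            skips += 1                                     # server reuses the stale one
    theta, m, v = adam_step(theta, np.mean(stale, axis=0), m, v, t)

print(f"skipped uploads: {skips} of {200 * workers}")
```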
arXiv Detail & Related papers (2020-12-31T06:52:18Z)
- Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
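The sketch below shows the basic pattern of averaging compressed worker gradients; the simple uniform quantizer is a generic example, not one of the specific schemes compared in the paper.

```python
# Synchronous averaging of compressed worker gradients; the uniform quantizer
# below is a generic example, not one of the paper's specific schemes.
import numpy as np

def quantize(x: np.ndarray, levels: int = 16) -> np.ndarray:
    """Map each entry to one of `levels` evenly spaced values on [min, max]."""
    lo, hi = x.min(), x.max()
    if hi == lo:
        return x.copy()
    step = (hi - lo) / (levels - 1)
    return lo + np.round((x - lo) / step) * step

rng = np.random.default_rng(4)
dim, workers, lr = 50, 4, 0.1
theta = rng.normal(size=dim)
for it in range(100):
    # each worker computes a local gradient (toy objective: pull theta toward zero)
    local = [theta + 0.05 * rng.normal(size=dim) for _ in range(workers)]
    avg = np.mean([quantize(g) for g in local], axis=0)    # share compressed gradients
    theta -= lr * avg                                      # update with their average
print("final norm:", np.linalg.norm(theta))
```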
arXiv Detail & Related papers (2020-09-19T17:28:11Z)
- Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems with one variable under the sparsity constraint.
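For context only, the sketch below takes synchronous gradient steps on a toy bilinear factorization with a soft-threshold enforcing sparsity on one factor; the specific coupled update CoGD uses is not reproduced here.

```python
# Synchronous gradient steps on a toy bilinear factorization, with a
# soft-threshold keeping one factor sparse; not CoGD's actual coupled update.
import numpy as np

rng = np.random.default_rng(5)
n, r = 30, 3
U_true = rng.normal(size=(n, r)) * (rng.random((n, r)) < 0.3)   # sparse factor
V_true = rng.normal(size=(n, r))
Y = U_true @ V_true.T

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

U, V = rng.normal(size=(n, r)) * 0.1, rng.normal(size=(n, r)) * 0.1
lr, lam = 0.01, 0.01
for it in range(500):
    R = U @ V.T - Y                    # shared residual couples the two blocks
    gU, gV = R @ V, R.T @ U            # gradients of 0.5 * ||Y - U V^T||_F^2
    U = soft_threshold(U - lr * gU, lr * lam)   # proximal step keeps U sparse
    V = V - lr * gV                             # plain gradient step on the other factor
print("relative error:", np.linalg.norm(U @ V.T - Y) / np.linalg.norm(Y))
```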
arXiv Detail & Related papers (2020-06-16T13:41:54Z)
- DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging [4.652668321425679]
The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagation during communication.
DaSGD parallelizes SGD and forward/back propagations to hide 100% of the communication overhead.
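A schematic of the delayed-averaging idea, not DaSGD's full scheme: the averaged gradient from round t is applied at round t+1, so in a real system its communication could overlap with the next round's forward/back propagation.

```python
# One-step-delayed gradient averaging: the average from round t is applied at
# round t+1, so its communication could overlap the next round's compute.
# A schematic only, not DaSGD's full scheme.
import numpy as np

rng = np.random.default_rng(6)
dim, workers, lr = 20, 4, 0.1
theta = rng.normal(size=dim)
in_flight = np.zeros(dim)               # averaged gradient still "being communicated"

for t in range(200):
    # this round's local gradients (toy objective 0.5*||theta||^2 plus noise)
    local = [theta + 0.05 * rng.normal(size=dim) for _ in range(workers)]
    avg_now = np.mean(local, axis=0)

    theta -= lr * in_flight             # apply the *previous* round's average
    in_flight = avg_now                 # this round's average arrives next step

print("final norm:", np.linalg.norm(theta))
```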
arXiv Detail & Related papers (2020-05-31T05:43:50Z)
- Federated Stochastic Gradient Langevin Dynamics [12.180900849847252]
Stochastic gradient MCMC methods, such as stochastic gradient Langevin dynamics (SGLD), employ fast but noisy gradient estimates to enable large-scale posterior sampling.
We propose conducive gradients, a simple mechanism that combines local likelihood approximations to correct gradient updates.
We demonstrate that our approach can handle delayed communication rounds, converging to the target posterior in cases where DSGLD fails.
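For reference, the sketch below is a plain SGLD step on a toy Gaussian posterior; the `correction` variable marks where a conducive-gradient term would enter and is deliberately left as a zero placeholder rather than the paper's construction.

```python
# Plain SGLD on a toy Gaussian posterior; `correction` marks where a
# conducive-gradient term would enter and is left as a zero placeholder.
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(loc=2.0, scale=1.0, size=1000)   # local shard, known unit variance
N, batch, eps = data.size, 32, 1e-3

theta, samples = 0.0, []
for t in range(5000):
    idx = rng.choice(N, size=batch, replace=False)
    grad_log_prior = -theta                                  # standard normal prior
    grad_log_lik = (N / batch) * np.sum(data[idx] - theta)   # rescaled minibatch term
    correction = 0.0                                 # placeholder for a conducive term
    noise = rng.normal(scale=np.sqrt(eps))
    theta += 0.5 * eps * (grad_log_prior + grad_log_lik + correction) + noise
    samples.append(theta)

print("posterior mean estimate:", np.mean(samples[1000:]))
```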
arXiv Detail & Related papers (2020-04-23T15:25:09Z)
- Detached Error Feedback for Distributed SGD with Random Sparsification [98.98236187442258]
Communication bottleneck has been a critical problem in large-scale deep learning.
We propose a new detached error feedback (DEF) algorithm, which shows better convergence than error feedback for non-convex distributed problems.
We also propose DEFA to accelerate the generalization of DEF, which shows better bounds than DEF.
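The loop below is the classic error-feedback pattern with a random-mask sparsifier, shown for context; DEF's "detached" modification and its accelerated variant are not reproduced here.

```python
# Classic error-feedback loop with a random-mask sparsifier, for context;
# DEF's "detached" modification is not reproduced here.
import numpy as np

def rand_sparsify(x: np.ndarray, p: float, rng) -> np.ndarray:
    """Keep each coordinate with probability p (biased but contractive)."""
    return np.where(rng.random(x.shape) < p, x, 0.0)

rng = np.random.default_rng(8)
dim, lr, p = 100, 0.1, 0.1
theta = rng.normal(size=dim)
err = np.zeros(dim)                        # memory of what compression dropped
for t in range(500):
    g = theta                              # toy gradient of 0.5 * ||theta||^2
    sent = rand_sparsify(g + err, p, rng)  # compress gradient plus carried error
    err = g + err - sent                   # keep the dropped part for next round
    theta -= lr * sent
print("final norm:", np.linalg.norm(theta))
```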
arXiv Detail & Related papers (2020-04-11T03:50:59Z)