Adaptive Stochastic Gradient Descent for Fast and
Communication-Efficient Distributed Learning
- URL: http://arxiv.org/abs/2208.03134v1
- Date: Thu, 4 Aug 2022 10:57:25 GMT
- Title: Adaptive Stochastic Gradient Descent for Fast and
Communication-Efficient Distributed Learning
- Authors: Serge Kas Hanna and Rawad Bitar and Parimal Parag and Venkat Dasari
and Salim El Rouayheb
- Abstract summary: We consider the setting where a master wants to run a distributed stochastic gradient descent (SGD) algorithm on $n$ workers.
We show that the adaptive version of distributed SGD can reach lower error values in less time compared to non-adaptive implementations.
- Score: 33.590006101071765
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the setting where a master wants to run a distributed stochastic
gradient descent (SGD) algorithm on $n$ workers, each having a subset of the
data. Distributed SGD may suffer from the effect of stragglers, i.e., slow or
unresponsive workers who cause delays. One solution studied in the literature
is to wait at each iteration for the responses of the fastest $k<n$ workers
before updating the model, where $k$ is a fixed parameter. The choice of the
value of $k$ presents a trade-off between the runtime (i.e., convergence rate)
of SGD and the error of the model. Towards optimizing the error-runtime
trade-off, we investigate distributed SGD with adaptive~$k$, i.e., varying $k$
throughout the runtime of the algorithm. We first design an adaptive policy for
varying $k$ that optimizes this trade-off based on an upper bound on the error
as a function of the wall-clock time that we derive. Then, we propose and
implement an algorithm for adaptive distributed SGD that is based on a
statistical heuristic. Our results show that the adaptive version of
distributed SGD can reach lower error values in less time compared to
non-adaptive implementations. Moreover, the results also show that the adaptive
version is communication-efficient, where the amount of communication required
between the master and the workers is less than that of non-adaptive versions.
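To make the fastest-$k$ scheme concrete, below is a minimal simulation sketch (not the authors' implementation): $n$ workers hold data shards and respond with random delays, the master waits only for the fastest $k$ responses at each iteration, and $k$ grows with wall-clock time according to a simple illustrative schedule rather than the error-bound-based policy or statistical heuristic developed in the paper. The least-squares objective, the exponential delay model, and the `adaptive_k` schedule are all assumptions made for illustration.

```python
# Minimal sketch of fastest-k distributed SGD with an adaptive k (assumptions:
# least-squares objective, exponential worker delays, a simple time-based
# schedule for k). This is illustrative only, not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, shard_size = 10, 5, 200
true_w = rng.normal(size=dim)

# Each worker holds a local data shard.
shards = []
for _ in range(n_workers):
    X = rng.normal(size=(shard_size, dim))
    y = X @ true_w + 0.1 * rng.normal(size=shard_size)
    shards.append((X, y))

def worker_gradient(w, shard, batch=32):
    """One worker's stochastic gradient on its local shard (least squares)."""
    X, y = shard
    idx = rng.choice(len(y), size=batch, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch

def adaptive_k(elapsed, total_time, n):
    """Illustrative schedule: small k early (fast, noisy updates),
    growing toward n as wall-clock time passes (slow, accurate updates)."""
    frac = min(elapsed / total_time, 1.0)
    return max(1, int(round((0.3 + 0.7 * frac) * n)))

w = np.zeros(dim)
lr, total_time, clock = 0.1, 50.0, 0.0
while clock < total_time:
    k = adaptive_k(clock, total_time, n_workers)
    # Simulated response times; the master waits only for the k fastest workers.
    delays = rng.exponential(scale=1.0, size=n_workers)
    fastest = np.argsort(delays)[:k]
    clock += delays[fastest[-1]]          # iteration time = k-th fastest delay
    grads = [worker_gradient(w, shards[i]) for i in fastest]
    w -= lr * np.mean(grads, axis=0)      # aggregate only the received gradients

print("final parameter error:", np.linalg.norm(w - true_w))
```

Starting with a small $k$ trades gradient accuracy for fast, frequent updates early on, while growing $k$ later reduces the variance of the aggregated gradient; this mirrors the error-runtime trade-off that the adaptive policy targets, and it also keeps communication low because fewer than $n$ workers are heard from in most iterations.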
Related papers
- DASA: Delay-Adaptive Multi-Agent Stochastic Approximation [64.32538247395627]
We consider a setting in which $N$ agents aim to speed up a common Stochastic Approximation problem by acting in parallel and communicating with a central server.
To mitigate the effect of delays and stragglers, we propose DASA, a Delay-Adaptive algorithm for multi-agent Stochastic Approximation.
arXiv Detail & Related papers (2024-03-25T22:49:56Z) - Federated Learning Using Variance Reduced Stochastic Gradient for
Probabilistically Activated Agents [0.0]
This paper proposes an algorithm for Federated Learning (FL) with a two-layer structure that achieves both variance reduction and a faster convergence rate to an optimal solution in the setting where each agent has an arbitrary probability of selection in each iteration.
arXiv Detail & Related papers (2022-10-25T22:04:49Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with a communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - DR-DSGD: A Distributionally Robust Decentralized Learning Algorithm over
Graphs [54.08445874064361]
We propose to solve a regularized distributionally robust learning problem in the decentralized setting.
By adding a Kullback-Leibler regularization function to the robust min-max optimization problem, the learning problem can be reduced to a modified robust problem.
We show that our proposed algorithm can improve the worst distribution test accuracy by up to $10\%$.
arXiv Detail & Related papers (2022-08-29T18:01:42Z) - Sharper Convergence Guarantees for Asynchronous SGD for Distributed and
Federated Learning [77.22019100456595]
We study a training algorithm for distributed computation in which workers participate with varying communication frequencies.
In this work, we obtain a tighter convergence rate of $\mathcal{O}(\sigma^2\epsilon^{-2} + \tau_{\text{avg}}\epsilon^{-1})$.
We also show that the heterogeneity term in the rate is affected by the average delay within each worker.
arXiv Detail & Related papers (2022-06-16T17:10:57Z) - Guided parallelized stochastic gradient descent for delay compensation [0.0]
Stochastic gradient descent (SGD) and its variants have been effectively used to optimize neural network models.
With the rapid growth of big data and deep learning, SGD is no longer the most suitable choice due to its natural behavior of sequential optimization of the error function.
This has led to the development of parallel SGD algorithms, such as asynchronous SGD (ASGD) and synchronous SGD (SSGD) to train deep neural networks.
arXiv Detail & Related papers (2021-01-17T23:12:40Z) - Avoiding Communication in Logistic Regression [1.7780157772002312]
Stochastic gradient descent (SGD) is one of the most widely used optimization methods for solving various machine learning problems.
In a parallel setting, SGD requires interprocess communication at every iteration.
We introduce a new communication-avoiding technique for solving the logistic regression problem using SGD.
arXiv Detail & Related papers (2020-11-16T21:14:39Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic
Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence with the fact that a constant step-size learns faster in time, but only up to an error floor.
Rather than fixing the mini-batch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2020-07-02T16:02:02Z) - Private Stochastic Non-Convex Optimization: Adaptive Algorithms and
Tighter Generalization Bounds [72.63031036770425]
We propose differentially private (DP) algorithms for stochastic non-convex optimization.
We demonstrate the empirical advantages of our methods over standard gradient methods on two popular deep learning tasks.
arXiv Detail & Related papers (2020-06-24T06:01:24Z) - Adaptive Distributed Stochastic Gradient Descent for Minimizing Delay in
the Presence of Stragglers [31.309253646602933]
We consider the setting where a master wants to run a distributed stochastic gradient descent (SGD) algorithm on $n$ workers, each having a subset of the data.
Distributed SGD may suffer from the effect of stragglers, i.e., slow or unresponsive workers who cause delays.
We propose an algorithm for adaptive distributed SGD that is based on a statistical notion.
arXiv Detail & Related papers (2020-02-25T16:25:22Z)
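As referenced in the entry on "Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems" above, here is a minimal sketch of the general idea of letting the mini-batch size grow under a constant step-size instead of decaying the step-size: the growing batch shrinks gradient-noise variance over time much like an attenuating step-size would. The growth schedule, least-squares objective, and constants are illustrative assumptions, not the cited paper's algorithm.

```python
# Minimal sketch (illustrative assumptions, not the cited paper's method):
# constant step-size SGD whose mini-batch size grows over the iterations.
import numpy as np

rng = np.random.default_rng(1)
dim, n_samples = 5, 5000
true_w = rng.normal(size=dim)
X = rng.normal(size=(n_samples, dim))
y = X @ true_w + 0.1 * rng.normal(size=n_samples)

w = np.zeros(dim)
step = 0.1                       # kept constant throughout
for t in range(1, 201):
    batch = min(8 * t, 512)      # illustrative schedule: batch grows with t
    idx = rng.choice(n_samples, size=batch, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / batch
    w -= step * grad             # larger batches => lower-variance updates

print("parameter error:", np.linalg.norm(w - true_w))
```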