Adaptive Stochastic Gradient Descent for Fast and
Communication-Efficient Distributed Learning
- URL: http://arxiv.org/abs/2208.03134v1
- Date: Thu, 4 Aug 2022 10:57:25 GMT
- Title: Adaptive Stochastic Gradient Descent for Fast and
Communication-Efficient Distributed Learning
- Authors: Serge Kas Hanna and Rawad Bitar and Parimal Parag and Venkat Dasari
and Salim El Rouayheb
- Abstract summary: We consider the setting where a master wants to run a distributed stochastic gradient descent (SGD) algorithm on $n$ workers.
We show that the adaptive version of distributed SGD can reach lower error values in less time compared to non-adaptive implementations.
- Score: 33.590006101071765
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the setting where a master wants to run a distributed stochastic
gradient descent (SGD) algorithm on $n$ workers, each having a subset of the
data. Distributed SGD may suffer from the effect of stragglers, i.e., slow or
unresponsive workers who cause delays. One solution studied in the literature
is to wait at each iteration for the responses of the fastest $k<n$ workers
before updating the model, where $k$ is a fixed parameter. The choice of the
value of $k$ presents a trade-off between the runtime (i.e., convergence rate)
of SGD and the error of the model. Towards optimizing the error-runtime
trade-off, we investigate distributed SGD with adaptive~$k$, i.e., varying $k$
throughout the runtime of the algorithm. We first design an adaptive policy for
varying $k$ that optimizes this trade-off based on an upper bound on the error
as a function of the wall-clock time that we derive. Then, we propose and
implement an algorithm for adaptive distributed SGD that is based on a
statistical heuristic. Our results show that the adaptive version of
distributed SGD can reach lower error values in less time compared to
non-adaptive implementations. Moreover, the results also show that the adaptive
version is communication-efficient, where the amount of communication required
between the master and the workers is less than that of non-adaptive versions.
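To make the fastest-$k$ scheme concrete, below is a minimal simulation sketch (not the authors' implementation): $n$ workers hold data shards and respond with random delays, the master waits only for the fastest $k$ responses at each iteration, and $k$ grows with wall-clock time according to a simple illustrative schedule rather than the error-bound-based policy or statistical heuristic developed in the paper. The least-squares objective, the exponential delay model, and the `adaptive_k` schedule are all assumptions made for illustration.

```python
# Minimal sketch of fastest-k distributed SGD with an adaptive k (assumptions:
# least-squares objective, exponential worker delays, a simple time-based
# schedule for k). This is illustrative only, not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, shard_size = 10, 5, 200
true_w = rng.normal(size=dim)

# Each worker holds a local data shard.
shards = []
for _ in range(n_workers):
    X = rng.normal(size=(shard_size, dim))
    y = X @ true_w + 0.1 * rng.normal(size=shard_size)
    shards.append((X, y))

def worker_gradient(w, shard, batch=32):
    """One worker's stochastic gradient on its local shard (least squares)."""
    X, y = shard
    idx = rng.choice(len(y), size=batch, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch

def adaptive_k(elapsed, total_time, n):
    """Illustrative schedule: small k early (fast, noisy updates),
    growing toward n as wall-clock time passes (slow, accurate updates)."""
    frac = min(elapsed / total_time, 1.0)
    return max(1, int(round((0.3 + 0.7 * frac) * n)))

w = np.zeros(dim)
lr, total_time, clock = 0.1, 50.0, 0.0
while clock < total_time:
    k = adaptive_k(clock, total_time, n_workers)
    # Simulated response times; the master waits only for the k fastest workers.
    delays = rng.exponential(scale=1.0, size=n_workers)
    fastest = np.argsort(delays)[:k]
    clock += delays[fastest[-1]]          # iteration time = k-th fastest delay
    grads = [worker_gradient(w, shards[i]) for i in fastest]
    w -= lr * np.mean(grads, axis=0)      # aggregate only the received gradients

print("final parameter error:", np.linalg.norm(w - true_w))
```

Starting with a small $k$ trades gradient accuracy for fast, frequent updates early on, while growing $k$ later reduces the variance of the aggregated gradient; this mirrors the error-runtime trade-off that the adaptive policy targets, and it also keeps communication low because fewer than $n$ workers are heard from in most iterations.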
Related papers
- DASA: Delay-Adaptive Multi-Agent Stochastic Approximation [64.32538247395627]
We consider a setting in which $N$ agents aim to speed up a common Stochastic Approximation problem by acting in parallel and communicating with a central server.
To mitigate the effect of delays and stragglers, we propose DASA, a Delay-Adaptive algorithm for multi-agent Stochastic Approximation.
arXiv Detail & Related papers (2024-03-25T22:49:56Z) - Federated Learning Using Variance Reduced Stochastic Gradient for
Probabilistically Activated Agents [0.0]
This paper proposes an algorithm for Federated Learning (FL) with a two-layer structure that achieves both variance reduction and a faster convergence rate to an optimal solution in the setting where each agent has an arbitrary probability of selection in each iteration.
arXiv Detail & Related papers (2022-10-25T22:04:49Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with a communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - DR-DSGD: A Distributionally Robust Decentralized Learning Algorithm over
Graphs [54.08445874064361]
We propose to solve a regularized distributionally robust learning problem in the decentralized setting.
By adding a Kullback-Leibler regularization function to the robust min-max optimization problem, the learning problem can be reduced to a modified robust problem.
We show that our proposed algorithm can improve the worst distribution test accuracy by up to $10\%$.
arXiv Detail & Related papers (2022-08-29T18:01:42Z) - Sharper Convergence Guarantees for Asynchronous SGD for Distributed and
Federated Learning [77.22019100456595]
We study a training algorithm for distributed computation in which workers participate with varying communication frequencies.
In this work, we obtain a tighter convergence rate of $\mathcal{O}(\sigma^2\epsilon^{-2} + \tau_{\text{avg}}\epsilon^{-1})$.
We also show that the heterogeneity term in the rate is affected by the average delay within each worker.
arXiv Detail & Related papers (2022-06-16T17:10:57Z) - Guided parallelized stochastic gradient descent for delay compensation [0.0]
Stochastic gradient descent (SGD) and its variants have been effectively used to optimize neural network models.
With the rapid growth of big data and deep learning, SGD is no longer the most suitable choice due to its natural behavior of sequential optimization of the error function.
This has led to the development of parallel SGD algorithms, such as asynchronous SGD (ASGD) and synchronous SGD (SSGD) to train deep neural networks.
arXiv Detail & Related papers (2021-01-17T23:12:40Z) - Avoiding Communication in Logistic Regression [1.7780157772002312]
Stochastic gradient descent (SGD) is one of the most widely used optimization methods for solving various machine learning problems.
In a parallel setting, SGD requires interprocess communication at every iteration.
We introduce a new communication-avoiding technique for solving the logistic regression problem using SGD.
arXiv Detail & Related papers (2020-11-16T21:14:39Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic
Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence with the fact that a constant step-size learns faster in time, but only up to an error floor.
Rather than fixing the mini-batch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2020-07-02T16:02:02Z) - Private Stochastic Non-Convex Optimization: Adaptive Algorithms and
Tighter Generalization Bounds [72.63031036770425]
We propose differentially private (DP) algorithms for stochastic non-convex optimization.
We demonstrate the empirical advantages of our methods over standard gradient methods on two popular deep learning tasks.
arXiv Detail & Related papers (2020-06-24T06:01:24Z) - Adaptive Distributed Stochastic Gradient Descent for Minimizing Delay in
the Presence of Stragglers [31.309253646602933]
We consider the setting where a master wants to run a distributed stochastic gradient descent (SGD) algorithm on $n$ workers, each having a subset of the data.
Distributed SGD may suffer from the effect of stragglers, i.e., slow or unresponsive workers who cause delays.
We propose an algorithm for adaptive distributed SGD that is based on a statistical notion.
arXiv Detail & Related papers (2020-02-25T16:25:22Z)
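As referenced in the entry on "Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems" above, here is a minimal sketch of the general idea of letting the mini-batch size grow under a constant step-size instead of decaying the step-size: the growing batch shrinks gradient-noise variance over time much like an attenuating step-size would. The growth schedule, least-squares objective, and constants are illustrative assumptions, not the cited paper's algorithm.

```python
# Minimal sketch (illustrative assumptions, not the cited paper's method):
# constant step-size SGD whose mini-batch size grows over the iterations.
import numpy as np

rng = np.random.default_rng(1)
dim, n_samples = 5, 5000
true_w = rng.normal(size=dim)
X = rng.normal(size=(n_samples, dim))
y = X @ true_w + 0.1 * rng.normal(size=n_samples)

w = np.zeros(dim)
step = 0.1                       # kept constant throughout
for t in range(1, 201):
    batch = min(8 * t, 512)      # illustrative schedule: batch grows with t
    idx = rng.choice(n_samples, size=batch, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / batch
    w -= step * grad             # larger batches => lower-variance updates

print("parameter error:", np.linalg.norm(w - true_w))
```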