Fast and Straggler-Tolerant Distributed SGD with Reduced Computation Load
- URL: http://arxiv.org/abs/2304.08589v1
- Date: Mon, 17 Apr 2023 20:12:18 GMT
- Title: Fast and Straggler-Tolerant Distributed SGD with Reduced Computation Load
- Authors: Maximilian Egger, Serge Kas Hanna and Rawad Bitar
- Abstract summary: Optimization procedures like stochastic gradient descent (SGD) can be leveraged to mitigate the effect of unresponsive or slow workers, called stragglers.
This can be done by only waiting for a subset of the workers to finish their computation at each iteration of the algorithm.
We construct a novel scheme that adapts both the number of workers and the computation load throughout the run-time of the algorithm.
- Score: 11.069252535469644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In distributed machine learning, a central node outsources computationally
expensive calculations to external worker nodes. The properties of optimization
procedures like stochastic gradient descent (SGD) can be leveraged to mitigate
the effect of unresponsive or slow workers called stragglers, which otherwise
degrade the benefit of outsourcing the computation. This can be done by only
waiting for a subset of the workers to finish their computation at each
iteration of the algorithm. Previous works proposed to adapt the number of
workers to wait for as the algorithm evolves to optimize the speed of
convergence. In contrast, we model the communication and computation times
using independent random variables. Considering this model, we construct a
novel scheme that adapts both the number of workers and the computation load
throughout the run-time of the algorithm. Consequently, we improve the
convergence speed of distributed SGD while significantly reducing the
computation load, at the expense of a slight increase in communication load.
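
As a rough illustration of the adaptive waiting idea described in the abstract (not the paper's actual policy), the Python sketch below simulates a main node that waits for the fastest $k$ of $n$ workers each iteration and adjusts both $k$ and the per-worker computation load over time; the delay model, constants, and adaptation schedule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_iteration(n, k, batch_size, comm_rate=1.0, comp_rate=0.1):
    """Simulate one round: each worker's finish time is an independent
    communication delay plus a computation delay that grows with batch size."""
    comm = rng.exponential(comm_rate, size=n)                # communication times
    comp = rng.exponential(comp_rate, size=n) * batch_size   # computation times
    finish = comm + comp
    return np.sort(finish)[k - 1]                            # wait for fastest k of n

n = 20
k, batch_size = 3, 64          # initial setting (illustrative)
total_time = 0.0
for it in range(100):
    total_time += run_iteration(n, k, batch_size)
    # Illustrative adaptation rule (an assumption, not the paper's scheme):
    # wait for more workers but give each a lighter load as training progresses.
    if it % 20 == 19:
        k = min(n, k + 3)
        batch_size = max(8, batch_size // 2)
print(f"simulated wall-clock time for 100 iterations: {total_time:.1f}")
```

The schedule mimics the general pattern of accepting few, cheap, noisy updates early and more, lighter-load updates later, which is the kind of trade-off between convergence speed, computation load, and communication load that the paper studies.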
Related papers
- Distributed Stochastic Gradient Descent with Staleness: A Stochastic Delay Differential Equation Based Framework [56.82432591933544]
Distributed stochastic gradient descent (SGD) has attracted considerable recent attention due to its potential for scaling computational resources, reducing training time, and helping protect user privacy in machine learning.
This paper presents an analysis of the run time and staleness of distributed SGD based on stochastic delay differential equations (SDDEs) and an approximation of gradient arrivals.
Interestingly, it is shown that increasing the number of activated workers does not necessarily accelerate distributed SGD due to staleness.
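
The staleness effect claimed above can be reproduced qualitatively with a toy event-driven simulation; this is not the paper's SDDE analysis, and the exponential delay model and quadratic objective are assumptions made only for illustration.

```python
import heapq
import numpy as np

rng = np.random.default_rng(1)

def async_sgd(n_workers, n_updates=2000, lr=0.01):
    """Toy asynchronous SGD on f(x) = x^2 / 2 (gradient = x).
    Each worker computes its gradient at the parameter value it last saw,
    so updates arrive with a staleness that grows with the number of workers."""
    x = 10.0
    # (finish_time, worker_id, stale_parameter_copy)
    events = [(rng.exponential(1.0), w, x) for w in range(n_workers)]
    heapq.heapify(events)
    t = 0.0
    for _ in range(n_updates):
        t, w, x_stale = heapq.heappop(events)
        x -= lr * x_stale                       # stale gradient of x^2 / 2
        heapq.heappush(events, (t + rng.exponential(1.0), w, x))
    return t, abs(x)

for n in (1, 4, 16, 64):
    t, err = async_sgd(n)
    print(f"workers={n:3d}  time for 2000 updates={t:8.1f}  |x|={err:.3e}")
```

With more workers, the wall-clock time for a fixed number of updates drops, but each update is computed on staler parameters, so the error after the same number of updates is worse; past some point adding workers stops paying off.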
arXiv Detail & Related papers (2024-06-17T02:56:55Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O\big(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T}\big)$ with a communication cost of $O(k \log(d))$ at each iteration.
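
SketchedAMSGrad itself is not reproduced here; the sketch below only illustrates the generic count-sketch compression that such methods build on, where each worker sends a small rows-by-$k$ table (on the order of $k \log d$ values) instead of a full length-$d$ gradient and the server approximately recovers the coordinates. The hash construction and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_count_sketch(d, k, rows):
    """Random bucket and sign hashes for a (rows x k) count sketch."""
    buckets = rng.integers(0, k, size=(rows, d))
    signs = rng.choice([-1.0, 1.0], size=(rows, d))
    return buckets, signs

def sketch(g, buckets, signs, k):
    rows = buckets.shape[0]
    S = np.zeros((rows, k))
    for r in range(rows):
        np.add.at(S[r], buckets[r], signs[r] * g)   # hash coordinates into buckets
    return S                                        # rows * k numbers are communicated

def unsketch(S, buckets, signs):
    # Estimate each coordinate as the median over rows of its signed bucket value.
    est = signs * S[np.arange(S.shape[0])[:, None], buckets]
    return np.median(est, axis=0)

d, k, rows = 10_000, 200, 5                         # illustrative sizes
g = np.zeros(d)
g[rng.choice(d, 50, replace=False)] = rng.normal(size=50)   # sparse-ish gradient

buckets, signs = make_count_sketch(d, k, rows)
g_hat = unsketch(sketch(g, buckets, signs, k), buckets, signs)
print("relative recovery error:", np.linalg.norm(g - g_hat) / np.linalg.norm(g))
```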
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - Efficient Distributed Machine Learning via Combinatorial Multi-Armed
Bandits [23.289979018463406]
We consider a distributed gradient descent problem where a main node distributes gradient calculations among $n$ workers, of which at most $b \leq n$ can be utilized in parallel.
By assigning tasks to all the workers and waiting only for the $k$ fastest ones, the main node can trade off the error of the algorithm with its runtime by gradually increasing $k$ as the algorithm evolves.
This strategy, referred to as adaptive $k$-sync, can incur additional costs since it ignores the computational efforts of slow workers.
We propose a cost-efficient scheme that assigns tasks only to $k$
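
The paper's scheme is combinatorial (it selects which workers to use); the toy below only picks a single value of $k$ with plain UCB1, as a simpler illustration of the bandit framing. The reward definition, delay model, and constants are assumptions, not the proposed algorithm.

```python
import math
import numpy as np

rng = np.random.default_rng(3)

def round_time(n, k):
    """Time until the k fastest of n workers respond (exponential delays)."""
    return np.sort(rng.exponential(1.0, size=n))[k - 1]

n = 20
arms = [2, 5, 10, 15, 20]          # candidate values of k (illustrative)
counts = np.zeros(len(arms))
means = np.zeros(len(arms))

for t in range(1, 501):
    if t <= len(arms):             # pull every arm once first
        a = t - 1
    else:                          # UCB1 index
        ucb = means + np.sqrt(2 * math.log(t) / counts)
        a = int(np.argmax(ucb))
    k = arms[a]
    # Illustrative bounded reward: favour collecting many gradients quickly.
    r = (k / n) * math.exp(-round_time(n, k))
    counts[a] += 1
    means[a] += (r - means[a]) / counts[a]

print("estimated utility per k:", dict(zip(arms, np.round(means, 2))))
```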
arXiv Detail & Related papers (2022-02-16T19:18:19Z) - Coded Computation across Shared Heterogeneous Workers with Communication
Delay [42.50248255900814]
We consider a multi-worker distributed computing scenario, where multiple matrix multiplication tasks are encoded and allocated to workers for parallel computation.
We propose worker assignment, resource allocation, and load allocation algorithms under both dedicated and fractional worker assignment policies, where each worker can process encoded tasks.
We show that the proposed algorithms can reduce the task completion delay compared to benchmarks, and observe that the dedicated and fractional worker assignment policies have different scopes of application.
arXiv Detail & Related papers (2021-09-23T09:40:54Z) - Gradient Coding with Dynamic Clustering for Straggler Mitigation [57.9123881133818]
GC-DC regulates the number of straggling workers in each cluster based on the straggler behavior in the previous iteration.
We numerically show that GC-DC provides significant improvements in the average completion time (of each iteration) with no increase in the communication load compared to the original GC scheme.
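
The GC-DC assignment rule is not reproduced here; the sketch below only conveys the general flavour of dynamic clustering, re-spreading the workers that straggled most in the previous iteration across different clusters. The round-robin rule is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(8)

def recluster(slowness, n_clusters):
    """Illustrative dynamic clustering: spread the workers that straggled most
    in the previous iteration across different clusters (not the GC-DC rule)."""
    order = np.argsort(-slowness)              # slowest workers first
    clusters = [[] for _ in range(n_clusters)]
    for rank, worker in enumerate(order):      # deal workers out round-robin
        clusters[rank % n_clusters].append(int(worker))
    return clusters

n_workers, n_clusters = 12, 3
slowness = rng.exponential(1.0, size=n_workers)   # delays observed last iteration
print(recluster(slowness, n_clusters))
```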
arXiv Detail & Related papers (2020-11-03T18:52:15Z) - Coded Distributed Computing with Partial Recovery [56.08535873173518]
We introduce a novel coded matrix-vector multiplication scheme, called coded computation with partial recovery (CCPR).
CCPR reduces both the computation time and the decoding complexity by allowing a trade-off between the accuracy and the speed of computation.
We then extend this approach to distributed implementation of more general computation tasks by proposing a coded communication scheme with partial recovery.
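
The CCPR construction is not reproduced here, but the coded matrix-vector multiplication it builds on is easy to sketch: encode the row blocks of $A$ with an MDS-style generator so that any $k$ of the $n$ worker products recover $Ax$ exactly; partial recovery then allows decoding only part of the result from fewer responses, trading accuracy for speed (not implemented below). The Vandermonde encoder and block sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Split A row-wise into k blocks and encode them into n > k coded blocks
# with a real Vandermonde generator (an illustrative MDS-style encoder).
k, n = 4, 7
rows_per_block, cols = 5, 8
A_blocks = [rng.normal(size=(rows_per_block, cols)) for _ in range(k)]
x = rng.normal(size=cols)

G = np.vander(np.linspace(1.0, 2.0, n), k, increasing=True)   # n x k generator
coded = [sum(G[j, i] * A_blocks[i] for i in range(k)) for j in range(n)]

# Each worker j computes coded[j] @ x; suppose only these workers respond.
responders = [0, 2, 3, 6]
Y = np.stack([coded[j] @ x for j in responders])               # k x rows_per_block
G_S = G[responders]                                            # k x k, invertible

# Decode: recover the k uncoded block products A_i @ x from the k responses.
decoded = np.linalg.solve(G_S, Y)
exact = np.stack([A_blocks[i] @ x for i in range(k)])
print("max decoding error:", np.abs(decoded - exact).max())
```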
arXiv Detail & Related papers (2020-07-04T21:34:49Z) - DaSGD: Squeezing SGD Parallelization Performance in Distributed Training
Using Delayed Averaging [4.652668321425679]
The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagation while gradients are communicated and synchronized.
DaSGD parallelizes SGD and forward/back propagations to hide 100% of the communication overhead.
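
A toy version of the delayed-averaging idea: workers start the next forward/backward pass immediately while the previous step's averaged gradient is still being all-reduced, so that average is applied one step late. The quadratic objective and noise model are assumptions; this is not the DaSGD implementation.

```python
import numpy as np

rng = np.random.default_rng(5)

def local_gradient(w):
    """Toy local gradient of f(w) = ||w||^2 / 2 plus per-worker noise."""
    return w + 0.01 * rng.normal(size=w.shape)

n_workers, lr, dim = 8, 0.1, 10
w = np.ones(dim)
pending = np.mean([local_gradient(w) for _ in range(n_workers)], axis=0)

for step in range(1, 100):
    # Workers start this step's forward/backward pass right away; the
    # all-reduce of `pending` conceptually overlaps with that computation.
    local = [local_gradient(w) for _ in range(n_workers)]
    w -= lr * pending              # previous step's average arrives one step late
    pending = np.mean(local, axis=0)

print("||w|| after delayed-averaging updates:", np.linalg.norm(w))
```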
arXiv Detail & Related papers (2020-05-31T05:43:50Z) - Straggler-aware Distributed Learning: Communication Computation Latency
Trade-off [56.08535873173518]
Straggling workers can be tolerated by assigning redundant computations and coding across data and computations.
In most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations.
Imposing such a limitation results in two main drawbacks: over-computation due to inaccurate prediction of the straggling behaviour, and under-utilization due to treating workers as either stragglers or non-stragglers.
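
The benefit of lifting the one-message-per-iteration restriction can be illustrated with a small simulation in which workers report each partial computation as soon as it finishes, so the work of slow workers is not discarded. The delay model and thresholds are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

def iteration_time(n_workers, per_worker, needed, multi_message):
    """Time until `needed` unit computations are collected in one iteration.
    With multi_message=True a worker reports each unit as it finishes;
    otherwise it reports only after finishing all `per_worker` units."""
    finish = np.cumsum(rng.exponential(1.0, size=(n_workers, per_worker)), axis=1)
    if multi_message:
        arrival_times = np.sort(finish.ravel())          # every partial result counts
    else:
        arrival_times = np.sort(finish[:, -1])           # one message per worker
        arrival_times = np.repeat(arrival_times, per_worker)
    return arrival_times[needed - 1]

n, m, needed, trials = 10, 4, 20, 2000
for multi in (False, True):
    t = np.mean([iteration_time(n, m, needed, multi) for _ in range(trials)])
    print(f"multi-message={multi}:  avg time to collect {needed} results: {t:.2f}")
```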
arXiv Detail & Related papers (2020-04-10T08:39:36Z) - Variance Reduction with Sparse Gradients [82.41780420431205]
Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients.
We introduce a new sparsity operator: the random-top-k operator.
Our algorithm consistently outperforms SpiderBoost on various tasks including image classification, natural language processing, and sparse matrix factorization.
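
The exact random-top-k operator is defined in the paper; the sketch below implements one plausible reading purely as an illustration, keeping the largest-magnitude coordinates exactly and an unbiasedly rescaled random sample of the rest.

```python
import numpy as np

rng = np.random.default_rng(7)

def random_top_k(g, k_top, k_rand):
    """Illustrative sparsifier: keep the k_top largest-magnitude coordinates
    exactly and an unbiased scaled random sample of k_rand of the rest."""
    d = g.size
    out = np.zeros_like(g)
    top = np.argpartition(np.abs(g), -k_top)[-k_top:]
    out[top] = g[top]
    rest = np.setdiff1d(np.arange(d), top)
    sampled = rng.choice(rest, size=k_rand, replace=False)
    out[sampled] = g[sampled] * (rest.size / k_rand)   # rescale for unbiasedness
    return out

g = rng.standard_normal(1000) * np.exp(rng.uniform(0, 3, 1000))  # heavy-tailed gradient
s = random_top_k(g, k_top=50, k_rand=50)
print("nonzeros:", np.count_nonzero(s), " relative error:",
      np.linalg.norm(g - s) / np.linalg.norm(g))
```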
arXiv Detail & Related papers (2020-01-27T08:23:58Z)