Straggler-aware Distributed Learning: Communication Computation Latency Trade-off
- URL: http://arxiv.org/abs/2004.04948v1
- Date: Fri, 10 Apr 2020 08:39:36 GMT
- Title: Straggler-aware Distributed Learning: Communication Computation Latency Trade-off
- Authors: Emre Ozfatura, Sennur Ulukus, Deniz Gunduz
- Abstract summary: Straggling workers can be tolerated by assigning redundant computations and coding across data and computations.
In most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations.
Imposing such a limitation results in two main drawbacks: over-computation due to inaccurate prediction of the straggling behaviour, and under-utilization due to treating workers as either stragglers or non-stragglers.
- Score: 56.08535873173518
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When gradient descent (GD) is scaled to many parallel workers for large scale
machine learning problems, its per-iteration computation time is limited by the
straggling workers. Straggling workers can be tolerated by assigning redundant
computations and coding across data and computations, but in most existing
schemes, each non-straggling worker transmits one message per iteration to the
parameter server (PS) after completing all its computations. Imposing such a
limitation results in two main drawbacks: over-computation due to inaccurate
prediction of the straggling behaviour, and under-utilization due to treating
workers as straggler/non-straggler and discarding partial computations carried
out by stragglers. In this paper, to overcome these drawbacks, we consider
multi-message communication (MMC) by allowing multiple computations to be
conveyed from each worker per iteration, and design straggler avoidance
techniques accordingly. Then, we analyze how the proposed designs can be
employed efficiently to seek a balance between the computation and
communication latency to minimize the overall latency. Furthermore, through
extensive experiments, both model-based simulations and a real implementation on Amazon EC2
servers, we identify the advantages and disadvantages of these designs in
different settings, and demonstrate that MMC can help improve upon existing
straggler avoidance schemes.
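The following is a minimal simulation sketch of the multi-message communication (MMC) idea described above, not the paper's actual scheme: each worker reports the partial gradient of each assigned data partition as soon as it is computed, and the PS performs the GD update once a target fraction of partitions has been recovered rather than waiting for every worker to finish all of its assignments. The compute-time model, partition sizes, recovery fraction, and step size are illustrative assumptions.

```python
import heapq
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: the gradient of 0.5*||Xw - y||^2 is X^T (Xw - y).
n_samples, dim = 1200, 10
X = rng.standard_normal((n_samples, dim))
y = X @ rng.standard_normal(dim)
w = np.zeros(dim)

n_workers, parts_per_worker = 6, 4        # assumed: each worker holds several partitions
partitions = np.array_split(np.arange(n_samples), n_workers * parts_per_worker)
recovery_fraction = 0.8                   # assumed: PS stops after this fraction of partitions

# Each worker processes its partitions in order and sends one message per finished
# partition (multi-message communication); exponential compute times model stragglers.
events = []
for worker in range(n_workers):
    t = 0.0
    for j in range(parts_per_worker):
        t += rng.exponential(1.0)         # assumed per-partition compute time
        heapq.heappush(events, (t, worker * parts_per_worker + j))

needed = int(recovery_fraction * len(partitions))
grad, recovered, t_iter = np.zeros(dim), 0, 0.0
while events and recovered < needed:
    t_iter, part_id = heapq.heappop(events)
    idx = partitions[part_id]
    grad += X[idx].T @ (X[idx] @ w - y[idx])   # partial gradient carried by one message
    recovered += 1

w -= 0.1 * grad / n_samples               # GD step on the partially recovered gradient
print(f"iteration done at t={t_iter:.2f} using {recovered}/{len(partitions)} messages")
```

Stopping after a fraction of the messages trades gradient accuracy against per-iteration latency, which is the kind of balance the paper analyzes.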
Related papers
- SignSGD with Federated Voting [69.06621279967865]
SignSGD with majority voting (signSGD-MV) is an effective distributed learning algorithm that can significantly reduce communication costs by one-bit quantization.
We propose a novel signSGD with federated voting (signSGD-FV).
The idea of federated voting is to exploit learnable weights to perform weighted majority voting.
We demonstrate that the proposed signSGD-FV algorithm has a theoretical convergence guarantee even when edge devices use heterogeneous mini-batch sizes.
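As a rough, hypothetical illustration of the weighted majority voting step summarized above (not the authors' implementation): each worker transmits only the signs of its local gradient, and the server combines these one-bit votes with per-worker weights before taking the sign again; the weights here are made-up stand-ins for the learnable weights signSGD-FV would estimate.

```python
import numpy as np

rng = np.random.default_rng(1)

def weighted_sign_vote(local_grads, weights):
    """Combine one-bit gradient signs with per-worker weights (federated-voting style)."""
    votes = np.sign(local_grads)     # each worker would transmit only these sign bits
    return np.sign(weights @ votes)  # weighted majority vote, coordinate by coordinate

# Hypothetical example: 5 workers, an 8-dimensional gradient, heterogeneous reliability.
local_grads = rng.standard_normal((5, 8))
weights = np.array([0.30, 0.10, 0.25, 0.15, 0.20])  # stand-ins for learned vote weights
update_direction = weighted_sign_vote(local_grads, weights)
print(update_direction)              # entries in {-1.0, 0.0, +1.0}
```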
arXiv Detail & Related papers (2024-03-25T02:32:43Z)
- Fast and Straggler-Tolerant Distributed SGD with Reduced Computation Load [11.069252535469644]
The properties of optimization procedures like stochastic gradient descent (SGD) can be leveraged to mitigate the effect of unresponsive or slow workers, called stragglers.
This can be done by only waiting for a subset of the workers to finish their computation at each iteration of the algorithm.
We construct a novel scheme that adapts both the number of workers and the computation load throughout the run-time of the algorithm.
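Below is a hedged sketch of the "wait only for the fastest workers" idea, with a toy schedule that increases the number of awaited workers over iterations; the schedule, timing model, and parameters are illustrative assumptions rather than the adaptation rule proposed in that paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n_workers, n_iters = 10, 5

for it in range(n_iters):
    # Assumed schedule: wait for more workers in later iterations, when gradient
    # noise hurts more (purely illustrative, not the paper's adaptation rule).
    k = min(n_workers, 4 + 2 * it)
    finish_times = rng.exponential(1.0, size=n_workers)  # per-worker compute latency
    fastest = np.sort(finish_times)[:k]                  # PS waits only for the fastest k
    print(f"iter {it}: aggregated k={k} workers, iteration time {fastest.max():.2f}")
```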
arXiv Detail & Related papers (2023-04-17T20:12:18Z)
- Collaborative Learning over Wireless Networks: An Introductory Overview [84.09366153693361]
We will mainly focus on collaborative training across wireless devices.
Many distributed optimization algorithms have been developed over the last decades.
They provide data locality; that is, a joint model can be trained collaboratively while the data available at each participating device remains local.
arXiv Detail & Related papers (2021-12-07T20:15:39Z)
- Coded Computation across Shared Heterogeneous Workers with Communication Delay [42.50248255900814]
We consider a multi-worker distributed computing scenario, where multiple matrix multiplication tasks are encoded and allocated to workers for parallel computation.
We propose worker assignment, resource allocation, and load allocation algorithms under both dedicated and fractional worker assignment policies, where each worker can process the encoded tasks.
We show that the proposed algorithms can reduce the task completion delay compared to benchmarks, and observe that the dedicated and fractional worker assignment policies have different scopes of application.
arXiv Detail & Related papers (2021-09-23T09:40:54Z)
- A Machine Learning Approach for Task and Resource Allocation in Mobile Edge Computing Based Networks [108.57859531628264]
A joint task, spectrum, and transmit power allocation problem is investigated for a wireless network.
The proposed algorithm can reduce the number of iterations needed for convergence and the maximal delay among all users by up to 18% and 11.1% compared to the standard Q-learning algorithm.
arXiv Detail & Related papers (2020-07-20T13:46:42Z)
- Coded Distributed Computing with Partial Recovery [56.08535873173518]
We introduce a novel coded matrix-vector multiplication scheme, called coded computation with partial recovery (CCPR).
CCPR reduces both the computation time and the decoding complexity by allowing a trade-off between the accuracy and the speed of computation.
We then extend this approach to distributed implementation of more general computation tasks by proposing a coded communication scheme with partial recovery.
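Below is a small, self-contained sketch in the spirit of partial recovery for coded matrix-vector multiplication, not the CCPR construction itself: rows of the matrix are grouped, each group is encoded with a systematic (n, k) real-valued code, any k returned results decode a group exactly, and groups with too few results are simply left unrecovered, yielding a partial result. All parameters below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy setting: compute y = A @ x with the rows of A split into groups of k sub-blocks.
# Each group is encoded with a systematic (n, k) code whose parity rows come from a
# Vandermonde matrix, so ANY k of its n coded results decode the group exactly;
# groups with fewer returned results stay unrecovered (a partial result).
k, n = 3, 5
rows_per_block, dim, n_groups = 4, 6, 2
A = rng.standard_normal((n_groups * k * rows_per_block, dim))
x = rng.standard_normal(dim)

# Systematic generator: identity on top, Vandermonde parity rows below.
parity = np.vander(np.arange(1, n - k + 1, dtype=float), k, increasing=True)
G = np.vstack([np.eye(k), parity])                       # shape (n, k)

y_hat = np.full(A.shape[0], np.nan)                      # NaN marks unrecovered entries
for g in range(n_groups):
    rows = slice(g * k * rows_per_block, (g + 1) * k * rows_per_block)
    blocks = A[rows].reshape(k, rows_per_block, dim)
    coded_results = np.einsum('nk,krd,d->nr', G, blocks, x)   # what the n workers return
    returned = rng.choice(n, size=rng.integers(1, n + 1), replace=False)  # random stragglers
    if len(returned) >= k:                               # enough messages: decode exactly
        sub = np.linalg.solve(G[returned[:k]], coded_results[returned[:k]])
        y_hat[rows] = sub.reshape(-1)

exact = A @ x
recovered = ~np.isnan(y_hat)
err = np.abs(y_hat[recovered] - exact[recovered]).max() if recovered.any() else float("nan")
print(f"recovered {recovered.sum()}/{len(y_hat)} entries of A @ x, max error {err:.2e}")
```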
arXiv Detail & Related papers (2020-07-04T21:34:49Z)
- Communication-Efficient Gradient Coding for Straggler Mitigation in Distributed Learning [17.454251607446555]
Distributed implementations of gradient-based methods, wherein a server distributes gradient computations across worker machines, need to overcome two limitations.
Ye and Abbe [ICML 2018] proposed a coding-theoretic paradigm to characterize a fundamental trade-off between computation load per worker, communication overhead per worker, and straggler tolerance.
We develop a communication-efficient gradient coding framework to overcome these drawbacks.
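To ground the trade-off mentioned above, here is a minimal numeric example of the basic gradient coding construction tolerating one straggler out of three workers, not the communication-efficient scheme this paper proposes; the partial gradients are random placeholders, and only the encoding/decoding structure is the point.

```python
import numpy as np

# Basic gradient coding: data is split into 3 partitions, every worker computes 2
# partial gradients and sends ONE coded linear combination; the full gradient sum
# is recoverable from ANY 2 of the 3 workers, i.e. one straggler is tolerated.
B = np.array([[0.5, 1.0, 0.0],      # worker 0 sends 0.5*g0 + g1
              [0.0, 1.0, -1.0],     # worker 1 sends g1 - g2
              [0.5, 0.0, 1.0]])     # worker 2 sends 0.5*g0 + g2

rng = np.random.default_rng(4)
partial_grads = rng.standard_normal((3, 8))          # g0, g1, g2 (hypothetical values)
messages = B @ partial_grads                         # what each worker transmits

survivors = [0, 2]                                   # pretend worker 1 straggled
# Find decoding coefficients a with a @ B[survivors] = [1, 1, 1] (exists by design).
a, *_ = np.linalg.lstsq(B[survivors].T, np.ones(3), rcond=None)
recovered = a @ messages[survivors]

assert np.allclose(recovered, partial_grads.sum(axis=0))
print("full gradient sum recovered from 2 of 3 workers")
```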
arXiv Detail & Related papers (2020-05-14T17:57:13Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)