Coded Computation across Shared Heterogeneous Workers with Communication
Delay
- URL: http://arxiv.org/abs/2109.11246v1
- Date: Thu, 23 Sep 2021 09:40:54 GMT
- Title: Coded Computation across Shared Heterogeneous Workers with Communication
Delay
- Authors: Yuxuan Sun, Fan Zhang, Junlin Zhao, Sheng Zhou, Zhisheng Niu, Deniz Gündüz
- Abstract summary: We consider a multi-worker distributed computing scenario, where multiple matrix multiplication tasks are encoded and allocated to workers for parallel computation.
We propose worker assignment, resource allocation and load allocation algorithms under both dedicated and fractional worker assignment policies, where each worker can process the encoded tasks of either a single master or multiple masters, respectively.
We show that the proposed algorithms can reduce the task completion delay compared to benchmarks, and observe that dedicated and fractional worker assignment policies have different scopes of applications.
- Score: 42.50248255900814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distributed computing enables large-scale computation tasks to be processed
over multiple workers in parallel. However, the randomness of communication and
computation delays across workers causes the straggler effect, which may
degrade the performance. Coded computation helps to mitigate the straggler
effect, but the amount of redundant load and its assignment to the workers
should be carefully optimized. In this work, we consider a multi-master
heterogeneous-worker distributed computing scenario, where multiple matrix
multiplication tasks are encoded and allocated to workers for parallel
computation. The goal is to minimize the communication plus computation delay
of the slowest task. We propose worker assignment, resource allocation and load
allocation algorithms under both dedicated and fractional worker assignment
policies, where each worker can process the encoded tasks of either a single
master or multiple masters, respectively. Then, the non-convex delay
minimization problem is solved by employing a Markov's inequality-based
approximation, the Karush-Kuhn-Tucker conditions, and successive convex
approximation methods. Through extensive simulations, we show that the proposed
algorithms can reduce the task completion delay compared to the benchmarks, and
observe that dedicated and fractional worker assignment policies have different
scopes of applications.
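The core mechanism the abstract assumes can be illustrated with a short sketch. Below, the row blocks of a matrix A are encoded with a random linear generator matrix, which over the reals behaves like an MDS code: the product A @ x is recoverable from any k of the n workers' results, so the slowest n - k responses can be ignored. All names and sizes are illustrative, not from the paper, whose algorithms additionally optimize how much coded load each worker gets; the Markov's inequality step mentioned above bounds the delay tail as P(T >= t) <= E[T]/t.

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 5, 3               # n workers; any k results suffice (illustrative sizes)
A = rng.standard_normal((6, 4))
x = rng.standard_normal(4)

# Split A into k row blocks and encode them with a random generator
# matrix G: over the reals, any k rows of G are invertible with
# probability one, giving the MDS-style "any k of n" property.
blocks = np.split(A, k)
G = rng.standard_normal((n, k))
coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]

# Worker i computes its coded partial product; pretend only the k
# fastest workers respond.
partials = {i: coded[i] @ x for i in range(n)}
fastest = [4, 1, 2]

# Decode: invert the k x k submatrix of G picked out by the responders.
coeff = np.linalg.inv(G[fastest, :])
decoded = [
    sum(coeff[j, t] * partials[fastest[t]] for t in range(k))
    for j in range(k)
]
assert np.allclose(np.concatenate(decoded), A @ x)   # A @ x from k of n workers
```

Under a dedicated assignment policy a worker would hold coded blocks of a single master's task, while under a fractional policy its computation would be shared across the encoded tasks of several masters.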
Related papers
- ACCO: Accumulate while you Communicate, Hiding Communications in Distributed LLM Training [16.560270624096706]
We propose a memory-efficient optimization algorithm tailored for distributed training of Large Language Models.
Our method relies on a novel technique to mitigate the one-step delay inherent in parallel execution of gradient computations and communications.
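A minimal sketch of the overlap pattern this points at: the all-reduce of step t's gradient runs in a background thread while step t+1's gradient is computed, so updates are applied with a one-step delay. The simulated all-reduce and toy loss below are placeholders, and ACCO's actual mechanism for correcting the delayed update is more involved than this.

```python
import threading
import numpy as np

def simulated_allreduce(grad, out):
    out["avg"] = grad                        # stand-in for a real cross-worker average

def loss_grad(w, batch):
    return 2 * (w - batch.mean())            # toy quadratic-loss gradient

w, lr = np.zeros(4), 0.1
rng = np.random.default_rng(0)
pending = None                               # in-flight all-reduce, if any

for step in range(10):
    batch = rng.standard_normal((8, 4))
    grad = loss_grad(w, batch)               # this compute overlaps the comm below
    if pending is not None:
        thread, holder = pending
        thread.join()                        # previous step's communication done
        w -= lr * holder["avg"]              # apply the one-step-delayed update
    holder = {}
    thread = threading.Thread(target=simulated_allreduce, args=(grad, holder))
    thread.start()                           # communicate while the next step computes
    pending = (thread, holder)

thread, holder = pending                     # drain the last in-flight gradient
thread.join()
w -= lr * holder["avg"]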
arXiv Detail & Related papers (2024-06-03T08:23:45Z)
- Fast and Straggler-Tolerant Distributed SGD with Reduced Computation Load [11.069252535469644]
Optimization procedures like stochastic gradient descent (SGD) can be leveraged to mitigate the effect of unresponsive or slow workers, called stragglers.
This can be done by only waiting for a subset of the workers to finish their computation at each iteration of the algorithm.
We construct a novel scheme that adapts both the number of workers and the computation load throughout the run-time of the algorithm.
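A toy version of the per-iteration rule, assuming nothing about the paper's analysis: the master waits for the k fastest of n workers each round and slowly grows k as training proceeds, with worker finish times simulated as exponential delays.

```python
import numpy as np

rng = np.random.default_rng(1)
n, dim = 10, 5
w = np.zeros(dim)

for it in range(100):
    k = min(n, 4 + it // 25)                      # illustrative schedule for k
    grads = 2 * w + rng.standard_normal((n, dim)) # toy per-worker gradients
    delays = rng.exponential(1.0, size=n)         # simulated finish times
    fastest = np.argsort(delays)[:k]              # wait for the k fastest only
    w -= 0.05 * grads[fastest].mean(axis=0)
```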
arXiv Detail & Related papers (2023-04-17T20:12:18Z)
- Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O\left(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T}\right)$ with the communication cost of $O(k \log(d))$ at each iteration.
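To make the k-versus-d communication trade-off concrete, the sketch below swaps in plain top-k gradient compression (not the paper's sketching data structure) inside an AMSGrad-style update; only k indices and values per iteration would cross the network.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, lr = 100, 10, 0.1
beta1, beta2, eps = 0.9, 0.999, 1e-8
w = np.zeros(d)
m, v, v_hat = np.zeros(d), np.zeros(d), np.zeros(d)

for step in range(50):
    grad = 2 * w + rng.standard_normal(d)   # toy stochastic gradient
    # Top-k compression stands in for the paper's sketch: a worker would
    # transmit only k index/value pairs instead of all d coordinates.
    idx = np.argsort(np.abs(grad))[-k:]     # k largest-magnitude coordinates
    g = np.zeros(d)
    g[idx] = grad[idx]                      # server-side reconstruction
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    v_hat = np.maximum(v_hat, v)            # AMSGrad's non-decreasing v_hat
    w -= lr * m / (np.sqrt(v_hat) + eps)
```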
arXiv Detail & Related papers (2022-10-14T01:42:05Z)
- Efficient Distributed Machine Learning via Combinatorial Multi-Armed Bandits [23.289979018463406]
We consider a distributed gradient descent problem where a main node distributes gradient calculations among $n$ workers from which at most $b \leq n$ can be utilized in parallel.
By assigning tasks to all the workers and waiting only for the $k$ fastest ones, the main node can trade-off the error of the algorithm with its runtime by gradually increasing $k$ as the algorithm evolves.
This strategy, referred to as adaptive $k$-sync, can incur additional costs since it ignores the computational efforts of slow workers.
We propose a cost-efficient scheme that assigns tasks only to $k$ workers.
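A rough sketch of the cost idea under stated assumptions: rather than paying for n computations and keeping only the k fastest, the master assigns tasks to just k workers selected from running estimates of their delays. The epsilon-greedy rule below is a simple stand-in for the paper's combinatorial bandit algorithm.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10
true_scale = rng.uniform(0.5, 2.0, size=n)   # hidden mean delay per worker
counts = np.zeros(n)
mean_delay = np.zeros(n)

for rnd in range(200):
    k = min(n, 2 + rnd // 50)                # gradually increase k over time
    if rng.random() < 0.1:                   # occasionally explore at random
        chosen = rng.choice(n, size=k, replace=False)
    else:
        chosen = np.argsort(mean_delay)[:k]  # k smallest estimated delays;
                                             # untried workers (est. 0) go first
    delays = rng.exponential(true_scale[chosen])   # only k computations paid for
    for i, d in zip(chosen, delays):
        counts[i] += 1
        mean_delay[i] += (d - mean_delay[i]) / counts[i]
```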
arXiv Detail & Related papers (2022-02-16T19:18:19Z)
- A Machine Learning Approach for Task and Resource Allocation in Mobile Edge Computing Based Networks [108.57859531628264]
A joint task, spectrum, and transmit power allocation problem is investigated for a wireless network.
The proposed algorithm can reduce the number of iterations needed for convergence and the maximal delay among all users by up to 18% and 11.1% compared to the standard Q-learning algorithm.
arXiv Detail & Related papers (2020-07-20T13:46:42Z)
- Coded Distributed Computing with Partial Recovery [56.08535873173518]
We introduce a novel coded matrix-vector multiplication scheme, called coded computation with partial recovery (CCPR).
CCPR reduces both the computation time and the decoding complexity by allowing a trade-off between the accuracy and the speed of computation.
We then extend this approach to distributed implementation of more general computation tasks by proposing a coded communication scheme with partial recovery.
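A much-simplified illustration of the partial-recovery trade-off, using a systematic code as an assumption: row blocks that arrive uncoded yield exact entries of A @ x immediately, so the master can return a partially recovered result early instead of waiting to decode everything. CCPR's actual construction and analysis are more general.

```python
import numpy as np

rng = np.random.default_rng(4)
k, n = 4, 6                                  # 4 data blocks + 2 parity blocks
A = rng.standard_normal((8, 3))              # 8 rows -> k blocks of 2 rows each
x = rng.standard_normal(3)
blocks = np.split(A, k)

G = np.vstack([np.eye(k), rng.standard_normal((n - k, k))])  # systematic code
coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]
partials = {i: coded[i] @ x for i in range(n)}

arrived = [0, 2, 5]                          # only 3 of 6 results so far
y_partial = np.full(A.shape[0], np.nan)      # NaN marks unrecovered entries
for i in arrived:
    if i < k:                                # systematic block: exact entries
        y_partial[2 * i: 2 * (i + 1)] = partials[i]
# rows of blocks 1 and 3 remain unrecovered until more results arrive,
# trading output coverage (accuracy) for an earlier finish time
```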
arXiv Detail & Related papers (2020-07-04T21:34:49Z)
- Dynamic Multi-Robot Task Allocation under Uncertainty and Temporal Constraints [52.58352707495122]
We present a multi-robot allocation algorithm that decouples the key computational challenges of sequential decision-making under uncertainty and multi-agent coordination.
We validate our results over a wide range of simulations on two distinct domains: multi-arm conveyor belt pick-and-place and multi-drone delivery dispatch in a city.
arXiv Detail & Related papers (2020-05-27T01:10:41Z)
- Straggler-aware Distributed Learning: Communication Computation Latency Trade-off [56.08535873173518]
Straggling workers can be tolerated by assigning redundant computations and coding across data and computations.
In most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations.
Imposing such a limitation results in two main drawbacks: over-computation due to inaccurate prediction of the straggling behaviour, and under-utilization due to treating workers as either stragglers or non-stragglers.
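The multi-message alternative these drawbacks motivate can be sketched as follows, with a purely illustrative timing model: each worker reports after every computation unit, and the master stops once enough units have arrived in total, so even a slow worker's partial work is used.

```python
import numpy as np

rng = np.random.default_rng(5)
n, units_per_worker, needed = 4, 3, 6        # illustrative sizes
arrivals = []
for worker in range(n):
    t = 0.0
    for unit in range(units_per_worker):
        t += rng.exponential(1.0)            # simulated per-unit computation time
        arrivals.append((t, worker))         # one message per finished unit

arrivals.sort()                              # order seen by the parameter server
used = arrivals[:needed]                     # stop after `needed` unit results
per_worker = {w: sum(1 for _, ww in used if ww == w) for w in range(n)}
print(f"stopped at t={used[-1][0]:.2f}, units used per worker: {per_worker}")
```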
arXiv Detail & Related papers (2020-04-10T08:39:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.