Coded Distributed Computing with Partial Recovery
- URL: http://arxiv.org/abs/2007.02191v2
- Date: Mon, 6 Dec 2021 16:04:39 GMT
- Title: Coded Distributed Computing with Partial Recovery
- Authors: Emre Ozfatura and Sennur Ulukus and Deniz Gunduz
- Abstract summary: We introduce a novel coded matrix-vector multiplication scheme, called coded computation with partial recovery (CCPR)
CCPR reduces both the computation time and the decoding complexity by allowing a trade-off between the accuracy and the speed of computation.
We then extend this approach to distributed implementation of more general computation tasks by proposing a coded communication scheme with partial recovery.
- Score: 56.08535873173518
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Coded computation techniques provide robustness against straggling workers in
distributed computing. However, most of the existing schemes require exact
provisioning of the straggling behaviour and ignore the computations carried
out by straggling workers. Moreover, these schemes are typically designed to
recover the desired computation results accurately, while in many machine
learning and iterative optimization algorithms, faster approximate solutions
are known to result in an improvement in the overall convergence time. In this
paper, we first introduce a novel coded matrix-vector multiplication scheme,
called coded computation with partial recovery (CCPR), which benefits from the
advantages of both coded and uncoded computation schemes, and reduces both the
computation time and the decoding complexity by allowing a trade-off between
the accuracy and the speed of computation. We then extend this approach to
distributed implementation of more general computation tasks by proposing a
coded communication scheme with partial recovery, where the results of subtasks
computed by the workers are coded before being communicated. Numerical
simulations on a large linear regression task confirm the benefits of the
proposed distributed computation scheme with partial recovery in terms of the
trade-off between the computation accuracy and latency.
Related papers
- Randomized Polar Codes for Anytime Distributed Machine Learning [66.46612460837147]
We present a novel distributed computing framework that is robust to slow compute nodes, and is capable of both approximate and exact computation of linear operations.
We propose a sequential decoding algorithm designed to handle real valued data while maintaining low computational complexity for recovery.
We demonstrate the potential applications of this framework in various contexts, such as large-scale matrix multiplication and black-box optimization.
arXiv Detail & Related papers (2023-09-01T18:02:04Z) - Fast and Straggler-Tolerant Distributed SGD with Reduced Computation
Load [11.069252535469644]
optimization procedures like gradient descent (SGD) can be leveraged to mitigate the effect of unresponsive or slow workers called stragglers.
This can be done by only waiting for a subset of the workers to finish their computation at each iteration of the algorithm.
We construct a novel scheme that adapts both the number of workers and the computation load throughout the run-time of the algorithm.
arXiv Detail & Related papers (2023-04-17T20:12:18Z) - Optimization-based Block Coordinate Gradient Coding for Mitigating
Partial Stragglers in Distributed Learning [58.91954425047425]
This paper aims to design a new gradient coding scheme for mitigating partial stragglers in distributed learning.
We propose a gradient coordinate coding scheme with L coding parameters representing L possibly different diversities for the L coordinates, which generates most gradient coding schemes.
arXiv Detail & Related papers (2022-06-06T09:25:40Z) - Parallel Scheduling Self-attention Mechanism: Generalization and
Optimization [0.76146285961466]
We propose a general scheduling algorithm, which is derived from the optimum scheduling for small instances solved by a satisfiability checking(SAT) solver.
Strategies for further optimization on skipping redundant computations are put forward as well, with which reductions of almost 25% and 50% of the original computations are respectively achieved.
The proposed algorithms are applicable regardless of problem sizes, as long as the number of input vectors is divisible to the number of computing units available in the architecture.
arXiv Detail & Related papers (2020-12-02T12:04:16Z) - Iterative Algorithm Induced Deep-Unfolding Neural Networks: Precoding
Design for Multiuser MIMO Systems [59.804810122136345]
We propose a framework for deep-unfolding, where a general form of iterative algorithm induced deep-unfolding neural network (IAIDNN) is developed.
An efficient IAIDNN based on the structure of the classic weighted minimum mean-square error (WMMSE) iterative algorithm is developed.
We show that the proposed IAIDNN efficiently achieves the performance of the iterative WMMSE algorithm with reduced computational complexity.
arXiv Detail & Related papers (2020-06-15T02:57:57Z) - Age-Based Coded Computation for Bias Reduction in Distributed Learning [57.9123881133818]
Coded computation can be used to speed up distributed learning in the presence of straggling workers.
Partial recovery of the gradient vector can further reduce the computation time at each iteration.
Estimator bias will be particularly prevalent when the straggling behavior is correlated over time.
arXiv Detail & Related papers (2020-06-02T17:51:11Z) - Straggler-aware Distributed Learning: Communication Computation Latency
Trade-off [56.08535873173518]
Straggling workers can be tolerated by assigning redundant computations and coding across data and computations.
In most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations.
Imposing such a limitation results in two main drawbacks; over-computation due to inaccurate prediction of the straggling behaviour, and under-utilization due to treating workers as straggler/non-straggler.
arXiv Detail & Related papers (2020-04-10T08:39:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.