Nested Gradient Codes for Straggler Mitigation in Distributed Machine
Learning
- URL: http://arxiv.org/abs/2212.08580v1
- Date: Fri, 16 Dec 2022 16:56:51 GMT
- Title: Nested Gradient Codes for Straggler Mitigation in Distributed Machine
Learning
- Authors: Luis Maßny, Christoph Hofmeister, Maximilian Egger, Rawad Bitar,
Antonia Wachter-Zeh
- Abstract summary: Gradient codes are designed to tolerate a fixed number of stragglers.
We propose a gradient coding scheme that can tolerate a flexible number of stragglers.
By proper task scheduling and small additional signaling, our scheme adapts the load of the workers to the actual number of stragglers.
- Score: 21.319460501659666
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider distributed learning in the presence of slow and unresponsive
worker nodes, referred to as stragglers. In order to mitigate the effect of
stragglers, gradient coding redundantly assigns partial computations to the
workers such that the overall result can be recovered from only the
non-straggling workers. Gradient codes are designed to tolerate a fixed number
of stragglers. Since the number of stragglers in practice is random and unknown
a priori, tolerating a fixed number of stragglers can yield a sub-optimal
computation load and can result in higher latency. We propose a gradient coding
scheme that can tolerate a flexible number of stragglers by carefully
concatenating gradient codes for different straggler tolerances. By proper task
scheduling and small additional signaling, our scheme adapts the computation
load of the workers to the actual number of stragglers. We analyze the latency
of our proposed scheme and show that it has a significantly lower latency than
gradient codes.
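To make the baseline concrete, here is a minimal sketch of one classical gradient-code construction (a fractional repetition code with a fixed straggler tolerance s), not the nested scheme proposed in this paper; the function names and toy parameters are illustrative assumptions.
```python
import numpy as np

def fractional_repetition_assignment(n_workers, s):
    """One classical gradient-code construction (fractional repetition):
    workers are split into n/(s+1) groups and every worker in a group is
    assigned the same block of data partitions, so any s stragglers still
    leave at least one surviving worker per group."""
    assert n_workers % (s + 1) == 0, "this construction needs (s+1) | n_workers"
    n_groups = n_workers // (s + 1)
    blocks = np.array_split(np.arange(n_workers), n_groups)  # partitions per group
    return [blocks[w // (s + 1)] for w in range(n_workers)]

def recover_full_gradient(partial_grads, assignment, s, stragglers):
    """Master side: each non-straggler returns the sum of its assigned
    partial gradients; the full gradient is the sum of one reply per group."""
    total, covered = np.zeros_like(partial_grads[0]), set()
    for w, block in enumerate(assignment):
        group = w // (s + 1)
        if w in stragglers or group in covered:
            continue
        covered.add(group)
        total += sum(partial_grads[p] for p in block)  # the worker's single message
    assert len(covered) == len(assignment) // (s + 1), "more than s stragglers"
    return total

# Toy check: 6 workers, 6 data partitions, fixed tolerance s = 2.
rng = np.random.default_rng(0)
partials = [rng.normal(size=4) for _ in range(6)]  # gradient of each partition
assignment = fractional_repetition_assignment(6, s=2)
g = recover_full_gradient(partials, assignment, s=2, stragglers={1, 4})
assert np.allclose(g, sum(partials))
```
The rigidity visible here, an assignment fixed for a single value of s, is exactly what the nested construction relaxes by concatenating codes for several tolerance levels.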
Related papers
- DEFT: Exploiting Gradient Norm Difference between Model Layers for
Scalable Gradient Sparsification [0.6091702876917281]
Gradient sparsification is a widely adopted solution for reducing the excessive communication traffic in distributed deep learning.
We propose a novel gradient sparsification scheme, DEFT, that partitions the gradient selection task into sub-tasks and distributes them to workers.
We show that DEFT achieves a significant improvement in training performance over existing sparsifiers, in terms of gradient selection speed.
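As a rough illustration of partitioned gradient selection (not DEFT's actual partitioning rule, which also exploits per-layer gradient norms), the sketch below splits the flat gradient into equal chunks and lets each worker run top-k only on its own chunk; the names and sizes are assumptions.
```python
import numpy as np

def partitioned_topk(grad, n_workers, k_per_worker):
    """Split the flat gradient into disjoint chunks, run top-k selection
    independently on each chunk (one sub-task per worker), and merge the
    selected coordinates into one sparse gradient."""
    chunks = np.array_split(np.arange(grad.size), n_workers)
    selected = []
    for idx in chunks:  # each chunk is one worker's selection sub-task
        local = grad[idx]
        selected.append(idx[np.argsort(np.abs(local))[-k_per_worker:]])
    keep = np.concatenate(selected)
    sparse = np.zeros_like(grad)
    sparse[keep] = grad[keep]
    return sparse, keep

grad = np.random.default_rng(1).normal(size=1000)
sparse, keep = partitioned_topk(grad, n_workers=4, k_per_worker=10)
assert keep.size == 40  # 4 workers x 10 selected coordinates each
```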
arXiv Detail & Related papers (2023-07-07T10:29:25Z)
- Optimization-based Block Coordinate Gradient Coding for Mitigating Partial Stragglers in Distributed Learning [58.91954425047425]
This paper aims to design a new gradient coding scheme for mitigating partial stragglers in distributed learning.
We propose a gradient coordinate coding scheme with L coding parameters representing L possibly different diversities for the L coordinates, which generalizes most existing gradient coding schemes.
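As an assumption-laden sketch of the per-coordinate idea (plain repetition stands in for the paper's actual coding), each coordinate block l below gets its own diversity r_l and is placed on r_l workers, so it survives up to r_l - 1 stragglers; names and numbers are illustrative.
```python
def per_coordinate_assignment(n_workers, diversities):
    """Coordinate block l is replicated on r_l cyclically consecutive
    workers, so block l tolerates up to r_l - 1 stragglers."""
    tasks = {w: [] for w in range(n_workers)}
    for blk, r in enumerate(diversities):
        for j in range(r):
            tasks[(blk + j) % n_workers].append(blk)
    return tasks

# L = 4 coordinate blocks with possibly different diversities r_l
print(per_coordinate_assignment(n_workers=5, diversities=[1, 2, 2, 3]))
# -> {0: [0, 3], 1: [1], 2: [1, 2], 3: [2, 3], 4: [3]}
```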
arXiv Detail & Related papers (2022-06-06T09:25:40Z)
- Lightweight Projective Derivative Codes for Compressed Asynchronous Gradient Descent [6.055286666916789]
This paper proposes a novel algorithm that encodes the partial derivatives themselves and furthermore optimizes the codes by performing lossy compression on the derivative codewords.
The utility of this application of coding theory is a geometrical consequence of the observation in optimization research that noise is tolerable, and sometimes even helpful, in gradient-descent-based learning algorithms.
arXiv Detail & Related papers (2022-01-31T04:08:53Z)
- Gradient Coding with Dynamic Clustering for Straggler Mitigation [57.9123881133818]
GC-DC regulates the number of straggling workers in each cluster based on the straggler behavior in the previous iteration.
We numerically show that GC-DC provides significant improvements in the average completion time (of each iteration) with no increase in the communication load compared to the original GC scheme.
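The summary does not spell out the clustering rule, so the following is only a plausible reading, not the paper's algorithm: spread the workers that straggled in the previous iteration as evenly as possible over the clusters before the next iteration.
```python
import numpy as np

def reassign_clusters(was_straggler, n_clusters):
    """Deal workers round-robin into clusters, listing last iteration's
    stragglers first, so consecutive stragglers land in different clusters
    and no single cluster has to tolerate too many of them."""
    workers = np.arange(len(was_straggler))
    order = np.concatenate([workers[was_straggler], workers[~was_straggler]])
    return {int(w): int(i % n_clusters) for i, w in enumerate(order)}

prev = np.array([True, True, False, False, True, False, False, False])
print(reassign_clusters(prev, n_clusters=2))
# stragglers 0, 1, 4 are split as 0 -> cluster 0, 1 -> cluster 1, 4 -> cluster 0
```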
arXiv Detail & Related papers (2020-11-03T18:52:15Z)
- Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
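A toy simulation of one synchronous step with a simple top-k compressor applied before averaging; the compression schemes the paper actually studies, and the parameter names used here, may differ.
```python
import numpy as np

def topk_compress(g, k):
    """Keep only the k largest-magnitude entries (one simple compressor)."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

def synchronous_step(params, local_grads, lr=0.1, k=None):
    """Every worker contributes its (optionally compressed) local gradient,
    and all workers apply the same averaged update."""
    grads = [topk_compress(g, k) if k else g for g in local_grads]
    return params - lr * np.mean(grads, axis=0)

rng = np.random.default_rng(2)
params = np.zeros(10)
local = [rng.normal(size=10) for _ in range(4)]   # one local gradient per worker
params = synchronous_step(params, local, k=3)     # sparse-communication variant
```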
arXiv Detail & Related papers (2020-09-19T17:28:11Z)
- Approximate Gradient Coding with Optimal Decoding [22.069731881164895]
In distributed optimization problems, a technique called gradient coding has been used to mitigate the effect of straggling machines.
Recent work has studied approximate gradient coding, which concerns coding schemes where the replication factor of the data is too low to recover the full gradient exactly.
We introduce novel approximate gradient codes based on expander graphs, in which each machine receives exactly two blocks of data points.
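A simplified sketch of the assignment shape described above: every machine holds exactly two data blocks (a plain cycle stands in for the expander-graph construction) and replies with the sum of their gradients. The naive rescaling decoder below is an assumption; the paper derives an optimal decoder.
```python
import numpy as np

def two_block_assignment(n):
    """Machine i holds blocks i and i+1 (mod n), so each machine gets
    exactly two blocks and each block is replicated on two machines."""
    return [(i, (i + 1) % n) for i in range(n)]

def approximate_gradient(block_grads, assignment, replies):
    """Each replying machine sends the sum of its two block gradients; a
    naive decoder rescales the received sums to approximate the full
    gradient (the paper instead optimizes the decoding weights)."""
    received = [block_grads[a] + block_grads[b]
                for m, (a, b) in enumerate(assignment) if m in replies]
    return (len(assignment) / (2 * len(received))) * np.sum(received, axis=0)

rng = np.random.default_rng(3)
blocks = [rng.normal(size=5) for _ in range(8)]
estimate = approximate_gradient(blocks, two_block_assignment(8),
                                replies=set(range(8)) - {2, 6})
# With no stragglers the estimate is exact; with stragglers it is approximate.
```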
arXiv Detail & Related papers (2020-06-17T03:46:27Z)
- Age-Based Coded Computation for Bias Reduction in Distributed Learning [57.9123881133818]
Coded computation can be used to speed up distributed learning in the presence of straggling workers.
Partial recovery of the gradient vector can further reduce the computation time at each iteration.
Estimator bias will be particularly prevalent when the straggling behavior is correlated over time.
arXiv Detail & Related papers (2020-06-02T17:51:11Z)
- Communication-Efficient Gradient Coding for Straggler Mitigation in Distributed Learning [17.454251607446555]
Distributed implementations of gradient-based methods, wherein a server distributes gradient computations across worker machines, need to overcome two limitations.
Ye and Abbe [ICML 2018] proposed a coding-theoretic paradigm to characterize a fundamental trade-off between computation load per worker, communication overhead per worker, and straggler tolerance.
We develop a communication-efficient gradient coding framework to overcome these drawbacks.
arXiv Detail & Related papers (2020-05-14T17:57:13Z)
- Straggler-aware Distributed Learning: Communication Computation Latency Trade-off [56.08535873173518]
Straggling workers can be tolerated by assigning redundant computations and coding across data and computations.
In most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations.
Imposing such a limitation results in two main drawbacks: over-computation due to inaccurate prediction of the straggling behaviour, and under-utilization due to treating workers as either stragglers or non-stragglers.
arXiv Detail & Related papers (2020-04-10T08:39:36Z)
- Variance Reduction with Sparse Gradients [82.41780420431205]
Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients.
We introduce a new sparsity operator: the random-top-k operator.
Our algorithm consistently outperforms SpiderBoost on various tasks including image classification, natural language processing, and sparse matrix factorization.
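The summary does not define the random-top-k operator, so the sketch below is only a generic mixture of top-k and uniform random selection, labeled as an assumption rather than the paper's operator.
```python
import numpy as np

def random_top_k(g, k_top, k_rand, rng):
    """Generic sketch: keep the k_top largest-magnitude entries and add
    k_rand entries sampled uniformly from the rest, rescaled by the inverse
    sampling probability so the random part is unbiased in expectation."""
    out = np.zeros_like(g)
    top = np.argsort(np.abs(g))[-k_top:]
    out[top] = g[top]
    rest = np.setdiff1d(np.arange(g.size), top)
    pick = rng.choice(rest, size=k_rand, replace=False)
    out[pick] = g[pick] * (rest.size / k_rand)
    return out

g = np.random.default_rng(4).normal(size=100)
s = random_top_k(g, k_top=5, k_rand=5, rng=np.random.default_rng(5))
assert np.count_nonzero(s) == 10
```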
arXiv Detail & Related papers (2020-01-27T08:23:58Z)