Nested Gradient Codes for Straggler Mitigation in Distributed Machine
Learning
- URL: http://arxiv.org/abs/2212.08580v1
- Date: Fri, 16 Dec 2022 16:56:51 GMT
- Title: Nested Gradient Codes for Straggler Mitigation in Distributed Machine
Learning
- Authors: Luis Maßny, Christoph Hofmeister, Maximilian Egger, Rawad Bitar,
Antonia Wachter-Zeh
- Abstract summary: Gradient codes are designed to tolerate a fixed number of stragglers.
We propose a gradient coding scheme that can tolerate a flexible number of stragglers.
By proper task scheduling and small additional signaling, our scheme adapts the load of the workers to the actual number of stragglers.
- Score: 21.319460501659666
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider distributed learning in the presence of slow and unresponsive
worker nodes, referred to as stragglers. In order to mitigate the effect of
stragglers, gradient coding redundantly assigns partial computations to the
workers such that the overall result can be recovered from only the
non-straggling workers. Gradient codes are designed to tolerate a fixed number
of stragglers. Since the number of stragglers in practice is random and unknown
a priori, tolerating a fixed number of stragglers can yield a sub-optimal
computation load and can result in higher latency. We propose a gradient coding
scheme that can tolerate a flexible number of stragglers by carefully
concatenating gradient codes for different straggler tolerances. By proper task
scheduling and small additional signaling, our scheme adapts the computation
load of the workers to the actual number of stragglers. We analyze the latency
of our proposed scheme and show that it has a significantly lower latency than
gradient codes.
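To make the baseline concrete, here is a minimal sketch of one classical gradient-code construction (a fractional repetition code with a fixed straggler tolerance s), not the nested scheme proposed in this paper; the function names and toy parameters are illustrative assumptions.
```python
import numpy as np

def fractional_repetition_assignment(n_workers, s):
    """One classical gradient-code construction (fractional repetition):
    workers are split into n/(s+1) groups and every worker in a group is
    assigned the same block of data partitions, so any s stragglers still
    leave at least one surviving worker per group."""
    assert n_workers % (s + 1) == 0, "this construction needs (s+1) | n_workers"
    n_groups = n_workers // (s + 1)
    blocks = np.array_split(np.arange(n_workers), n_groups)  # partitions per group
    return [blocks[w // (s + 1)] for w in range(n_workers)]

def recover_full_gradient(partial_grads, assignment, s, stragglers):
    """Master side: each non-straggler returns the sum of its assigned
    partial gradients; the full gradient is the sum of one reply per group."""
    total, covered = np.zeros_like(partial_grads[0]), set()
    for w, block in enumerate(assignment):
        group = w // (s + 1)
        if w in stragglers or group in covered:
            continue
        covered.add(group)
        total += sum(partial_grads[p] for p in block)  # the worker's single message
    assert len(covered) == len(assignment) // (s + 1), "more than s stragglers"
    return total

# Toy check: 6 workers, 6 data partitions, fixed tolerance s = 2.
rng = np.random.default_rng(0)
partials = [rng.normal(size=4) for _ in range(6)]  # gradient of each partition
assignment = fractional_repetition_assignment(6, s=2)
g = recover_full_gradient(partials, assignment, s=2, stragglers={1, 4})
assert np.allclose(g, sum(partials))
```
The rigidity visible here, an assignment fixed for a single value of s, is exactly what the nested construction relaxes by concatenating codes for several tolerance levels.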
Related papers
- DEFT: Exploiting Gradient Norm Difference between Model Layers for
Scalable Gradient Sparsification [0.6091702876917281]
Gradient sparsification is a widely adopted solution for reducing the excessive communication traffic in distributed deep learning.
We propose a novel gradient sparsification scheme, DEFT, that partitions the gradient selection task into sub-tasks and distributes them to workers.
We show that DEFT achieves a significant improvement in training performance over existing sparsifiers, in terms of gradient selection speed.
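As a rough illustration of partitioned gradient selection (not DEFT's actual partitioning rule, which also exploits per-layer gradient norms), the sketch below splits the flat gradient into equal chunks and lets each worker run top-k only on its own chunk; the names and sizes are assumptions.
```python
import numpy as np

def partitioned_topk(grad, n_workers, k_per_worker):
    """Split the flat gradient into disjoint chunks, run top-k selection
    independently on each chunk (one sub-task per worker), and merge the
    selected coordinates into one sparse gradient."""
    chunks = np.array_split(np.arange(grad.size), n_workers)
    selected = []
    for idx in chunks:  # each chunk is one worker's selection sub-task
        local = grad[idx]
        selected.append(idx[np.argsort(np.abs(local))[-k_per_worker:]])
    keep = np.concatenate(selected)
    sparse = np.zeros_like(grad)
    sparse[keep] = grad[keep]
    return sparse, keep

grad = np.random.default_rng(1).normal(size=1000)
sparse, keep = partitioned_topk(grad, n_workers=4, k_per_worker=10)
assert keep.size == 40  # 4 workers x 10 selected coordinates each
```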
arXiv Detail & Related papers (2023-07-07T10:29:25Z)
- Optimization-based Block Coordinate Gradient Coding for Mitigating Partial Stragglers in Distributed Learning [58.91954425047425]
This paper aims to design a new gradient coding scheme for mitigating partial stragglers in distributed learning.
We propose a gradient coordinate coding scheme with L coding parameters representing L possibly different diversities for the L coordinates, which generalizes most existing gradient coding schemes.
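As an assumption-laden sketch of the per-coordinate idea (plain repetition stands in for the paper's actual coding), each coordinate block l below gets its own diversity r_l and is placed on r_l workers, so it survives up to r_l - 1 stragglers; names and numbers are illustrative.
```python
def per_coordinate_assignment(n_workers, diversities):
    """Coordinate block l is replicated on r_l cyclically consecutive
    workers, so block l tolerates up to r_l - 1 stragglers."""
    tasks = {w: [] for w in range(n_workers)}
    for blk, r in enumerate(diversities):
        for j in range(r):
            tasks[(blk + j) % n_workers].append(blk)
    return tasks

# L = 4 coordinate blocks with possibly different diversities r_l
print(per_coordinate_assignment(n_workers=5, diversities=[1, 2, 2, 3]))
# -> {0: [0, 3], 1: [1], 2: [1, 2], 3: [2, 3], 4: [3]}
```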
arXiv Detail & Related papers (2022-06-06T09:25:40Z)
- Lightweight Projective Derivative Codes for Compressed Asynchronous Gradient Descent [6.055286666916789]
This paper proposes a novel algorithm that encodes the partial derivatives themselves and furthermore optimizes the codes by performing lossy compression on the derivative codewords.
The utility of this application of coding theory is a geometrical consequence of the observation in optimization research that noise is tolerable, and sometimes even helpful, in gradient-descent-based learning algorithms.
arXiv Detail & Related papers (2022-01-31T04:08:53Z)
- Gradient Coding with Dynamic Clustering for Straggler Mitigation [57.9123881133818]
GC-DC regulates the number of straggling workers in each cluster based on the straggler behavior in the previous iteration.
We numerically show that GC-DC provides significant improvements in the average completion time (of each iteration) with no increase in the communication load compared to the original GC scheme.
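The summary does not spell out the clustering rule, so the following is only a plausible reading, not the paper's algorithm: spread the workers that straggled in the previous iteration as evenly as possible over the clusters before the next iteration.
```python
import numpy as np

def reassign_clusters(was_straggler, n_clusters):
    """Deal workers round-robin into clusters, listing last iteration's
    stragglers first, so consecutive stragglers land in different clusters
    and no single cluster has to tolerate too many of them."""
    workers = np.arange(len(was_straggler))
    order = np.concatenate([workers[was_straggler], workers[~was_straggler]])
    return {int(w): int(i % n_clusters) for i, w in enumerate(order)}

prev = np.array([True, True, False, False, True, False, False, False])
print(reassign_clusters(prev, n_clusters=2))
# stragglers 0, 1, 4 are split as 0 -> cluster 0, 1 -> cluster 1, 4 -> cluster 0
```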
arXiv Detail & Related papers (2020-11-03T18:52:15Z)
- Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
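A toy simulation of one synchronous step with a simple top-k compressor applied before averaging; the compression schemes the paper actually studies, and the parameter names used here, may differ.
```python
import numpy as np

def topk_compress(g, k):
    """Keep only the k largest-magnitude entries (one simple compressor)."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

def synchronous_step(params, local_grads, lr=0.1, k=None):
    """Every worker contributes its (optionally compressed) local gradient,
    and all workers apply the same averaged update."""
    grads = [topk_compress(g, k) if k else g for g in local_grads]
    return params - lr * np.mean(grads, axis=0)

rng = np.random.default_rng(2)
params = np.zeros(10)
local = [rng.normal(size=10) for _ in range(4)]   # one local gradient per worker
params = synchronous_step(params, local, k=3)     # sparse-communication variant
```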
arXiv Detail & Related papers (2020-09-19T17:28:11Z)
- Approximate Gradient Coding with Optimal Decoding [22.069731881164895]
In distributed optimization problems, a technique called gradient coding has been used to mitigate the effect of straggling machines.
Recent work has studied approximate gradient coding, which concerns coding schemes where the replication factor of the data is too low to recover the full gradient exactly.
We introduce novel approximate gradient codes based on expander graphs, in which each machine receives exactly two blocks of data points.
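A simplified sketch of the assignment shape described above: every machine holds exactly two data blocks (a plain cycle stands in for the expander-graph construction) and replies with the sum of their gradients. The naive rescaling decoder below is an assumption; the paper derives an optimal decoder.
```python
import numpy as np

def two_block_assignment(n):
    """Machine i holds blocks i and i+1 (mod n), so each machine gets
    exactly two blocks and each block is replicated on two machines."""
    return [(i, (i + 1) % n) for i in range(n)]

def approximate_gradient(block_grads, assignment, replies):
    """Each replying machine sends the sum of its two block gradients; a
    naive decoder rescales the received sums to approximate the full
    gradient (the paper instead optimizes the decoding weights)."""
    received = [block_grads[a] + block_grads[b]
                for m, (a, b) in enumerate(assignment) if m in replies]
    return (len(assignment) / (2 * len(received))) * np.sum(received, axis=0)

rng = np.random.default_rng(3)
blocks = [rng.normal(size=5) for _ in range(8)]
estimate = approximate_gradient(blocks, two_block_assignment(8),
                                replies=set(range(8)) - {2, 6})
# With no stragglers the estimate is exact; with stragglers it is approximate.
```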
arXiv Detail & Related papers (2020-06-17T03:46:27Z)
- Age-Based Coded Computation for Bias Reduction in Distributed Learning [57.9123881133818]
Coded computation can be used to speed up distributed learning in the presence of straggling workers.
Partial recovery of the gradient vector can further reduce the computation time at each iteration.
Estimator bias will be particularly prevalent when the straggling behavior is correlated over time.
arXiv Detail & Related papers (2020-06-02T17:51:11Z)
- Communication-Efficient Gradient Coding for Straggler Mitigation in Distributed Learning [17.454251607446555]
Distributed implementations of gradient-based methods, wherein a server distributes gradient computations across worker machines, need to overcome two limitations.
Ye and Abbe [ICML 2018] proposed a coding-theoretic paradigm to characterize a fundamental trade-off between computation load per worker, communication overhead per worker, and straggler tolerance.
We develop a communication-efficient gradient coding framework to overcome these drawbacks.
arXiv Detail & Related papers (2020-05-14T17:57:13Z)
- Straggler-aware Distributed Learning: Communication Computation Latency Trade-off [56.08535873173518]
Straggling workers can be tolerated by assigning redundant computations and coding across data and computations.
In most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations.
Imposing such a limitation results in two main drawbacks: over-computation due to inaccurate prediction of the straggling behaviour, and under-utilization due to treating workers as either stragglers or non-stragglers.
arXiv Detail & Related papers (2020-04-10T08:39:36Z)
- Variance Reduction with Sparse Gradients [82.41780420431205]
Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients.
We introduce a new sparsity operator: the random-top-k operator.
Our algorithm consistently outperforms SpiderBoost on various tasks including image classification, natural language processing, and sparse matrix factorization.
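The summary does not define the random-top-k operator, so the sketch below is only a generic mixture of top-k and uniform random selection, labeled as an assumption rather than the paper's operator.
```python
import numpy as np

def random_top_k(g, k_top, k_rand, rng):
    """Generic sketch: keep the k_top largest-magnitude entries and add
    k_rand entries sampled uniformly from the rest, rescaled by the inverse
    sampling probability so the random part is unbiased in expectation."""
    out = np.zeros_like(g)
    top = np.argsort(np.abs(g))[-k_top:]
    out[top] = g[top]
    rest = np.setdiff1d(np.arange(g.size), top)
    pick = rng.choice(rest, size=k_rand, replace=False)
    out[pick] = g[pick] * (rest.size / k_rand)
    return out

g = np.random.default_rng(4).normal(size=100)
s = random_top_k(g, k_top=5, k_rand=5, rng=np.random.default_rng(5))
assert np.count_nonzero(s) == 10
```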
arXiv Detail & Related papers (2020-01-27T08:23:58Z)