Sequential Gradient Coding For Straggler Mitigation
- URL: http://arxiv.org/abs/2211.13802v2
- Date: Wed, 28 Jun 2023 14:42:36 GMT
- Title: Sequential Gradient Coding For Straggler Mitigation
- Authors: M. Nikhil Krishnan, MohammadReza Ebrahimi, Ashish Khisti
- Abstract summary: In distributed computing, slower nodes (stragglers) usually become a bottleneck.
Gradient Coding (GC) is an efficient technique that uses principles of error-correcting codes to distribute gradient computation in the presence of stragglers.
We propose two schemes that demonstrate improved performance compared to GC.
- Score: 28.090458692750023
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In distributed computing, slower nodes (stragglers) usually become a
bottleneck. Gradient Coding (GC), introduced by Tandon et al., is an efficient
technique that uses principles of error-correcting codes to distribute gradient
computation in the presence of stragglers. In this paper, we consider the
distributed computation of a sequence of gradients $\{g(1),g(2),\ldots,g(J)\}$,
where processing of each gradient $g(t)$ starts in round-$t$ and finishes by
round-$(t+T)$. Here $T\geq 0$ denotes a delay parameter. For the GC scheme,
coding is only across computing nodes and this results in a solution where
$T=0$. On the other hand, having $T>0$ allows for designing schemes which
exploit the temporal dimension as well. In this work, we propose two schemes
that demonstrate improved performance compared to GC. Our first scheme combines
GC with selective repetition of previously unfinished tasks and achieves
improved straggler mitigation. In our second scheme, which constitutes our main
contribution, we apply GC to a subset of the tasks and repetition for the
remainder of the tasks. We then multiplex these two classes of tasks across
workers and rounds in an adaptive manner, based on past straggler patterns.
Using theoretical analysis, we demonstrate that our second scheme achieves
significant reduction in the computational load. In our experiments, we study a
practical setting of concurrently training multiple neural networks over an AWS
Lambda cluster involving 256 worker nodes, where our framework naturally
applies. We demonstrate that the latter scheme can yield a 16% improvement in
runtime over the baseline GC scheme, in the presence of naturally occurring,
non-simulated stragglers.
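For context on the GC baseline referenced above, a minimal sketch of the fractional-repetition gradient code of Tandon et al. is given below. It is an illustration under simplifying assumptions (one data partition per worker, simulated gradients, hypothetical variable names), not the sequential schemes proposed in this paper; it only shows the encode/decode principle those schemes build on.

```python
import numpy as np

# Minimal sketch of the fractional-repetition gradient code (Tandon et al.),
# i.e. the GC baseline the abstract compares against. Each block of partial
# gradients is replicated s+1 times, so any s stragglers can be ignored.

rng = np.random.default_rng(0)

n = 6   # number of workers
s = 2   # stragglers tolerated; the scheme needs (s + 1) to divide n
d = 4   # gradient dimension
k = n   # number of data partitions (one per worker, as in the GC construction)

# Stand-ins for the per-partition gradients computed on the data parts.
partial_grads = rng.standard_normal((k, d))
full_gradient = partial_grads.sum(axis=0)

# Worker layout: (s + 1) identical groups of n / (s + 1) workers. Within a
# group, the k partitions are split into n / (s + 1) disjoint blocks, one block
# per worker position, so every block is replicated (s + 1) times cluster-wide.
group_size = n // (s + 1)
blocks = np.array_split(np.arange(k), group_size)

def worker_message(worker_id: int) -> np.ndarray:
    """Encoding at a worker: the sum of the partial gradients in its block."""
    position = worker_id % group_size          # position inside its group
    return partial_grads[blocks[position]].sum(axis=0)

# One round in which some (at most s) workers straggle and never respond.
stragglers = {1, 3}
received = {w: worker_message(w) for w in range(n) if w not in stragglers}

# Decoding at the master: every block survives on at least one worker, so the
# full gradient is recovered by picking one replica per block and summing.
decoded = np.zeros(d)
for position in range(group_size):
    replica = next(w for w in received if w % group_size == position)
    decoded += received[replica]

assert np.allclose(decoded, full_gradient)
print("full gradient recovered despite stragglers", sorted(stragglers))
```

The paper's schemes additionally exploit the temporal dimension (the delay parameter $T$) by repeating or multiplexing unfinished tasks across rounds; that adaptive logic is not shown here.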
Related papers
- Rethinking and Accelerating Graph Condensation: A Training-Free Approach with Class Partition [56.26113670151363]
Graph condensation is a data-centric solution to replace the large graph with a small yet informative condensed graph.
Existing GC methods suffer from intricate optimization processes, necessitating excessive computing resources.
We propose a training-free GC framework termed Class-partitioned Graph Condensation (CGC).
CGC achieves state-of-the-art performance with a more efficient condensation process.
arXiv Detail & Related papers (2024-05-22T14:57:09Z)
- T-GAE: Transferable Graph Autoencoder for Network Alignment [79.89704126746204]
T-GAE is a graph autoencoder framework that leverages transferability and stability of GNNs to achieve efficient network alignment without retraining.
Our experiments demonstrate that T-GAE outperforms the state-of-the-art optimization method and the best GNN approach by up to 38.7% and 50.8%, respectively.
arXiv Detail & Related papers (2023-10-05T02:58:29Z)
- Convergence of ease-controlled Random Reshuffling gradient Algorithms under Lipschitz smoothness [0.0]
We consider minimizing the average of a very large number of smooth, possibly non-convex functions, and we use two widely used minibatch frameworks, Incremental Gradient (IG) and Random Reshuffling (RR), to tackle this problem.
We define ease-controlled modifications of IG/RR schemes, which require a light additional computational effort.
We test our implementation against both a full-batch gradient method (i.e., L-BFGS) and a standard implementation of IG/RR methods, showing that the algorithms require a similar computational effort.
arXiv Detail & Related papers (2022-12-04T15:26:36Z)
- Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with the communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z)
- A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks [11.461878019780597]
Gradient descent might converge slowly when training some deep neural networks.
It remains unclear whether the gradient clipping scheme can take advantage of multiple machines to enjoy a parallel speedup.
arXiv Detail & Related papers (2022-05-10T16:55:33Z)
- Gradient Coding with Dynamic Clustering for Straggler-Tolerant Distributed Learning [55.052517095437]
Gradient descent (GD) is widely employed to parallelize the learning task by distributing the dataset across multiple workers.
A significant performance bottleneck for the per-iteration completion time in distributed synchronous GD is straggling workers.
Coded distributed techniques have been introduced recently to mitigate stragglers and to speed up GD iterations by assigning redundant computations to workers.
We propose a novel dynamic GC scheme, which assigns redundant data to workers to provide the flexibility to choose from a set of possible codes depending on the past straggling behavior.
arXiv Detail & Related papers (2021-03-01T18:51:29Z)
- Gradient Coding with Dynamic Clustering for Straggler Mitigation [57.9123881133818]
GC-DC regulates the number of straggling workers in each cluster based on the straggler behavior in the previous iteration.
We numerically show that GC-DC provides significant improvements in the average completion time (of each iteration) with no increase in the communication load compared to the original GC scheme.
arXiv Detail & Related papers (2020-11-03T18:52:15Z)
- Gradient Centralization: A New Optimization Technique for Deep Neural Networks [74.935141515523]
Gradient centralization (GC) operates directly on gradients by centralizing the gradient vectors to have zero mean.
GC can be viewed as a projected gradient descent method with a constrained loss function.
GC is very simple to implement and can be easily embedded into existing gradient-based DNN optimizers with only one line of code, as sketched below.
arXiv Detail & Related papers (2020-04-03T10:25:00Z)
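The gradient centralization entry above refers to a one-line transform applied to the gradient itself, which is distinct from the gradient coding (GC) of the main paper. The sketch below is a hedged illustration assuming a PyTorch-style parameter layout (output dimension first) and a plain SGD update; the function name and shapes are hypothetical.

```python
import numpy as np

# Minimal sketch of gradient centralization: for each output unit, shift its
# gradient slice to zero mean before the optimizer step. Assumes an
# output-first parameter layout; all names are hypothetical.

def centralize_gradient(grad: np.ndarray) -> np.ndarray:
    """Subtract, for every output unit (first axis), the mean of its gradient
    entries over all remaining axes. 1-D parameters (biases) pass through."""
    if grad.ndim <= 1:
        return grad
    axes = tuple(range(1, grad.ndim))
    return grad - grad.mean(axis=axes, keepdims=True)

rng = np.random.default_rng(0)
g_fc = rng.standard_normal((8, 16))         # fully connected layer: (out, in)
g_conv = rng.standard_normal((4, 3, 3, 3))  # conv layer: (out, in, kH, kW)

for g in (g_fc, g_conv):
    gc = centralize_gradient(g)
    # Each output unit's centralized gradient now has (numerically) zero mean.
    assert np.allclose(gc.reshape(gc.shape[0], -1).mean(axis=1), 0.0)

# The "one line of code" mentioned in the summary: a plain SGD step becomes
#   w -= lr * centralize_gradient(grad)   instead of   w -= lr * grad
```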
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.