Gradient Coding with Dynamic Clustering for Straggler-Tolerant
Distributed Learning
- URL: http://arxiv.org/abs/2103.01206v1
- Date: Mon, 1 Mar 2021 18:51:29 GMT
- Title: Gradient Coding with Dynamic Clustering for Straggler-Tolerant
Distributed Learning
- Authors: Baturalp Buyukates and Emre Ozfatura and Sennur Ulukus and Deniz
Gunduz
- Abstract summary: Distributed gradient descent (GD) is widely employed to parallelize the learning task by distributing the dataset across multiple workers.
A significant performance bottleneck for the per-iteration completion time in distributed synchronous GD is straggling workers.
Coded distributed computation techniques have been introduced recently to mitigate stragglers and to speed up GD iterations by assigning redundant computations to workers.
We propose a novel dynamic GC scheme, which assigns redundant data to workers to acquire the flexibility to choose from among a set of possible codes depending on the past straggling behavior.
- Score: 55.052517095437
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distributed implementations are crucial in speeding up large scale machine
learning applications. Distributed gradient descent (GD) is widely employed to
parallelize the learning task by distributing the dataset across multiple
workers. A significant performance bottleneck for the per-iteration completion
time in distributed synchronous GD is straggling workers. Coded distributed
computation techniques have been introduced recently to mitigate stragglers and
to speed up GD iterations by assigning redundant computations to workers. In
this paper, we consider gradient coding (GC), and propose a novel dynamic GC
scheme, which assigns redundant data to workers to acquire the flexibility to
dynamically choose from among a set of possible codes depending on the past
straggling behavior. In particular, we consider GC with clustering, and
regulate the number of stragglers in each cluster by dynamically forming the
clusters at each iteration; hence, the proposed scheme is called GC with
dynamic clustering (GC-DC). Under a time-correlated straggling behavior,
GC-DC gains from adapting to the straggling behavior over time such that, at
each iteration, GC-DC aims at distributing the stragglers across clusters as
uniformly as possible based on the past straggler behavior. For both
homogeneous and heterogeneous worker models, we numerically show that GC-DC
provides significant improvements in the average per-iteration completion time
without an increase in the communication load compared to the original GC
scheme.
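The following is a minimal, illustrative Python sketch of the dynamic clustering step described above: based on which workers straggled in the previous iteration, workers are greedily spread across equal-sized clusters so that likely stragglers land in different clusters as uniformly as possible. The function and variable names (e.g., dynamic_clusters, straggled_last_iter) are hypothetical, and the sketch omits the gradient code applied within each cluster as well as the fixed redundant data placement that the actual GC-DC scheme must respect.

```python
import numpy as np

def dynamic_clusters(straggled_last_iter, num_clusters, rng=None):
    """Greedy re-clustering heuristic (illustrative only).

    Spreads workers that straggled in the previous iteration as uniformly
    as possible across equal-sized clusters, placing each worker into the
    currently least-loaded cluster that still has room.

    straggled_last_iter: boolean array of length n (True = straggled).
    Returns a list of `num_clusters` lists of worker indices.
    """
    rng = rng or np.random.default_rng(0)
    n = len(straggled_last_iter)
    assert n % num_clusters == 0, "equal-sized clusters assumed"
    cluster_size = n // num_clusters

    stragglers = [w for w in range(n) if straggled_last_iter[w]]
    others = [w for w in range(n) if not straggled_last_iter[w]]
    rng.shuffle(stragglers)
    rng.shuffle(others)

    clusters = [[] for _ in range(num_clusters)]
    # Place likely stragglers first so they end up in different clusters,
    # then fill the remaining slots with the non-straggling workers.
    for w in stragglers + others:
        target = min((c for c in clusters if len(c) < cluster_size), key=len)
        target.append(w)
    return clusters

# Example: 12 workers, 3 clusters; workers 2, 5, and 7 straggled last iteration.
straggled = np.zeros(12, dtype=bool)
straggled[[2, 5, 7]] = True
print(dynamic_clusters(straggled, num_clusters=3))
```

Roughly speaking, a gradient code with redundancy r applied inside each cluster tolerates up to r-1 stragglers per cluster, so balancing the expected straggler count across clusters, rather than letting stragglers concentrate in one cluster, reduces the chance that any single cluster exceeds its tolerance and delays the iteration.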
Related papers
- Rethinking and Accelerating Graph Condensation: A Training-Free Approach with Class Partition [56.26113670151363]
Graph condensation is a data-centric solution to replace the large graph with a small yet informative condensed graph.
Existing GC methods suffer from intricate optimization processes, necessitating excessive computing resources.
We propose a training-free GC framework termed Class-partitioned Graph Condensation (CGC).
CGC achieves state-of-the-art performance with a more efficient condensation process.
arXiv Detail & Related papers (2024-05-22T14:57:09Z) - ABS-SGD: A Delayed Synchronous Stochastic Gradient Descent Algorithm
with Adaptive Batch Size for Heterogeneous GPU Clusters [9.885668723959125]
We propose a delayed synchronous distributed gradient descent algorithm with adaptive batch size (ABS-SGD) for heterogeneous GPU clusters.
In ABS-SGD, workers perform global synchronization to accumulate delayed gradients and use the accumulated delayed gradients to update parameters.
Extensive experiments in three types of heterogeneous clusters demonstrate that ABS-SGD can make full use of computational resources.
arXiv Detail & Related papers (2023-08-29T09:46:52Z) - Reinforcement Graph Clustering with Unknown Cluster Number [91.4861135742095]
We propose a new deep graph clustering method termed Reinforcement Graph Clustering.
In our proposed method, cluster number determination and unsupervised representation learning are unified into a single framework.
In order to conduct feedback actions, the clustering-oriented reward function is proposed to enhance the cohesion of the same clusters and separate the different clusters.
arXiv Detail & Related papers (2023-08-13T18:12:28Z) - Sequential Gradient Coding For Straggler Mitigation [28.090458692750023]
In distributed computing, slower nodes (stragglers) usually become a bottleneck.
Gradient Coding (GC) is an efficient technique that uses principles of error-correcting codes to distribute gradient computation in the presence of stragglers.
We propose two schemes that demonstrate improved performance compared to GC.
arXiv Detail & Related papers (2022-11-24T21:12:49Z) - Rethinking and Scaling Up Graph Contrastive Learning: An Extremely
Efficient Approach with Group Discrimination [87.07410882094966]
Graph contrastive learning (GCL) alleviates the heavy reliance on label information for graph representation learning (GRL).
We introduce a new learning paradigm for self-supervised GRL, namely, Group Discrimination (GD).
Instead of similarity computation, GGD directly discriminates two groups of summarised node instances with a simple binary cross-entropy loss.
In addition, GGD requires much fewer training epochs to obtain competitive performance compared with GCL methods on large-scale datasets.
arXiv Detail & Related papers (2022-06-03T12:32:47Z) - CGC: Contrastive Graph Clustering for Community Detection and Tracking [33.48636823444052]
We develop CGC, a novel end-to-end framework for graph clustering.
CGC learns node embeddings and cluster assignments in a contrastive graph learning framework.
We extend CGC for time-evolving data, where temporal graph clustering is performed in an incremental learning fashion.
arXiv Detail & Related papers (2022-04-05T17:34:47Z) - Gradient Coding with Dynamic Clustering for Straggler Mitigation [57.9123881133818]
GC-DC regulates the number of straggling workers in each cluster based on the straggler behavior in the previous iteration.
We numerically show that GC-DC provides significant improvements in the average completion time (of each iteration) with no increase in the communication load compared to the original GC scheme.
arXiv Detail & Related papers (2020-11-03T18:52:15Z) - Online Deep Clustering for Unsupervised Representation Learning [108.33534231219464]
Online Deep Clustering (ODC) performs clustering and network update simultaneously rather than alternatingly.
We design and maintain two dynamic memory modules, i.e., a samples memory to store sample labels and features, and a centroids memory for centroids evolution.
In this way, labels and the network evolve shoulder-to-shoulder rather than alternatingly.
arXiv Detail & Related papers (2020-06-18T16:15:46Z) - DaSGD: Squeezing SGD Parallelization Performance in Distributed Training
Using Delayed Averaging [4.652668321425679]
The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagations while gradients are exchanged and averaged.
DaSGD parallelizes SGD and forward/back propagations to hide 100% of the communication overhead.
arXiv Detail & Related papers (2020-05-31T05:43:50Z)