Straggler-Resilient Distributed Machine Learning with Dynamic Backup
Workers
- URL: http://arxiv.org/abs/2102.06280v1
- Date: Thu, 11 Feb 2021 21:39:53 GMT
- Title: Straggler-Resilient Distributed Machine Learning with Dynamic Backup
Workers
- Authors: Guojun Xiong, Gang Yan, Rahul Singh, Jian Li
- Abstract summary: We propose a fully distributed algorithm to determine the number of backup workers for each worker.
Our algorithm achieves a linear speedup for convergence (i.e., convergence performance increases linearly with respect to the number of workers)
- Score: 9.919012793724628
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the increasing demand for large-scale training of machine learning
models, consensus-based distributed optimization methods have recently been
advocated as alternatives to the popular parameter server framework. In this
paradigm, each worker maintains a local estimate of the optimal parameter
vector, and iteratively updates it by waiting for and averaging all estimates
obtained from its neighbors, and then corrects it on the basis of its local
dataset. However, the synchronization phase can be time consuming due to the
need to wait for \textit{stragglers}, i.e., slower workers. An efficient way to
mitigate this effect is to let each worker wait only for updates from the
fastest neighbors before updating its local parameter. The remaining neighbors
are called \textit{backup workers}. To minimize the global training time over
the network, we propose a fully distributed algorithm to dynamically determine
the number of backup workers for each worker. We show that our algorithm
achieves a linear speedup for convergence (i.e., convergence performance
increases linearly with respect to the number of workers). We conduct extensive
experiments on MNIST and CIFAR-10 to verify our theoretical results.
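The wait-for-the-fastest-neighbors mechanism described in the abstract can be made concrete with a small sketch. The code below is not the authors' algorithm; it is a minimal illustration, under assumed names such as `consensus_step`, `num_backup`, and `local_gradient`, of one worker averaging only the estimates received from its fastest neighbors (the slow remainder being that round's backup workers) and then correcting the result with a gradient step on its local data. The paper's actual contribution, the distributed rule that dynamically chooses the number of backup workers per worker, is deliberately left out.

```python
# Minimal sketch (not the authors' implementation) of a decentralized update in
# which a worker waits only for its fastest neighbors and treats the rest as
# backup workers. All names here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def local_gradient(x):
    # Placeholder for the gradient on the worker's local dataset
    # (gradient of a simple quadratic, for illustration only).
    return 2.0 * x

def consensus_step(x_local, neighbor_estimates, neighbor_delays, num_backup, lr=0.1):
    """One update for a single worker.

    x_local            : current local parameter estimate (np.ndarray)
    neighbor_estimates : dict neighbor_id -> parameter estimate
    neighbor_delays    : dict neighbor_id -> observed response time
    num_backup         : how many of the slowest neighbors to skip this round
    """
    # Wait only for the fastest (len(neighbors) - num_backup) neighbors;
    # the remaining, slowest ones are the "backup workers" for this round.
    ranked = sorted(neighbor_delays, key=neighbor_delays.get)
    waited_for = ranked[: len(ranked) - num_backup]

    # Average the local estimate with the estimates actually received ...
    received = [neighbor_estimates[j] for j in waited_for]
    averaged = np.mean([x_local] + received, axis=0)

    # ... then correct it with a gradient step on the local dataset.
    return averaged - lr * local_gradient(averaged)

# Toy usage: one worker with 4 neighbors, skipping its 2 slowest neighbors.
x = np.array([1.0, -2.0])
estimates = {j: x + rng.normal(scale=0.1, size=2) for j in range(4)}
delays = {j: rng.exponential() for j in range(4)}
print(consensus_step(x, estimates, delays, num_backup=2))
```

In the paper, the choice of `num_backup` trades off per-iteration waiting time against how much neighbor information is averaged; the sketch takes it as a fixed input.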
Related papers
- DASA: Delay-Adaptive Multi-Agent Stochastic Approximation [64.32538247395627]
We consider a setting in which $N$ agents aim to speed up a common Stochastic Approximation problem by acting in parallel and communicating with a central server.
To mitigate the effect of delays and stragglers, we propose DASA, a Delay-Adaptive algorithm for multi-agent Stochastic Approximation.
arXiv Detail & Related papers (2024-03-25T22:49:56Z) - Timely Asynchronous Hierarchical Federated Learning: Age of Convergence [59.96266198512243]
We consider an asynchronous hierarchical federated learning setting with a client-edge-cloud framework.
The clients exchange the trained parameters with their corresponding edge servers, which update the locally aggregated model.
The goal of each client is to converge to the global model, while maintaining timeliness of the clients.
arXiv Detail & Related papers (2023-06-21T17:39:16Z) - Straggler-Resilient Decentralized Learning via Adaptive Asynchronous Updates [28.813671194939225]
Fully decentralized optimization methods have been advocated as alternatives to the popular parameter server framework.
We propose a fully decentralized algorithm with adaptive asynchronous updates, which adaptively determines the number of neighbor workers each worker communicates with.
We show that DSGD-AAU achieves a linear speedup for convergence and demonstrate its effectiveness via extensive experiments.
arXiv Detail & Related papers (2023-06-11T02:08:59Z) - Fast and Straggler-Tolerant Distributed SGD with Reduced Computation
Load [11.069252535469644]
In distributed optimization procedures such as stochastic gradient descent (SGD), the effect of unresponsive or slow workers, called stragglers, can be mitigated.
This is done by waiting for only a subset of the workers to finish their computation at each iteration of the algorithm (a minimal sketch of this wait-for-the-fastest-subset pattern appears after this list).
We construct a novel scheme that adapts both the number of workers and the computation load throughout the run-time of the algorithm.
arXiv Detail & Related papers (2023-04-17T20:12:18Z) - STSyn: Speeding Up Local SGD with Straggler-Tolerant Synchronization [14.526055067546507]
Synchronous local SGD suffers from random delays caused by slow and straggling workers, since each round waits for every worker to complete the same number of local updates, leaving faster workers idle.
In this paper, to mitigate stragglers and improve communication efficiency, a novel local SGD system strategy, named STSyn, is developed.
arXiv Detail & Related papers (2022-10-06T08:04:20Z) - Acceleration of Federated Learning with Alleviated Forgetting in Local
Training [61.231021417674235]
Federated learning (FL) enables distributed optimization of machine learning models while protecting privacy.
We propose FedReg, an algorithm to accelerate FL with alleviated knowledge forgetting in the local training stage.
Our experiments demonstrate that FedReg significantly improves the convergence rate of FL, especially when the neural network architecture is deep.
arXiv Detail & Related papers (2022-03-05T02:31:32Z) - Faster Non-Convex Federated Learning via Global and Local Momentum [57.52663209739171]
FedGLOMO is the first (first-order) FL algorithm to combine global (server-side) and local (client-side) momentum.
Our algorithm is provably efficient even with compressed communication between the clients and the server.
arXiv Detail & Related papers (2020-12-07T21:05:31Z) - Communication-Efficient Distributed Stochastic AUC Maximization with
Deep Neural Networks [50.42141893913188]
We study distributed algorithms for large-scale AUC maximization with a deep neural network as the predictive model.
Our method requires a much smaller number of communication rounds while retaining its theoretical convergence guarantees.
Experiments on several datasets demonstrate the effectiveness of our method and confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z) - Dynamic backup workers for parallel machine learning [10.813576865492767]
We propose an algorithm that dynamically decides the number of backup workers during the training process to maximize the convergence speed at each iteration.
Our experiments show that DBW 1) removes the necessity to tune $b$ by preliminary time-consuming experiments, and 2) makes the training up to a factor $3$ faster than the optimal static configuration.
arXiv Detail & Related papers (2020-04-30T11:25:00Z) - Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of
Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)
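Several of the entries above (Fast and Straggler-Tolerant Distributed SGD, STSyn) rely on the same basic idea as the main paper: at each iteration, proceed once the fastest k of n workers have responded instead of waiting for all of them. The sketch below is an illustrative, simulated parameter-server version of that pattern, not code from any of the listed papers; names such as `server_step` and `worker_gradient`, the thread-based simulation, and the delay model are assumptions.

```python
# Minimal sketch (under assumed names) of the wait-for-the-fastest-subset
# pattern: the server updates using only the first k of n worker gradients
# to arrive at each iteration.
import random
import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor, as_completed

def worker_gradient(x, seed):
    # Simulate heterogeneous compute/communication delay (straggling),
    # then return a noisy gradient of a toy quadratic objective.
    random.seed(seed)
    time.sleep(random.uniform(0.0, 0.05))
    return 2.0 * x + np.random.default_rng(seed).normal(scale=0.1, size=x.shape)

def server_step(x, num_workers=8, k=5, lr=0.1):
    """Update x using the k fastest of num_workers gradient computations."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(worker_gradient, x, s) for s in range(num_workers)]
        grads = []
        for fut in as_completed(futures):   # yields results in completion order
            grads.append(fut.result())
            if len(grads) == k:             # stop collecting once k have arrived
                break
    return x - lr * np.mean(grads, axis=0)

# Toy usage: run a few straggler-tolerant iterations.
x = np.array([1.0, -2.0])
for _ in range(20):
    x = server_step(x)
print(x)
```

In this toy simulation the executor still joins the slow threads when the `with` block exits; a real system would instead discard the late gradients or apply them as stale updates.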
This list is automatically generated from the titles and abstracts of the papers in this site.