Straggler-Resilient Distributed Machine Learning with Dynamic Backup
Workers
- URL: http://arxiv.org/abs/2102.06280v1
- Date: Thu, 11 Feb 2021 21:39:53 GMT
- Title: Straggler-Resilient Distributed Machine Learning with Dynamic Backup
Workers
- Authors: Guojun Xiong, Gang Yan, Rahul Singh, Jian Li
- Abstract summary: We propose a fully distributed algorithm to determine the number of backup workers for each worker.
Our algorithm achieves a linear speedup for convergence (i.e., convergence performance increases linearly with respect to the number of workers)
- Score: 9.919012793724628
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the increasing demand for large-scale training of machine learning
models, consensus-based distributed optimization methods have recently been
advocated as alternatives to the popular parameter server framework. In this
paradigm, each worker maintains a local estimate of the optimal parameter
vector, and iteratively updates it by waiting for and averaging all estimates
obtained from its neighbors, and then corrects it on the basis of its local
dataset. However, the synchronization phase can be time consuming due to the
need to wait for \textit{stragglers}, i.e., slower workers. An efficient way to
mitigate this effect is to let each worker wait only for updates from the
fastest neighbors before updating its local parameter. The remaining neighbors
are called \textit{backup workers}. To minimize the global training time over
the network, we propose a fully distributed algorithm to dynamically determine
the number of backup workers for each worker. We show that our algorithm
achieves a linear speedup for convergence (i.e., convergence performance
increases linearly with respect to the number of workers). We conduct extensive
experiments on MNIST and CIFAR-10 to verify our theoretical results.
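The wait-for-the-fastest-neighbors mechanism described in the abstract can be made concrete with a small sketch. The code below is not the authors' algorithm; it is a minimal illustration, under assumed names such as `consensus_step`, `num_backup`, and `local_gradient`, of one worker averaging only the estimates received from its fastest neighbors (the slow remainder being that round's backup workers) and then correcting the result with a gradient step on its local data. The paper's actual contribution, the distributed rule that dynamically chooses the number of backup workers per worker, is deliberately left out.

```python
# Minimal sketch (not the authors' implementation) of a decentralized update in
# which a worker waits only for its fastest neighbors and treats the rest as
# backup workers. All names here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def local_gradient(x):
    # Placeholder for the gradient on the worker's local dataset
    # (gradient of a simple quadratic, for illustration only).
    return 2.0 * x

def consensus_step(x_local, neighbor_estimates, neighbor_delays, num_backup, lr=0.1):
    """One update for a single worker.

    x_local            : current local parameter estimate (np.ndarray)
    neighbor_estimates : dict neighbor_id -> parameter estimate
    neighbor_delays    : dict neighbor_id -> observed response time
    num_backup         : how many of the slowest neighbors to skip this round
    """
    # Wait only for the fastest (len(neighbors) - num_backup) neighbors;
    # the remaining, slowest ones are the "backup workers" for this round.
    ranked = sorted(neighbor_delays, key=neighbor_delays.get)
    waited_for = ranked[: len(ranked) - num_backup]

    # Average the local estimate with the estimates actually received ...
    received = [neighbor_estimates[j] for j in waited_for]
    averaged = np.mean([x_local] + received, axis=0)

    # ... then correct it with a gradient step on the local dataset.
    return averaged - lr * local_gradient(averaged)

# Toy usage: one worker with 4 neighbors, skipping its 2 slowest neighbors.
x = np.array([1.0, -2.0])
estimates = {j: x + rng.normal(scale=0.1, size=2) for j in range(4)}
delays = {j: rng.exponential() for j in range(4)}
print(consensus_step(x, estimates, delays, num_backup=2))
```

In the paper, the choice of `num_backup` trades off per-iteration waiting time against how much neighbor information is averaged; the sketch takes it as a fixed input.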
Related papers
- DASA: Delay-Adaptive Multi-Agent Stochastic Approximation [64.32538247395627]
We consider a setting in which $N$ agents aim to speed up a common Stochastic Approximation problem by acting in parallel and communicating with a central server.
To mitigate the effect of delays and stragglers, we propose DASA, a Delay-Adaptive algorithm for multi-agent Stochastic Approximation.
arXiv Detail & Related papers (2024-03-25T22:49:56Z) - Timely Asynchronous Hierarchical Federated Learning: Age of Convergence [59.96266198512243]
We consider an asynchronous hierarchical federated learning setting with a client-edge-cloud framework.
The clients exchange the trained parameters with their corresponding edge servers, which update the locally aggregated model.
The goal of each client is to converge to the global model, while maintaining timeliness of the clients.
arXiv Detail & Related papers (2023-06-21T17:39:16Z) - Straggler-Resilient Decentralized Learning via Adaptive Asynchronous Updates [28.813671194939225]
Fully decentralized optimization methods have been advocated as alternatives to the popular parameter server framework.
We propose a fully decentralized algorithm with adaptive asynchronous updates, which adaptively determines the number of neighbor workers each worker communicates with.
We show that DSGD-AAU achieves a linear speedup for convergence and demonstrate its effectiveness via extensive experiments.
arXiv Detail & Related papers (2023-06-11T02:08:59Z) - Fast and Straggler-Tolerant Distributed SGD with Reduced Computation
Load [11.069252535469644]
In distributed optimization procedures such as stochastic gradient descent (SGD), the effect of unresponsive or slow workers, called stragglers, can be mitigated.
This is done by waiting for only a subset of the workers to finish their computation at each iteration of the algorithm (a minimal sketch of this wait-for-the-fastest-subset pattern appears after this list).
We construct a novel scheme that adapts both the number of workers and the computation load throughout the run-time of the algorithm.
arXiv Detail & Related papers (2023-04-17T20:12:18Z) - STSyn: Speeding Up Local SGD with Straggler-Tolerant Synchronization [14.526055067546507]
Synchronous local SGD suffers from random delays caused by slow and straggling workers, since each round waits for every worker to complete the same number of local updates, leaving faster workers idle.
In this paper, to mitigate stragglers and improve communication efficiency, a novel local SGD system strategy, named STSyn, is developed.
arXiv Detail & Related papers (2022-10-06T08:04:20Z) - Acceleration of Federated Learning with Alleviated Forgetting in Local
Training [61.231021417674235]
Federated learning (FL) enables distributed optimization of machine learning models while protecting privacy.
We propose FedReg, an algorithm to accelerate FL with alleviated knowledge forgetting in the local training stage.
Our experiments demonstrate that FedReg significantly improves the convergence rate of FL, especially when the neural network architecture is deep.
arXiv Detail & Related papers (2022-03-05T02:31:32Z) - Faster Non-Convex Federated Learning via Global and Local Momentum [57.52663209739171]
FedGLOMO is the first (first-order) FL algorithm to combine global (server-side) and local (client-side) momentum.
Our algorithm is provably efficient even with compressed communication between the clients and the server.
arXiv Detail & Related papers (2020-12-07T21:05:31Z) - Communication-Efficient Distributed Stochastic AUC Maximization with
Deep Neural Networks [50.42141893913188]
We study distributed algorithms for large-scale AUC maximization with a deep neural network as the predictive model.
Our method requires a much smaller number of communication rounds while retaining its theoretical convergence guarantees.
Experiments on several datasets demonstrate the effectiveness of our method and confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z) - Dynamic backup workers for parallel machine learning [10.813576865492767]
We propose an algorithm that dynamically decides the number of backup workers during the training process to maximize the convergence speed at each iteration.
Our experiments show that DBW 1) removes the necessity to tune $b$ by preliminary time-consuming experiments, and 2) makes the training up to a factor $3$ faster than the optimal static configuration.
arXiv Detail & Related papers (2020-04-30T11:25:00Z) - Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of
Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)
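Several of the entries above (Fast and Straggler-Tolerant Distributed SGD, STSyn) rely on the same basic idea as the main paper: at each iteration, proceed once the fastest k of n workers have responded instead of waiting for all of them. The sketch below is an illustrative, simulated parameter-server version of that pattern, not code from any of the listed papers; names such as `server_step` and `worker_gradient`, the thread-based simulation, and the delay model are assumptions.

```python
# Minimal sketch (under assumed names) of the wait-for-the-fastest-subset
# pattern: the server updates using only the first k of n worker gradients
# to arrive at each iteration.
import random
import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor, as_completed

def worker_gradient(x, seed):
    # Simulate heterogeneous compute/communication delay (straggling),
    # then return a noisy gradient of a toy quadratic objective.
    random.seed(seed)
    time.sleep(random.uniform(0.0, 0.05))
    return 2.0 * x + np.random.default_rng(seed).normal(scale=0.1, size=x.shape)

def server_step(x, num_workers=8, k=5, lr=0.1):
    """Update x using the k fastest of num_workers gradient computations."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(worker_gradient, x, s) for s in range(num_workers)]
        grads = []
        for fut in as_completed(futures):   # yields results in completion order
            grads.append(fut.result())
            if len(grads) == k:             # stop collecting once k have arrived
                break
    return x - lr * np.mean(grads, axis=0)

# Toy usage: run a few straggler-tolerant iterations.
x = np.array([1.0, -2.0])
for _ in range(20):
    x = server_step(x)
print(x)
```

In this toy simulation the executor still joins the slow threads when the `with` block exits; a real system would instead discard the late gradients or apply them as stale updates.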
This list is automatically generated from the titles and abstracts of the papers in this site.