STSyn: Speeding Up Local SGD with Straggler-Tolerant Synchronization
- URL: http://arxiv.org/abs/2210.03521v3
- Date: Mon, 29 May 2023 11:58:57 GMT
- Title: STSyn: Speeding Up Local SGD with Straggler-Tolerant Synchronization
- Authors: Feng Zhu, Jingjing Zhang and Xin Wang
- Abstract summary: Synchronous local SGD suffers from some workers being idle and from random delays caused by slow and straggling workers, as it waits for all workers to complete the same number of local updates.
In this paper, to mitigate stragglers and improve communication efficiency, a novel local SGD strategy, named STSyn, is developed.
- Score: 14.526055067546507
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synchronous local stochastic gradient descent (local SGD) suffers from some
workers being idle and from random delays due to slow and straggling workers, as it
waits for the workers to complete the same amount of local updates. In this
paper, to mitigate stragglers and improve communication efficiency, a novel
local SGD strategy, named STSyn, is developed. The key point is to wait for the
$K$ fastest workers, while keeping all the workers computing continually at
each synchronization round, and making full use of any effective (completed)
local update of each worker regardless of stragglers. An analysis of the
average wall-clock time, average number of local updates and average number of
uploading workers per round is provided to gauge the performance of STSyn. The
convergence of STSyn is also rigorously established even when the objective
function is nonconvex. Experimental results show the superiority of the
proposed STSyn over state-of-the-art schemes, owing to its straggler-tolerant
technique and the additional effective local updates at each worker; the
influence of system parameters is also studied. By waiting for
faster workers and allowing heterogeneous synchronization with different
numbers of local updates across workers, STSyn provides substantial
improvements both in time and communication efficiency.
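To make the waiting rule concrete, below is a minimal, hypothetical Python simulation of an STSyn-style round: the server waits until the K fastest workers have finished a target number U of local updates, every worker keeps computing until that moment, and every completed local update is folded into the aggregate. The exponential timing model, the grad_fn interface, and aggregation by simple model averaging are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def stsyn_round(x, workers, K, U, lr=0.1):
    """One STSyn-style round (sketch): wait for the K fastest workers to finish
    U local updates; every worker keeps computing and uploads whatever it completed."""
    # Random per-update compute times model slow and straggling workers.
    times = [np.random.exponential(1.0, size=8 * U) for _ in workers]
    finish_U = [t[:U].sum() for t in times]       # time for each worker to finish U updates
    round_time = sorted(finish_U)[K - 1]          # round ends when the K-th fastest is done

    new_models, n_uploads = [], 0
    for t, grad_fn in zip(times, workers):
        # Number of local updates this worker completed within the round time.
        done = int(np.searchsorted(np.cumsum(t), round_time, side="right"))
        if done == 0:
            continue                              # nothing effective to upload
        x_local = x.copy()
        for _ in range(done):                     # use every completed local update
            x_local -= lr * grad_fn(x_local)
        new_models.append(x_local)
        n_uploads += 1

    # The server averages all uploaded models, however many updates each worker ran.
    return np.mean(new_models, axis=0), round_time, n_uploads
```

Under this simplified model, the wall-clock time of a round is set by the K-th fastest worker, while slower workers still contribute however many effective updates they completed, mirroring the time and communication quantities analyzed in the abstract.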
Related papers
- Dual-Delayed Asynchronous SGD for Arbitrarily Heterogeneous Data [22.917944307972434]
We consider the distributed learning problem with data across multiple workers under the orchestration of a central server.
We introduce the dual-delayed asynchronous SGD (DuDe-ASGD) algorithm, designed to mitigate the adverse effects of data heterogeneity (illustrated by a sketch after this list).
DuDe-ASGD makes full use of stale gradients from all workers during asynchronous training, which leads to two time lags in the model parameters and data samples utilized in the server's iterations.
arXiv Detail & Related papers (2024-05-27T09:00:30Z) - Freya PAGE: First Optimal Time Complexity for Large-Scale Nonconvex Finite-Sum Optimization with Heterogeneous Asynchronous Computations [92.1840862558718]
In practical distributed systems, workers are typically not homogeneous and can have highly varying processing times.
We introduce a new parallel method, Freya PAGE, designed to handle arbitrarily slow computations.
We show that Freya offers significantly improved complexity guarantees compared to all previous methods.
arXiv Detail & Related papers (2024-05-24T13:33:30Z) - Shadowheart SGD: Distributed Asynchronous SGD with Optimal Time Complexity Under Arbitrary Computation and Communication Heterogeneity [85.92481138826949]
We develop a new method-Shadowheart SGD-that provably improves the time complexities of all previous centralized methods.
We also consider the bidirectional setup, where broadcasting from the server to the workers is non-negligible, and develop a corresponding method.
arXiv Detail & Related papers (2024-02-07T12:15:56Z) - Asynchronous Local-SGD Training for Language Modeling [37.02427878640653]
Local stochastic gradient descent (Local-SGD) is an approach to distributed optimization where each device performs more than one SGD update per round of communication.
This work presents an empirical study of asynchronous Local-SGD for training language models; that is, each worker updates the global parameters as soon as it has finished its SGD steps (a worker loop in this spirit is sketched after this list).
arXiv Detail & Related papers (2024-01-17T11:17:04Z) - AsGrad: A Sharp Unified Analysis of Asynchronous-SGD Algorithms [45.90015262911875]
We analyze asynchronous-type algorithms for distributed SGD in the heterogeneous setting.
As a by-product of our analysis, we also demonstrate guarantees for gradient-type algorithms such as SGD with random reshuffling.
arXiv Detail & Related papers (2023-10-31T13:44:53Z) - Straggler-Resilient Decentralized Learning via Adaptive Asynchronous Updates [28.813671194939225]
Fully decentralized optimization methods have been advocated as alternatives to the popular parameter-server framework.
We propose DSGD-AAU, a fully decentralized algorithm with adaptive asynchronous updates that adaptively determines the number of neighbor workers each worker communicates with.
We show that DSGD-AAU achieves a linear speedup for convergence and demonstrate its effectiveness via extensive experiments.
arXiv Detail & Related papers (2023-06-11T02:08:59Z) - FedSpeed: Larger Local Interval, Less Communication Round, and Higher
Generalization Accuracy [84.45004766136663]
Federated learning is an emerging distributed machine learning framework.
It suffers from non-vanishing biases introduced by locally inconsistent optima and from client drift caused by local over-fitting.
We propose a novel and practical method, FedSpeed, to alleviate the negative impacts posed by these problems.
arXiv Detail & Related papers (2023-02-21T03:55:29Z) - Straggler-Resilient Distributed Machine Learning with Dynamic Backup
Workers [9.919012793724628]
We propose a fully distributed algorithm to determine the number of backup workers for each worker.
Our algorithm achieves a linear speedup for convergence (i.e., convergence performance increases linearly with the number of workers).
arXiv Detail & Related papers (2021-02-11T21:39:53Z) - Communication-Efficient Robust Federated Learning Over Heterogeneous
Datasets [147.11434031193164]
This work investigates fault-resilient federated learning when the data samples are non-uniformly distributed across workers.
In the presence of adversarially faulty workers who may strategically corrupt datasets, the local messages exchanged can be unreliable.
The present work introduces a fault-resilient proximal gradient (FRPG) algorithm that relies on Nesterov's acceleration technique.
For strongly convex loss functions, FRPG and its local variant LFRPG have provably faster convergence rates than a benchmark robust aggregation algorithm.
arXiv Detail & Related papers (2020-06-17T16:50:33Z) - Variance Reduced Local SGD with Lower Communication Complexity [52.44473777232414]
We propose Variance Reduced Local SGD (VRL-SGD) to further reduce the communication complexity.
VRL-SGD achieves a linear iteration speedup with a lower communication complexity $O(T^{\frac{1}{2}} N^{\frac{3}{2}})$ even if workers access non-identical datasets.
arXiv Detail & Related papers (2019-12-30T08:15:21Z)
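As a rough illustration of the stale-gradient aggregation described in the DuDe-ASGD entry above, the sketch below keeps the most recent (possibly stale) gradient reported by every worker and averages all of them at each server step, so both the model copy and the data sample behind each buffered gradient may lag the current iterate. The buffer layout, step size, and update rule are assumptions made for illustration, not the paper's exact algorithm.

```python
import numpy as np

def dual_delayed_step(x, grad_buffer, arriving, lr=0.05):
    """One hypothetical server step that reuses the latest (possibly stale)
    gradient from every worker; assumes each worker has reported at least once.
    grad_buffer and arriving map worker_id -> gradient (numpy arrays)."""
    grad_buffer.update(arriving)                      # refresh only workers that just reported
    g = np.mean(list(grad_buffer.values()), axis=0)   # use stale gradients from all workers
    return x - lr * g
```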
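Similarly, the asynchronous Local-SGD entry above can be pictured with the hypothetical worker loop below: each worker pulls the current global model, runs H local SGD steps on its own data, and pushes its delta back as soon as it finishes, without waiting for other workers. The server.pull/server.push_delta interface and grad_fn are stand-ins, not an API from the paper.

```python
import numpy as np

def async_local_sgd_worker(server, grad_fn, data_iter, H=16, lr=0.01, rounds=100):
    """Hypothetical worker loop: run H local SGD steps, then push the delta
    immediately; `server` is assumed to expose pull() and push_delta()."""
    for _ in range(rounds):
        x = server.pull()                        # latest global parameters
        x_local = np.array(x, copy=True)
        for _ in range(H):                       # H local SGD steps on this worker's data
            x_local -= lr * grad_fn(x_local, next(data_iter))
        server.push_delta(x_local - x)           # contribute without waiting for others
```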