STSyn: Speeding Up Local SGD with Straggler-Tolerant Synchronization
- URL: http://arxiv.org/abs/2210.03521v3
- Date: Mon, 29 May 2023 11:58:57 GMT
- Title: STSyn: Speeding Up Local SGD with Straggler-Tolerant Synchronization
- Authors: Feng Zhu, Jingjing Zhang and Xin Wang
- Abstract summary: Synchronous local SGD suffers from some workers being idle and from random delays caused by slow and straggling workers, as it waits for all workers to complete the same number of local updates.
In this paper, to mitigate stragglers and improve communication efficiency, a novel local SGD strategy, named STSyn, is developed.
- Score: 14.526055067546507
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synchronous local stochastic gradient descent (local SGD) suffers from some
workers being idle and from random delays due to slow and straggling workers, as it
waits for the workers to complete the same amount of local updates. In this
paper, to mitigate stragglers and improve communication efficiency, a novel
local SGD strategy, named STSyn, is developed. The key point is to wait for the
$K$ fastest workers, while keeping all the workers computing continually at
each synchronization round, and making full use of any effective (completed)
local update of each worker regardless of stragglers. An analysis of the
average wall-clock time, average number of local updates and average number of
uploading workers per round is provided to gauge the performance of STSyn. The
convergence of STSyn is also rigorously established even when the objective
function is nonconvex. Experimental results show the superiority of the
proposed STSyn over state-of-the-art schemes, owing to its straggler-tolerant
technique and the additional effective local updates at each worker; the
influence of system parameters is also studied. By waiting for
faster workers and allowing heterogeneous synchronization with different
numbers of local updates across workers, STSyn provides substantial
improvements both in time and communication efficiency.
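To make the waiting rule concrete, below is a minimal, hypothetical Python simulation of an STSyn-style round: the server waits until the K fastest workers have finished a target number U of local updates, every worker keeps computing until that moment, and every completed local update is folded into the aggregate. The exponential timing model, the grad_fn interface, and aggregation by simple model averaging are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def stsyn_round(x, workers, K, U, lr=0.1):
    """One STSyn-style round (sketch): wait for the K fastest workers to finish
    U local updates; every worker keeps computing and uploads whatever it completed."""
    # Random per-update compute times model slow and straggling workers.
    times = [np.random.exponential(1.0, size=8 * U) for _ in workers]
    finish_U = [t[:U].sum() for t in times]       # time for each worker to finish U updates
    round_time = sorted(finish_U)[K - 1]          # round ends when the K-th fastest is done

    new_models, n_uploads = [], 0
    for t, grad_fn in zip(times, workers):
        # Number of local updates this worker completed within the round time.
        done = int(np.searchsorted(np.cumsum(t), round_time, side="right"))
        if done == 0:
            continue                              # nothing effective to upload
        x_local = x.copy()
        for _ in range(done):                     # use every completed local update
            x_local -= lr * grad_fn(x_local)
        new_models.append(x_local)
        n_uploads += 1

    # The server averages all uploaded models, however many updates each worker ran.
    return np.mean(new_models, axis=0), round_time, n_uploads
```

Under this simplified model, the wall-clock time of a round is set by the K-th fastest worker, while slower workers still contribute however many effective updates they completed, mirroring the time and communication quantities analyzed in the abstract.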
Related papers
- Dual-Delayed Asynchronous SGD for Arbitrarily Heterogeneous Data [22.917944307972434]
We consider the distributed learning problem with data across multiple workers under the orchestration of a central server.
We introduce the dual-delayed asynchronous SGD (DuDe-ASGD) algorithm, designed to mitigate the adverse effects of data heterogeneity (illustrated by a sketch after this list).
DuDe-ASGD makes full use of stale gradients from all workers during asynchronous training, which leads to two time lags in the model parameters and data samples utilized in the server's iterations.
arXiv Detail & Related papers (2024-05-27T09:00:30Z) - Freya PAGE: First Optimal Time Complexity for Large-Scale Nonconvex Finite-Sum Optimization with Heterogeneous Asynchronous Computations [92.1840862558718]
In practical distributed systems, workers are typically not homogeneous and can have highly varying processing times.
We introduce a new parallel method, Freya PAGE, designed to handle arbitrarily slow computations.
We show that Freya offers significantly improved complexity guarantees compared to all previous methods.
arXiv Detail & Related papers (2024-05-24T13:33:30Z) - Shadowheart SGD: Distributed Asynchronous SGD with Optimal Time Complexity Under Arbitrary Computation and Communication Heterogeneity [85.92481138826949]
We develop a new method-Shadowheart SGD-that provably improves the time complexities of all previous centralized methods.
We also consider the bidirectional setup, where broadcasting from the server to the workers is non-negligible, and develop a corresponding method.
arXiv Detail & Related papers (2024-02-07T12:15:56Z) - Asynchronous Local-SGD Training for Language Modeling [37.02427878640653]
Local stochastic gradient descent (Local-SGD) is an approach to distributed optimization where each device performs more than one SGD update per round of communication.
This work presents an empirical study of asynchronous Local-SGD for training language models; that is, each worker updates the global parameters as soon as it has finished its SGD steps (a worker loop in this spirit is sketched after this list).
arXiv Detail & Related papers (2024-01-17T11:17:04Z) - AsGrad: A Sharp Unified Analysis of Asynchronous-SGD Algorithms [45.90015262911875]
We analyze asynchronous-type algorithms for distributed SGD in the heterogeneous setting.
As a by-product of our analysis, we also demonstrate guarantees for gradient-type algorithms such as SGD with random reshuffling.
arXiv Detail & Related papers (2023-10-31T13:44:53Z) - Straggler-Resilient Decentralized Learning via Adaptive Asynchronous Updates [28.813671194939225]
Fully decentralized optimization methods have been advocated as alternatives to the popular parameter-server framework.
We propose DSGD-AAU, a fully decentralized algorithm with adaptive asynchronous updates that adaptively determines the number of neighbor workers each worker communicates with.
We show that DSGD-AAU achieves a linear speedup for convergence and demonstrate its effectiveness via extensive experiments.
arXiv Detail & Related papers (2023-06-11T02:08:59Z) - FedSpeed: Larger Local Interval, Less Communication Round, and Higher
Generalization Accuracy [84.45004766136663]
Federated learning is an emerging distributed machine learning framework.
It suffers from non-vanishing biases introduced by locally inconsistent optima and from client drift caused by local over-fitting.
We propose a novel and practical method, FedSpeed, to alleviate the negative impacts posed by these problems.
arXiv Detail & Related papers (2023-02-21T03:55:29Z) - Straggler-Resilient Distributed Machine Learning with Dynamic Backup
Workers [9.919012793724628]
We propose a fully distributed algorithm to determine the number of backup workers for each worker.
Our algorithm achieves a linear speedup for convergence (i.e., convergence performance increases linearly with the number of workers).
arXiv Detail & Related papers (2021-02-11T21:39:53Z) - Communication-Efficient Robust Federated Learning Over Heterogeneous
Datasets [147.11434031193164]
This work investigates fault-resilient federated learning when the data samples are non-uniformly distributed across workers.
In the presence of adversarially faulty workers who may strategically corrupt datasets, the local messages exchanged can be unreliable.
The present work introduces a fault-resilient proximal gradient (FRPG) algorithm that relies on Nesterov's acceleration technique.
For strongly convex loss functions, FRPG and its local variant LFRPG have provably faster convergence rates than a benchmark robust aggregation algorithm.
arXiv Detail & Related papers (2020-06-17T16:50:33Z) - Variance Reduced Local SGD with Lower Communication Complexity [52.44473777232414]
We propose Variance Reduced Local SGD (VRL-SGD) to further reduce the communication complexity.
VRL-SGD achieves a linear iteration speedup with a lower communication complexity $O(T^{\frac{1}{2}} N^{\frac{3}{2}})$ even if workers access non-identical datasets.
arXiv Detail & Related papers (2019-12-30T08:15:21Z)
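As a rough illustration of the stale-gradient aggregation described in the DuDe-ASGD entry above, the sketch below keeps the most recent (possibly stale) gradient reported by every worker and averages all of them at each server step, so both the model copy and the data sample behind each buffered gradient may lag the current iterate. The buffer layout, step size, and update rule are assumptions made for illustration, not the paper's exact algorithm.

```python
import numpy as np

def dual_delayed_step(x, grad_buffer, arriving, lr=0.05):
    """One hypothetical server step that reuses the latest (possibly stale)
    gradient from every worker; assumes each worker has reported at least once.
    grad_buffer and arriving map worker_id -> gradient (numpy arrays)."""
    grad_buffer.update(arriving)                      # refresh only workers that just reported
    g = np.mean(list(grad_buffer.values()), axis=0)   # use stale gradients from all workers
    return x - lr * g
```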
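Similarly, the asynchronous Local-SGD entry above can be pictured with the hypothetical worker loop below: each worker pulls the current global model, runs H local SGD steps on its own data, and pushes its delta back as soon as it finishes, without waiting for other workers. The server.pull/server.push_delta interface and grad_fn are stand-ins, not an API from the paper.

```python
import numpy as np

def async_local_sgd_worker(server, grad_fn, data_iter, H=16, lr=0.01, rounds=100):
    """Hypothetical worker loop: run H local SGD steps, then push the delta
    immediately; `server` is assumed to expose pull() and push_delta()."""
    for _ in range(rounds):
        x = server.pull()                        # latest global parameters
        x_local = np.array(x, copy=True)
        for _ in range(H):                       # H local SGD steps on this worker's data
            x_local -= lr * grad_fn(x_local, next(data_iter))
        server.push_delta(x_local - x)           # contribute without waiting for others
```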