Pseudo-Asynchronous Local SGD: Robust and Efficient Data-Parallel Training
- URL: http://arxiv.org/abs/2504.18454v1
- Date: Fri, 25 Apr 2025 16:06:08 GMT
- Title: Pseudo-Asynchronous Local SGD: Robust and Efficient Data-Parallel Training
- Authors: Hiroki Naganuma, Xinzhi Zhang, Man-Chung Yue, Ioannis Mitliagkas, Philipp A. Witte, Russell J. Hewett, Yin Tat Lee
- Abstract summary: We propose a method called Pseudo-Asynchronous Local SGD (PALSGD) to improve the efficiency of data-parallel training. PALSGD allows the use of longer synchronization intervals compared to standard Local SGD. Our results show that PALSGD achieves better performance in less time compared to existing methods.
- Score: 25.025458975145757
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Following AI scaling trends, frontier models continue to grow in size and continue to be trained on larger datasets. Training these models requires huge investments in exascale computational resources, which has in turn driven the development of distributed deep learning methods. Data parallelism is an essential approach to speed up training, but it requires frequent global communication between workers, which can bottleneck training at the largest scales. In this work, we propose a method called Pseudo-Asynchronous Local SGD (PALSGD) to improve the efficiency of data-parallel training. PALSGD is an extension of Local SGD (Stich, 2018) and DiLoCo (Douillard et al., 2023), designed to further reduce communication frequency by introducing a pseudo-synchronization mechanism. PALSGD allows the use of longer synchronization intervals compared to standard Local SGD. Despite the reduced communication frequency, the pseudo-synchronization approach ensures that model consistency is maintained, leading to performance results comparable to those achieved with more frequent synchronization. Furthermore, we provide a theoretical analysis of PALSGD, establishing its convergence and deriving its convergence rate. This analysis offers insights into the algorithm's behavior and performance guarantees. We evaluated PALSGD on image classification and language modeling tasks. Our results show that PALSGD achieves better performance in less time compared to existing methods like Distributed Data Parallel (DDP) and DiLoCo. Notably, PALSGD trains 18.4% faster than DDP on ImageNet-1K with ResNet-50, 24.4% faster than DDP on TinyStories with GPT-Neo-125M, and 21.1% faster than DDP on TinyStories with GPT-Neo-8M.
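The abstract describes the pseudo-synchronization mechanism only at a high level, so the following is a minimal runnable sketch of the general idea behind local SGD with an occasional, communication-free pull toward a locally held copy of the global model. It is not the paper's reference implementation: the exact PALSGD update rule, its DiLoCo-style outer step, and all hyperparameters below (the synchronization interval H, pseudo-sync probability p, and mixing rate alpha) are illustrative assumptions.

```python
# Toy sketch of local SGD with an added pseudo-synchronization step.
# This is NOT the paper's reference implementation: the exact PALSGD
# update rule, its DiLoCo-style outer optimizer, and the hyperparameters
# (H, p, alpha below) are illustrative assumptions for exposition only.
import numpy as np

rng = np.random.default_rng(0)
K, D = 4, 10          # workers, parameter dimension
H = 64                # synchronization interval (longer than plain Local SGD)
p = 0.25              # chance of a local pseudo-sync step between syncs
alpha, lr = 0.1, 0.05 # pseudo-sync mixing rate, SGD learning rate
targets = rng.normal(size=(K, D))   # each worker's local optimum
x = np.zeros((K, D))                # per-worker model replicas
x_global = np.zeros(D)              # last globally averaged model

def local_grad(k, xk):
    """Stochastic gradient of worker k's quadratic loss ||x - target_k||^2 / 2."""
    return (xk - targets[k]) + 0.01 * rng.normal(size=D)

for step in range(1, 10 * H + 1):
    for k in range(K):
        x[k] -= lr * local_grad(k, x[k])      # ordinary local SGD step
        if rng.random() < p:
            # Pseudo-synchronization: drift toward the *stale* global model
            # copy held locally; no communication happens here.
            x[k] += alpha * (x_global - x[k])
    if step % H == 0:
        # Infrequent true synchronization (the only communication round).
        x_global = x.mean(axis=0)
        x[:] = x_global

print("distance of global model to average optimum:",
      np.linalg.norm(x_global - targets.mean(axis=0)))
```

The sketch only illustrates why longer synchronization intervals remain workable: between true synchronizations, the replicas are repeatedly nudged toward a shared reference instead of drifting freely.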
Related papers
- Ringmaster ASGD: The First Asynchronous SGD with Optimal Time Complexity [92.1840862558718]
Ringmaster ASGD achieves optimal time complexity under arbitrarily heterogeneous computation times. This makes it the first asynchronous SGD method to meet the theoretical lower bounds for time complexity in such scenarios.
arXiv Detail & Related papers (2025-01-27T16:07:26Z)
- Parallel Sequence Modeling via Generalized Spatial Propagation Network [80.66202109995726]
Generalized Spatial Propagation Network (GSPN) is a new attention mechanism optimized for vision tasks that inherently captures 2D spatial structures. GSPN overcomes limitations by directly operating on spatially coherent image data and forming dense pairwise connections through a line-scan approach. GSPN achieves superior spatial fidelity and state-of-the-art performance in vision tasks, including ImageNet classification, class-guided image generation, and text-to-image generation.
arXiv Detail & Related papers (2025-01-21T18:56:19Z)
- Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates [1.9241821314180372]
Asynchronous stochastic gradient descent (ASGD) methods can improve training speed, but are sensitive to delays due to both communication and throughput differences. PD-ASGD uses separate threads for the forward and backward passes, decoupling the updates and allowing for a higher ratio of forward to backward threads. Our approach yields close to state-of-the-art results while running up to $5.95\times$ faster than synchronous data parallelism in the presence of delays.
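As a rough illustration of what decoupled forward and backward threads can look like, here is a hedged toy sketch; it is not the PD-ASGD implementation from the paper. A forward thread streams batches and residuals into a queue while a backward thread consumes them and applies possibly stale updates to a shared linear model. The queue, lock, batch size, and learning rate are assumptions made only for this sketch.

```python
# Minimal toy sketch (not the PD-ASGD implementation): a forward thread
# streams residuals into a queue while a separate backward thread
# consumes them and applies (possibly slightly stale) parameter updates.
import queue
import threading
import numpy as np

rng = np.random.default_rng(0)
D = 8
w_true = rng.normal(size=D)
w = np.zeros(D)                     # shared linear-model parameters
lock = threading.Lock()
work = queue.Queue(maxsize=16)
STEPS, LR = 500, 0.05

def forward_worker():
    """Forward passes: compute residuals with the current (maybe stale) w."""
    for _ in range(STEPS):
        X = rng.normal(size=(32, D))
        y = X @ w_true + 0.01 * rng.normal(size=32)
        with lock:
            w_snapshot = w.copy()   # may be stale by a few updates
        residual = X @ w_snapshot - y
        work.put((X, residual))
    work.put(None)                  # sentinel: no more batches

def backward_worker():
    """Backward passes: turn queued residuals into gradient updates."""
    while True:
        item = work.get()
        if item is None:
            break
        X, residual = item
        grad = X.T @ residual / len(residual)
        with lock:
            w[:] = w - LR * grad

t_f = threading.Thread(target=forward_worker)
t_b = threading.Thread(target=backward_worker)
t_f.start(); t_b.start()
t_f.join(); t_b.join()
print("parameter error:", np.linalg.norm(w - w_true))
```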
arXiv Detail & Related papers (2024-10-08T12:32:36Z)
- Stragglers-Aware Low-Latency Synchronous Federated Learning via Layer-Wise Model Updates [71.81037644563217]
Synchronous federated learning (FL) is a popular paradigm for collaborative edge learning.
As some of the devices may have limited computational resources and varying availability, FL latency is highly sensitive to stragglers.
We propose straggler-aware layer-wise federated learning (SALF) that leverages the optimization procedure of NNs via backpropagation to update the global model in a layer-wise fashion.
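One plausible reading of a layer-wise, straggler-aware update is sketched below; it is an assumption-laden illustration rather than the SALF algorithm itself. Because backpropagation produces gradients from the output layer backwards, a straggler that stops early still holds gradients for the last layers, and the sketch simply averages each layer over whichever users reported it; the depths, layer shapes, and plain per-layer averaging are assumptions.

```python
# Hedged sketch (not the SALF implementation): stragglers that stop
# backprop early still contribute gradients for the layers their backward
# pass reached (the last layers), and the server averages each layer over
# whichever users reported it. Depths and shapes below are assumptions.
import numpy as np

rng = np.random.default_rng(0)
L, U = 6, 10                      # layers, users
layer_shapes = [(16, 16)] * L
global_model = [np.zeros(s) for s in layer_shapes]
lr = 0.1

# Simulated per-user gradients and how many layers (counted from the
# output) each user's backward pass covered before its deadline.
user_grads = [[rng.normal(size=s) for s in layer_shapes] for _ in range(U)]
layers_reached = rng.integers(1, L + 1, size=U)

for l in range(L):
    # A user contributes to layer l only if its backward pass got that far back.
    contributors = [u for u in range(U) if L - layers_reached[u] <= l]
    if contributors:
        avg = np.mean([user_grads[u][l] for u in contributors], axis=0)
        global_model[l] -= lr * avg   # layer-wise global update

print("users contributing per layer:",
      [sum(L - layers_reached[u] <= l for u in range(U)) for l in range(L)])
```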
arXiv Detail & Related papers (2024-03-27T09:14:36Z)
- OSP: Boosting Distributed Model Training with 2-stage Synchronization [24.702780532364056]
We propose a new model synchronization method named Overlapped Parallelization (OSP).
OSP achieves efficient communication with a 2-stage synchronization approach and uses Local-Gradient-based Parameter correction (LGP) to avoid accuracy loss caused by stale parameters.
Results show that OSP can achieve up to 50% improvement in throughput without accuracy loss compared to popular synchronization models.
arXiv Detail & Related papers (2023-06-29T13:24:12Z)
- Efficient Parallel Split Learning over Resource-constrained Wireless Edge Networks [44.37047471448793]
In this paper, we advocate the integration of the edge computing paradigm and parallel split learning (PSL).
We propose an innovative PSL framework, namely efficient parallel split learning (EPSL), to accelerate model training.
We show that the proposed EPSL framework significantly decreases the training latency needed to achieve a target accuracy.
arXiv Detail & Related papers (2023-03-26T16:09:48Z)
- DR-DSGD: A Distributionally Robust Decentralized Learning Algorithm over Graphs [54.08445874064361]
We propose to solve a regularized distributionally robust learning problem in the decentralized setting.
By adding a Kullback-Leibler regularization function to the robust min-max optimization problem, the learning problem can be reduced to a modified robust problem.
We show that our proposed algorithm can improve the worst distribution test accuracy by up to $10\%$.
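For intuition, the standard identity for KL-regularized distributionally robust objectives is shown below as a hedged illustration; the paper's exact formulation and its decentralized treatment may differ, and the uniform reference distribution and temperature $\lambda$ here are assumptions.

```latex
% Per-worker losses f_i(\theta), probability simplex \Delta_n,
% uniform reference u = (1/n, ..., 1/n), temperature \lambda > 0:
\max_{q \in \Delta_n} \; \sum_{i=1}^{n} q_i f_i(\theta) - \lambda\, \mathrm{KL}(q \,\|\, u)
  \;=\; \lambda \log\!\left( \frac{1}{n} \sum_{i=1}^{n} \exp\!\big(f_i(\theta)/\lambda\big) \right)
```

Under this identity the min-max problem collapses to minimizing a smooth log-sum-exp risk in $\theta$, which ordinary decentralized SGD machinery can then handle.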
arXiv Detail & Related papers (2022-08-29T18:01:42Z)
- Asynchronous Decentralized Distributed Training of Acoustic Models [43.34839658423581]
We study three variants of asynchronous decentralized parallel SGD (ADPSGD).
We show that ADPSGD with fixed and randomized communication patterns copes well with slow learners.
In particular, using the delay-by-one strategy, we can train the acoustic model in less than 2 hours.
arXiv Detail & Related papers (2021-10-21T15:14:58Z)
- Training Recommender Systems at Scale: Communication-Efficient Model and Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least $100\times$ and $20\times$ during data parallelism (DP) and model parallelism (MP), respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
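As a generic illustration of threshold-based gradient compression with error feedback, here is a short sketch; it is not the paper's DCT algorithm, which chooses its thresholds dynamically and also applies to model parallelism. The fixed threshold tau and the residual buffer below are illustrative assumptions.

```python
# Generic sketch of threshold-based gradient compression with error
# feedback (NOT the paper's DCT algorithm). The fixed threshold tau and
# the residual buffer are illustrative assumptions; DCT chooses its
# thresholds dynamically.
import numpy as np

rng = np.random.default_rng(0)
D, tau = 1000, 0.5
residual = np.zeros(D)            # error feedback: carry what we didn't send

def compress(grad):
    """Send only entries whose magnitude exceeds tau; bank the rest."""
    global residual
    corrected = grad + residual
    mask = np.abs(corrected) > tau
    sent = np.where(mask, corrected, 0.0)
    residual = corrected - sent   # remember the suppressed mass
    return sent, mask.mean()      # sparse update + fraction transmitted

grad = rng.normal(size=D)
sent, frac = compress(grad)
print(f"transmitted {frac:.1%} of entries")
```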
arXiv Detail & Related papers (2020-10-18T01:44:42Z)
- HPSGD: Hierarchical Parallel SGD With Stale Gradients Featuring [18.8426865970643]
A novel Hierarchical Parallel SGD (HPSGD) strategy is proposed to boost the distributed training of deep neural networks (DNNs).
Experiments demonstrate that the proposed HPSGD approach substantially accelerates distributed DNN training, reduces the disturbance from stale gradients, and achieves better accuracy within a given fixed wall-clock time.
arXiv Detail & Related papers (2020-09-06T10:17:56Z)
- DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging [4.652668321425679]
The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/backward propagation while gradients are synchronized.
DaSGD parallelizes SGD and forward/back propagations to hide 100% of the communication overhead.
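A hedged toy sketch of the delayed-averaging idea follows; it is not the DaSGD algorithm from the paper. Workers keep taking local steps and mix in the global average computed one step earlier, so the all-reduce that produces it could run concurrently with the current forward/backward pass; the mixing coefficient beta is an illustrative assumption.

```python
# Toy sketch of delayed averaging (not the paper's DaSGD algorithm):
# workers take local SGD steps and mix in the global average from the
# *previous* step, so in a real system the all-reduce can overlap with
# the current forward/backward pass. `beta` is an assumed mixing rate.
import numpy as np

rng = np.random.default_rng(0)
K, D, lr, beta = 4, 10, 0.05, 0.5
targets = rng.normal(size=(K, D))   # each worker's local optimum
x = np.zeros((K, D))
delayed_avg = x.mean(axis=0)        # average from the previous step

for step in range(500):
    grads = (x - targets) + 0.01 * rng.normal(size=(K, D))
    new_avg = x.mean(axis=0)        # this all-reduce overlaps with compute
    x = x - lr * grads              # local SGD step, never stalls
    x = (1 - beta) * x + beta * delayed_avg   # apply *last* step's average
    delayed_avg = new_avg

print("distance to average optimum:",
      np.linalg.norm(x.mean(axis=0) - targets.mean(axis=0)))
```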
arXiv Detail & Related papers (2020-05-31T05:43:50Z)
- Improving Efficiency in Large-Scale Decentralized Distributed Training [58.80224380923698]
We propose techniques to accelerate (A)D-PSGD based training by improving the spectral gap while minimizing the communication cost.
We demonstrate the effectiveness of our proposed techniques by running experiments on the 2000-hour Switchboard speech recognition task and the ImageNet computer vision task.
arXiv Detail & Related papers (2020-02-04T04:29:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.