OSP: Boosting Distributed Model Training with 2-stage Synchronization
- URL: http://arxiv.org/abs/2306.16926v2
- Date: Sun, 9 Jul 2023 16:36:43 GMT
- Title: OSP: Boosting Distributed Model Training with 2-stage Synchronization
- Authors: Zixuan Chen, Lei Shi, Xuandong Liu, Jiahui Li, Sen Liu, Yang Xu
- Abstract summary: We propose a new model synchronization method named Overlapped Synchronization Parallel (OSP).
OSP achieves efficient communication with a 2-stage synchronization approach and uses Local-Gradient-based Parameter correction (LGP) to avoid accuracy loss caused by stale parameters.
Results show that OSP can achieve up to 50% improvement in throughput without accuracy loss compared to popular synchronization models.
- Score: 24.702780532364056
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distributed deep learning (DDL) is a promising research area, which aims to
increase the efficiency of training deep learning tasks with large
datasets and models. As the computation capability of DDL nodes continues to
increase, the network connection between nodes is becoming a major bottleneck.
Various methods of gradient compression and improved model synchronization have
been proposed to address this bottleneck in Parameter-Server-based DDL.
However, these two types of methods can result in accuracy loss due to
discarded gradients and have limited enhancement on the throughput of model
synchronization, respectively. To address these challenges, we propose a new
model synchronization method named Overlapped Synchronization Parallel (OSP),
which achieves efficient communication with a 2-stage synchronization approach
and uses Local-Gradient-based Parameter correction (LGP) to avoid accuracy loss
caused by stale parameters. The prototype of OSP has been implemented using
PyTorch and evaluated on commonly used deep learning models and datasets with a
9-node testbed. Evaluation results show that OSP can achieve up to 50%
improvement in throughput without accuracy loss compared to popular
synchronization models.
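The abstract does not spell out how LGP works, so the following is only a toy sketch of the general idea of correcting stale parameters with a local gradient, not the authors' implementation. It assumes a per-worker quadratic loss and plain SGD; the helper names (`sgd_step`, `mean_gradient`, `quadratic_grad`, `max_abs_diff`) and all numeric values are hypothetical. Each worker holds parameters that are one global update behind, applies its own gradient as a local estimate of that missing update, and the result is compared with the true fresh parameters.

```python
from typing import List


def sgd_step(params: List[float], grad: List[float], lr: float) -> List[float]:
    """Apply one plain SGD update to a parameter vector."""
    return [p - lr * g for p, g in zip(params, grad)]


def mean_gradient(grads: List[List[float]]) -> List[float]:
    """Average per-worker gradients, as a parameter server would."""
    n = len(grads)
    return [sum(col) / n for col in zip(*grads)]


def quadratic_grad(params: List[float], target: List[float]) -> List[float]:
    """Gradient of the toy loss 0.5 * ||params - target||^2 (an assumption)."""
    return [p - t for p, t in zip(params, target)]


def max_abs_diff(a: List[float], b: List[float]) -> float:
    """Largest element-wise distance between two parameter vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical setup: three workers whose local data pull toward different targets.
stale = [0.0, 0.0]  # parameters every worker currently holds (one update behind)
targets = [[1.0, -1.0], [2.0, -2.0], [3.0, -3.0]]
lr = 0.1

local_grads = [quadratic_grad(stale, t) for t in targets]

# What the server would produce after full synchronization.
fresh = sgd_step(stale, mean_gradient(local_grads), lr)

# Each worker's local estimate of the fresh parameters, using only its own gradient.
corrected = [sgd_step(stale, g, lr) for g in local_grads]

stale_error = max_abs_diff(stale, fresh)
corrected_errors = [max_abs_diff(c, fresh) for c in corrected]

print("error of uncorrected stale parameters:", stale_error)
print("errors after local-gradient correction:", corrected_errors)
```

In this toy setting every corrected estimate lands closer to the fresh parameters than the uncorrected stale copy, because the local gradients point in roughly the same direction as their average. That property is an artifact of the chosen targets and is not guaranteed in general.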
Related papers
- Truncated Consistency Models [57.50243901368328]
Training consistency models requires learning to map all intermediate points along PF ODE trajectories to their corresponding endpoints.
We empirically find that this training paradigm limits the one-step generation performance of consistency models.
We propose a new parameterization of the consistency function and a two-stage training procedure that prevents the truncated-time training from collapsing to a trivial solution.
arXiv Detail & Related papers (2024-10-18T22:38:08Z)
- Boosting Asynchronous Decentralized Learning with Model Fragmentation [1.6053176639259055]
DivShare is a novel DL algorithm that achieves fast model convergence in the presence of communication stragglers.
We experimentally evaluate DivShare against two state-of-the-art DL baselines, AD-PSGD and Swift.
We find that DivShare with communication stragglers lowers time-to-accuracy by up to 3.9x compared to AD-PSGD on the CIFAR-10 dataset.
arXiv Detail & Related papers (2024-10-16T18:03:52Z)
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates [1.9241821314180372]
One major shortcoming of backpropagation is the interlocking between the forward and backward phases of the algorithm.
We propose a method that parallelises SGD updates across the layers of a model by asynchronously updating them from multiple threads.
We show that this approach yields close to state-of-the-art results while running up to 2.97x faster than Hogwild! scaled on multiple devices.
arXiv Detail & Related papers (2024-10-08T12:32:36Z)
- FedDIP: Federated Learning with Extreme Dynamic Pruning and Incremental Regularization [5.182014186927254]
Federated Learning (FL) has been successfully adopted for distributed training and inference of large-scale Deep Neural Networks (DNNs).
We contribute with a novel FL framework (coined FedDIP) which combines (i) dynamic model pruning with error feedback to eliminate redundant information exchange.
We provide convergence analysis of FedDIP and report on a comprehensive performance and comparative assessment against state-of-the-art methods.
arXiv Detail & Related papers (2023-09-13T08:51:19Z)
- Design and Prototyping Distributed CNN Inference Acceleration in Edge Computing [85.74517957717363]
HALP accelerates inference by designing a seamless collaboration among edge devices (EDs) in Edge Computing.
Experiments show that the distributed inference HALP achieves 1.7x inference acceleration for VGG-16.
It is shown that the model selection with distributed inference HALP can significantly improve service reliability.
arXiv Detail & Related papers (2022-11-24T19:48:30Z)
- Dynamic Network-Assisted D2D-Aided Coded Distributed Learning [59.29409589861241]
We propose a novel device-to-device (D2D)-aided coded federated learning method (D2D-CFL) for load balancing across devices.
We derive an optimal compression rate for achieving minimum processing time and establish its connection with the convergence time.
Our proposed method is beneficial for real-time collaborative applications, where the users continuously generate training data.
arXiv Detail & Related papers (2021-11-26T18:44:59Z)
- HPSGD: Hierarchical Parallel SGD With Stale Gradients Featuring [18.8426865970643]
A novel Hierarchical Parallel SGD (HPSGD) strategy is proposed to boost the distributed training process of the deep neural network (DNN).
Experiments demonstrate that the proposed HPSGD approach substantially boosts distributed DNN training, reduces the disturbance of stale gradients, and achieves better accuracy in a given fixed wall-time.
arXiv Detail & Related papers (2020-09-06T10:17:56Z)
- PSO-PS: Parameter Synchronization with Particle Swarm Optimization for Distributed Training of Deep Neural Networks [16.35607080388805]
We propose a new algorithm that integrates Particle Swarm Optimization into the distributed training process of Deep Neural Networks (DNNs).
In the proposed algorithm, a computing work is encoded by a particle, and the weights of DNNs and the training loss are modeled by the particle attributes.
At each synchronization stage, the weights are updated by PSO from the sub weights gathered from all workers, instead of averaging the weights or the gradients.
arXiv Detail & Related papers (2020-09-06T05:18:32Z)
- Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z)
- Recent Developments Combining Ensemble Smoother and Deep Generative Networks for Facies History Matching [58.720142291102135]
This research project focuses on the use of autoencoder networks to construct a continuous parameterization for facies models.
We benchmark seven different formulations, including VAE, generative adversarial network (GAN), Wasserstein GAN, variational auto-encoding GAN, principal component analysis (PCA) with cycle GAN, PCA with transfer style network and VAE with style loss.
arXiv Detail & Related papers (2020-05-08T21:32:42Z)