Related papers: Elastic Bulk Synchronous Parallel Model for Distributed Deep Learning

Elastic Bulk Synchronous Parallel Model for Distributed Deep Learning

URL: http://arxiv.org/abs/2001.01347v1
Date: Mon, 6 Jan 2020 01:05:50 GMT
Title: Elastic Bulk Synchronous Parallel Model for Distributed Deep Learning
Authors: Xing Zhao, Manos Papagelis, Aijun An, Bao Xin Chen, Junfeng Liu, Yonggang Hu
Abstract summary: The proposed model offers more flexibility and adaptability during the training phase, without sacrificing on the accuracy of the trained model. A thorough experimental evaluation demonstrates that our proposed ELASTICBSP model converges faster and to a higher accuracy than the classic BSP.
Score: 17.798727574458514
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The bulk synchronous parallel (BSP) is a celebrated synchronization model for general-purpose parallel computing that has successfully been employed for distributed training of machine learning models. A prevalent shortcoming of the BSP is that it requires workers to wait for the straggler at every iteration. To ameliorate this shortcoming of classic BSP, we propose ELASTICBSP a model that aims to relax its strict synchronization requirement. The proposed model offers more flexibility and adaptability during the training phase, without sacrificing on the accuracy of the trained model. We also propose an efficient method that materializes the model, named ZIPLINE. The algorithm is tunable and can effectively balance the trade-off between quality of convergence and iteration throughput, in order to accommodate different environments or applications. A thorough experimental evaluation demonstrates that our proposed ELASTICBSP model converges faster and to a higher accuracy than the classic BSP. It also achieves comparable (if not higher) accuracy than the other sensible synchronization models.

Related papers

Motif-2-12.7B-Reasoning: A Practitioner's Guide to RL Training Recipes [7.998815625852598]
We introduce a 12.7B parameter language model designed to bridge the gap between open-weight systems and proprietary frontier models in complex reasoning and long-context understanding.<n>Our approach combines memory-efficient infrastructure for 64K-token contexts using hybrid parallelism and kernel-level optimizations.<n>We detail a robust Reinforcement Learning Fine-Tuning pipeline that stabilizes training via difficulty-aware data filtering and mixed-policy trajectory reuse.
arXiv Detail & Related papers (2025-12-11T00:51:18Z)
Efficient Federated Learning with Timely Update Dissemination [54.668309196009204]
Federated Learning (FL) has emerged as a compelling methodology for the management of distributed data.<n>We propose an efficient FL approach that capitalizes on additional downlink bandwidth resources to ensure timely update dissemination.
arXiv Detail & Related papers (2025-07-08T14:34:32Z)
Corrected with the Latest Version: Make Robust Asynchronous Federated Learning Possible [2.663489028501814]
This paper proposes an asynchronous federated learning version correction algorithm based on knowledge distillation, named FedADT. FedADT applies knowledge distillation before aggregating gradients, using the latest global model to correct outdated information, thus effectively reducing the negative impact of outdated gradients on the training process. We conducted experimental comparisons with several classical algorithms, and the results demonstrate that FedADT achieves significant improvements over other asynchronous methods and outperforms all methods in terms of convergence speed.
arXiv Detail & Related papers (2025-04-05T06:54:13Z)
FedAST: Federated Asynchronous Simultaneous Training [27.492821176616815]
Federated Learning (FL) enables devices or clients to collaboratively train machine learning (ML) models without sharing their private data. Much of the existing work in FL focuses on efficiently learning a model for a single task. In this paper, we propose simultaneous training of multiple FL models using a common set of datasets.
arXiv Detail & Related papers (2024-06-01T05:14:20Z)
Stragglers-Aware Low-Latency Synchronous Federated Learning via Layer-Wise Model Updates [71.81037644563217]
Synchronous federated learning (FL) is a popular paradigm for collaborative edge learning. As some of the devices may have limited computational resources and varying availability, FL latency is highly sensitive to stragglers. We propose straggler-aware layer-wise federated learning (SALF) that leverages the optimization procedure of NNs via backpropagation to update the global model in a layer-wise fashion.
arXiv Detail & Related papers (2024-03-27T09:14:36Z)
Federated Learning based on Pruning and Recovery [0.0]
This framework integrates asynchronous learning algorithms and pruning techniques. It addresses the inefficiencies of traditional federated learning algorithms in scenarios involving heterogeneous devices. It also tackles the staleness issue and inadequate training of certain clients in asynchronous algorithms.
arXiv Detail & Related papers (2024-03-16T14:35:03Z)
OSP: Boosting Distributed Model Training with 2-stage Synchronization [24.702780532364056]
We propose a new model synchronization method named Overlapped Parallelization (OSP) OSP achieves efficient communication with a 2-stage synchronization approach and uses Local-Gradient-based. correction (LGP) to avoid accuracy loss caused by stale parameters. Results show that OSP can achieve up to 50% improvement in throughput without accuracy loss compared to popular synchronization models.
arXiv Detail & Related papers (2023-06-29T13:24:12Z)
Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST) IST is a recently proposed and highly effective technique for solving the aforementioned problems. We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters. Training these models is notoriously expensive due to the need for specialized HPC clusters. We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
Stochastic Coded Federated Learning: Theoretical Analysis and Incentive Mechanism Design [18.675244280002428]
We propose a novel FL framework named coded federated learning (SCFL) that leverages coded computing techniques. In SCFL, each edge device uploads a privacy-preserving coded dataset to the server, which is generated by adding noise to the projected local dataset. We show that SCFL learns a better model within the given time and achieves a better privacy-performance tradeoff than the baseline methods.
arXiv Detail & Related papers (2022-11-08T09:58:36Z)
Efficient and Light-Weight Federated Learning via Asynchronous Distributed Dropout [22.584080337157168]
Asynchronous learning protocols have regained attention lately, especially in the Federated Learning (FL) setup. We propose textttAsyncDrop, a novel asynchronous FL framework that utilizes dropout regularization to handle device heterogeneity in distributed settings. Overall, textttAsyncDrop achieves better performance compared to state of the art asynchronous methodologies.
arXiv Detail & Related papers (2022-10-28T13:00:29Z)
When to Update Your Model: Constrained Model-based Reinforcement Learning [50.74369835934703]
We propose a novel and general theoretical scheme for a non-decreasing performance guarantee of model-based RL (MBRL) Our follow-up derived bounds reveal the relationship between model shifts and performance improvement. A further example demonstrates that learning models from a dynamically-varying number of explorations benefit the eventual returns.
arXiv Detail & Related papers (2022-10-15T17:57:43Z)
Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios. We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z)
Device Scheduling and Update Aggregation Policies for Asynchronous Federated Learning [72.78668894576515]
Federated Learning (FL) is a newly emerged decentralized machine learning (ML) framework. We propose an asynchronous FL framework with periodic aggregation to eliminate the straggler issue in FL systems.
arXiv Detail & Related papers (2021-07-23T18:57:08Z)
Elastic Consistency: A General Consistency Model for Distributed Stochastic Gradient Descent [28.006781039853575]
A key element behind the progress of machine learning in recent years has been the ability to train machine learning models in largescale distributed-memory environments. In this paper, we introduce general convergence methods used in practice to train large-scale machine learning models. Our framework, called elastic elastic bounds, enables us to derive convergence bounds for a variety of distributed SGD methods.
arXiv Detail & Related papers (2020-01-16T16:10:58Z)

This list is automatically generated from the titles and abstracts of the papers in this site.