Boosting Asynchronous Decentralized Learning with Model Fragmentation
- URL: http://arxiv.org/abs/2410.12918v1
- Date: Wed, 16 Oct 2024 18:03:52 GMT
- Title: Boosting Asynchronous Decentralized Learning with Model Fragmentation
- Authors: Sayan Biswas, Anne-Marie Kermarrec, Alexis Marouani, Rafael Pires, Rishi Sharma, Martijn De Vos
- Abstract summary: DivShare is a novel DL algorithm that achieves fast model convergence in the presence of communication stragglers.
We experimentally evaluate DivShare against two state-of-the-art DL baselines, AD-PSGD and Swift.
We find that DivShare with communication stragglers lowers time-to-accuracy by up to 3.9x compared to AD-PSGD on the CIFAR-10 dataset.
- Score: 1.6053176639259055
- Abstract: Decentralized learning (DL) is an emerging technique that allows nodes on the web to collaboratively train machine learning models without sharing raw data. Dealing with stragglers, i.e., nodes with slower compute or communication than others, is a key challenge in DL. We present DivShare, a novel asynchronous DL algorithm that achieves fast model convergence in the presence of communication stragglers. DivShare achieves this by having nodes fragment their models into parameter subsets and send, in parallel to computation, each subset to a random sample of other nodes instead of sequentially exchanging full models. The transfer of smaller fragments allows more efficient usage of the collective bandwidth and enables nodes with slow network links to quickly contribute with at least some of their model parameters. By theoretically proving the convergence of DivShare, we provide, to the best of our knowledge, the first formal proof of convergence for a DL algorithm that accounts for the effects of asynchronous communication with delays. We experimentally evaluate DivShare against two state-of-the-art DL baselines, AD-PSGD and Swift, and with two standard datasets, CIFAR-10 and MovieLens. We find that DivShare with communication stragglers lowers time-to-accuracy by up to 3.9x compared to AD-PSGD on the CIFAR-10 dataset. Compared to baselines, DivShare also achieves up to 19.4% better accuracy and 9.5% lower test loss on the CIFAR-10 and MovieLens datasets, respectively.
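To make the fragmentation idea concrete, here is a minimal sketch in Python, assuming a flat parameter vector, a fixed fragment count, and simple averaging on receipt; the function names, the sampling policy, and the merge rule are illustrative assumptions rather than DivShare's actual specification.

```python
import numpy as np

rng = np.random.default_rng(0)

class Node:
    """A single participant holding a flat parameter vector."""
    def __init__(self, dim: int):
        self.params = rng.normal(size=dim)

    def receive(self, fragment):
        idx, values = fragment
        # Merge the incoming subset by averaging with the local copy;
        # DivShare's exact aggregation rule may differ.
        self.params[idx] = 0.5 * (self.params[idx] + values)

def make_fragments(params: np.ndarray, num_fragments: int):
    """Split a parameter vector into disjoint (indices, values) subsets."""
    perm = rng.permutation(params.size)
    return [(idx, params[idx]) for idx in np.array_split(perm, num_fragments)]

def scatter(fragments, peers, sample_size: int):
    """Send each fragment to a random sample of peers. In the real
    protocol this happens in parallel with local training."""
    for frag in fragments:
        for j in rng.choice(len(peers), size=sample_size, replace=False):
            peers[j].receive(frag)

nodes = [Node(dim=1000) for _ in range(8)]
scatter(make_fragments(nodes[0].params, num_fragments=4),
        peers=nodes[1:], sample_size=3)
```

In the actual protocol the fragment transfers overlap with local computation, which is what lets nodes behind slow links still contribute at least part of their parameters.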
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
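The summary names adaptive compression as the key mechanism; the sketch below illustrates one plausible form of it, choosing a quantization bit-width from the measured link bandwidth. The `pick_bits` policy and the uniform quantizer are assumptions for illustration, not FusionLLM's actual scheme.

```python
import numpy as np

def pick_bits(bandwidth_mbps: float) -> int:
    """Assumed policy: slower links get more aggressive compression."""
    if bandwidth_mbps < 10:
        return 4
    if bandwidth_mbps < 100:
        return 8
    return 16

def quantize(x: np.ndarray, bits: int):
    """Uniform quantization of x to 2**bits levels."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (2**bits - 1)
    if scale == 0.0:                  # constant tensor: nothing to scale
        scale = 1.0
    codes = np.round((x - lo) / scale).astype(np.uint16)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return codes.astype(np.float32) * scale + lo

grad = np.random.default_rng(1).normal(size=1 << 20).astype(np.float32)
codes, lo, scale = quantize(grad, pick_bits(bandwidth_mbps=25.0))
restored = dequantize(codes, lo, scale)
```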
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Communication-Efficient Distributed Deep Learning via Federated Dynamic Averaging [1.4748100900619232]
Federated Dynamic Averaging (FDA) is a communication-efficient distributed deep learning (DDL) strategy.
FDA reduces communication cost by orders of magnitude, compared to both traditional and cutting-edge algorithms.
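A sketch of the dynamic-averaging idea: keep training locally and pay for a global average only when the models are estimated to have drifted apart. The drift statistic and threshold test below are illustrative assumptions, not FDA's exact monitoring rule.

```python
import numpy as np

def drift(models) -> float:
    """Mean squared distance of local models from their average. In a
    real deployment this statistic would itself be estimated cheaply
    rather than computed exactly; the exact form here is illustrative."""
    center = np.mean(models, axis=0)
    return float(np.mean([np.sum((m - center) ** 2) for m in models]))

def maybe_average(models, threshold: float):
    """Trigger a full synchronization only when models have drifted."""
    if drift(models) > threshold:
        avg = np.mean(models, axis=0)
        return [avg.copy() for _ in models], True   # synchronized
    return models, False                            # keep training locally
```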
arXiv Detail & Related papers (2024-05-31T16:34:11Z)
- Communication-Efficient Decentralized Federated Learning via One-Bit Compressive Sensing [52.402550431781805]
Decentralized federated learning (DFL) has gained popularity due to its practicality across various applications.
Compared to the centralized version, training a shared model among a large number of nodes in DFL is more challenging.
We develop a novel algorithm based on the framework of the inexact alternating direction method (iADM).
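As a toy illustration of the one-bit compressive sensing primitive named in the title: only the signs of random projections of a sparse vector are transmitted, and a very crude normalized back-projection recovers its direction. The dimensions and the recovery step are simplified assumptions; the paper's iADM-based algorithm would replace the recovery.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, k = 512, 256, 10            # signal dim, measurements, sparsity

x = np.zeros(d)                   # a k-sparse unit-norm signal
x[rng.choice(d, k, replace=False)] = rng.normal(size=k)
x /= np.linalg.norm(x)

A = rng.normal(size=(m, d))       # random measurement matrix
y = np.sign(A @ x)                # one bit per measurement

x_hat = A.T @ y                   # crude estimate; iADM would refine it
x_hat /= np.linalg.norm(x_hat)
print("cosine similarity:", float(x_hat @ x))
```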
arXiv Detail & Related papers (2023-08-31T12:22:40Z)
- OSP: Boosting Distributed Model Training with 2-stage Synchronization [24.702780532364056]
We propose a new model synchronization method named Overlapped Synchronization Parallel (OSP).
OSP achieves efficient communication with a 2-stage synchronization approach and uses Local-Gradient-based Parameter correction (LGP) to avoid accuracy loss caused by stale parameters.
Results show that OSP can achieve up to 50% improvement in throughput without accuracy loss compared to popular synchronization models.
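A rough sketch of how such an overlap might look: local steps proceed on stale parameters while synchronization is in flight, and the freshly received parameters are then corrected by replaying the local gradients. Reading LGP as gradient replay is an assumption on our part, not OSP's published specification.

```python
import numpy as np

def local_step(params, grad_fn, lr=0.1):
    """One local SGD step; returns new params and the gradient used."""
    g = grad_fn(params)
    return params - lr * g, g

def lgp_correct(synced_params, local_grads, lr=0.1):
    """Re-apply gradients computed while synchronization was in flight,
    so the freshly synced parameters are no longer stale (assumed
    reading of 'Local-Gradient-based Parameter correction')."""
    for g in local_grads:
        synced_params = synced_params - lr * g
    return synced_params

# Toy usage: quadratic loss, two local steps overlap one sync round.
grad_fn = lambda w: 2.0 * w
w = np.ones(4)
pending = []
for _ in range(2):                 # computation overlapping communication
    w, g = local_step(w, grad_fn)
    pending.append(g)
synced = np.full(4, 0.9)           # parameters arriving from peers
w = lgp_correct(synced, pending)
```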
arXiv Detail & Related papers (2023-06-29T13:24:12Z)
- Communication-Efficient Federated Learning With Data and Client Heterogeneity [22.432529149142976]
Federated Learning (FL) enables large-scale distributed training of machine learning models.
Executing FL at scale, however, comes with inherent practical challenges.
We present the first variant of the classic federated averaging (FedAvg) algorithm that tolerates both data heterogeneity and client heterogeneity.
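For reference, here is a minimal sketch of the classic FedAvg baseline this entry builds on: clients train locally for a few steps and the server averages the resulting models, weighted by local dataset size. The paper's variant modifies this loop in ways the summary does not spell out.

```python
import numpy as np

def fedavg_round(global_params, clients, local_steps=5, lr=0.1):
    """One FedAvg round. clients: (data, grad_fn, num_samples) triples."""
    updates, weights = [], []
    for data, grad_fn, n in clients:
        w = global_params.copy()
        for _ in range(local_steps):           # local training
            w -= lr * grad_fn(w, data)
        updates.append(w)
        weights.append(float(n))
    # Server: average client models weighted by local dataset size.
    return np.average(updates, axis=0, weights=weights)

# Toy usage: each client's gradient pulls the model toward its data mean.
clients = [(c, lambda w, d: w - d, 10 * (i + 1))
           for i, c in enumerate([np.zeros(3), np.ones(3)])]
new_global = fedavg_round(np.full(3, 0.5), clients)
```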
arXiv Detail & Related papers (2022-06-20T22:39:39Z)
- Asynchronous Parallel Incremental Block-Coordinate Descent for Decentralized Machine Learning [55.198301429316125]
Machine learning (ML) is a key technique for big-data-driven modelling and analysis in massive Internet of Things (IoT)-based intelligent and ubiquitous computing.
For fast-growing applications and data volumes, distributed learning is a promising emerging paradigm, since it is often impractical or inefficient to share or aggregate data.
This paper studies the problem of training an ML model over decentralized systems, where data are distributed over many user devices.
arXiv Detail & Related papers (2022-02-07T15:04:15Z)
- Two-Bit Aggregation for Communication Efficient and Differentially Private Federated Learning [79.66767935077925]
In federated learning (FL), a machine learning model is trained on multiple nodes in a decentralized manner, while keeping the data local and not shared with other nodes.
The information sent from the nodes to the server may reveal some details about each node's local data, thus raising privacy concerns.
A novel two-bit aggregation algorithm is proposed with guaranteed differential privacy and reduced uplink communication overhead.
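A hedged sketch of what a two-bit, differentially private update can look like: clip each update, add Gaussian noise (the standard Gaussian-mechanism recipe), and stochastically round every coordinate to one of four levels so it fits in two bits. The level set, clipping rule, and noise scale below are illustrative; the paper's exact mechanism may differ.

```python
import numpy as np

rng = np.random.default_rng(2)
LEVELS = np.array([-1.0, -1/3, 1/3, 1.0])   # 4 values -> 2 bits each

def two_bit_privatize(g: np.ndarray, clip=1.0, sigma=0.5):
    # Norm-clip, then perturb with Gaussian noise for differential privacy.
    g = np.clip(g / max(np.linalg.norm(g) / clip, 1.0), -clip, clip)
    g = g + rng.normal(scale=sigma * clip, size=g.shape)
    g = np.clip(g / clip, -1.0, 1.0)
    # Stochastic rounding to the two nearest levels (unbiased).
    idx = np.searchsorted(LEVELS, g).clip(1, 3)
    lo, hi = LEVELS[idx - 1], LEVELS[idx]
    p_hi = (g - lo) / (hi - lo)
    return np.where(rng.random(g.shape) < p_hi, idx, idx - 1).astype(np.uint8)

def dequantize(codes):
    return LEVELS[codes]
```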
arXiv Detail & Related papers (2021-10-06T19:03:58Z)
- Federated Action Recognition on Heterogeneous Embedded Devices [16.88104153104136]
In this work, we enable clients with limited computing power to perform action recognition, a computationally heavy task.
We first perform model compression at the central server through knowledge distillation on a large dataset.
Fine-tuning is subsequently required because the limited data present in smaller datasets is not adequate for action recognition models to learn complex temporal features.
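The distillation step can be summarized by the usual temperature-softened loss, sketched below with PyTorch; the temperature and mixing weight are conventional defaults, not values from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Mix soft teacher targets with the hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2                    # standard KD gradient scaling
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```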
arXiv Detail & Related papers (2021-07-18T02:33:24Z)
- Sparse-Push: Communication- & Energy-Efficient Decentralized Distributed Learning over Directed & Time-Varying Graphs with non-IID Datasets [2.518955020930418]
We propose Sparse-Push, a communication-efficient decentralized distributed training algorithm.
The proposed algorithm enables 466x reduction in communication with only 1% degradation in performance.
We also demonstrate how communication compression can lead to significant performance degradation in the case of non-IID datasets.
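A sketch of the sparsification idea: before pushing an update to out-neighbors, keep only the top-k coordinates by magnitude and fold the dropped remainder into a local residual (error feedback). The residual scheme is a common recipe; whether Sparse-Push uses exactly this variant is an assumption.

```python
import numpy as np

def sparsify_topk(update: np.ndarray, residual: np.ndarray, k: int):
    """Return a k-sparse update plus the residual to carry forward."""
    full = update + residual                 # fold in past truncation error
    keep = np.argpartition(np.abs(full), -k)[-k:]
    sparse = np.zeros_like(full)
    sparse[keep] = full[keep]
    new_residual = full - sparse             # remember what was dropped
    return sparse, new_residual
```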
arXiv Detail & Related papers (2021-02-10T19:41:11Z)
- Training Recommender Systems at Scale: Communication-Efficient Model and Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least $100\times$ and $20\times$ during data-parallel (DP) and model-parallel (MP) training, respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
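A minimal sketch of threshold-based compression: transmit only entries whose magnitude clears a threshold chosen so that roughly a target fraction survives. Deriving the threshold from a quantile each step is an illustrative choice, not necessarily how DCT sets it.

```python
import numpy as np

def dct_compress(values: np.ndarray, keep_fraction: float = 0.01):
    """Keep roughly keep_fraction of entries; send indices and values."""
    threshold = np.quantile(np.abs(values), 1.0 - keep_fraction)
    mask = np.abs(values) >= threshold
    return np.flatnonzero(mask), values[mask]
```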
arXiv Detail & Related papers (2020-10-18T01:44:42Z)
- Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge Computing [113.52575069030192]
Big data, including data from applications with high security requirements, are often collected and stored on multiple heterogeneous devices, such as mobile devices, drones and vehicles.
Due to the limitations of communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data to a fusion center.
We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes.
A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model.
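A sketch of one decentralized consensus-ADMM iteration: each edge node alternates an approximate, mini-batch primal update on its local loss with a dual update that pulls it toward its neighbors' average. Using a single gradient step in place of the exact primal minimization is a simplifying assumption.

```python
import numpy as np

def admm_step(x, dual, neighbor_params, grad_fn, rho=1.0, lr=0.1):
    """One consensus-ADMM-style step for a single node.

    x: local parameters; dual: dual variable enforcing consensus;
    neighbor_params: parameter copies received from neighbors;
    grad_fn: mini-batch gradient of the local loss."""
    avg_nbr = np.mean(neighbor_params, axis=0)
    # Approximate primal update: gradient step on the local loss plus
    # the augmented-Lagrangian penalty toward the neighborhood average.
    x = x - lr * (grad_fn(x) + dual + rho * (x - avg_nbr))
    dual = dual + rho * (x - avg_nbr)       # dual ascent toward consensus
    return x, dual
```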
arXiv Detail & Related papers (2020-10-02T10:41:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.