DS-Sync: Addressing Network Bottlenecks with Divide-and-Shuffle
Synchronization for Distributed DNN Training
- URL: http://arxiv.org/abs/2007.03298v2
- Date: Thu, 13 Jan 2022 03:12:48 GMT
- Title: DS-Sync: Addressing Network Bottlenecks with Divide-and-Shuffle
Synchronization for Distributed DNN Training
- Authors: Weiyan Wang, Cengguang Zhang, Liu Yang, Kai Chen, Kun Tan
- Abstract summary: We present a novel divide-and-shuffle synchronization (DS-Sync) to realize communication efficiency without sacrificing convergence accuracy for distributed DNN training.
We show that DS-Sync can achieve up to $94\%$ improvement in end-to-end training time over existing solutions while maintaining the same accuracy.
- Score: 15.246142393381488
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bulk synchronous parallel (BSP) is the de-facto paradigm for distributed DNN
training in today's production clusters. However, due to the global
synchronization nature, its performance can be significantly influenced by
network bottlenecks caused by either static topology heterogeneity or dynamic
bandwidth contentions. Existing solutions, either system-level optimizations
strengthening BSP (e.g., Ring or Hierarchical All-reduce) or algorithmic
optimizations replacing BSP (e.g., ASP or SSP, which relax the global
barriers), do not completely solve the problem, as they may still suffer from
communication inefficiency or risk convergence inaccuracy.
In this paper, we present a novel divide-and-shuffle synchronization
(DS-Sync) to realize communication efficiency without sacrificing convergence
accuracy for distributed DNN training. At its heart, by taking into account the
network bottlenecks, DS-Sync improves communication efficiency by dividing
workers into non-overlapping groups that synchronize independently in a
bottleneck-free manner. Meanwhile, it maintains convergence accuracy by
iteratively shuffling workers among different groups to ensure a global
consensus. We theoretically prove that DS-Sync converges properly in the
non-convex and smooth setting typical of DNNs. We further implement DS-Sync and integrate it
with PyTorch, and our testbed experiments show that DS-Sync can achieve up to
$94\%$ improvement in end-to-end training time over existing solutions
while maintaining the same accuracy.
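To make the divide-and-shuffle step concrete, here is a minimal PyTorch-style sketch of one way to realize it: each iteration, every worker shuffles the rank list with a shared seed, splits it into non-overlapping groups, and all-reduces gradients only inside its own group. The helper names (`make_groups`, `ds_sync_step`) and the seed-per-step shuffle are illustrative assumptions; the paper's actual implementation additionally chooses groups around network bottlenecks.

```python
# Minimal sketch of divide-and-shuffle synchronization, assuming a standard
# torch.distributed setup. `make_groups` and `ds_sync_step` are illustrative
# names, not the paper's implementation, which also accounts for network
# bottlenecks when forming groups.
import torch
import torch.distributed as dist

def make_groups(world_size: int, group_size: int, seed: int):
    """Shuffle all ranks with a seed shared by every worker, then split the
    permutation into non-overlapping groups of `group_size`."""
    perm = torch.randperm(world_size,
                          generator=torch.Generator().manual_seed(seed)).tolist()
    parts = [perm[i:i + group_size] for i in range(0, world_size, group_size)]
    # new_group must be called by all processes with identical arguments.
    return [(ranks, dist.new_group(ranks)) for ranks in parts]

def ds_sync_step(model, step: int, world_size: int, group_size: int):
    """Average gradients only inside this worker's current group; using the
    step index as the shuffle seed changes group membership every iteration,
    which is what spreads information globally over time."""
    rank = dist.get_rank()
    for ranks, group in make_groups(world_size, group_size, seed=step):
        if rank in ranks:
            for p in model.parameters():
                if p.grad is not None:
                    dist.all_reduce(p.grad, group=group)  # defaults to SUM
                    p.grad /= len(ranks)
```

Each step then costs an intra-group all-reduce instead of a global one, and the per-iteration reshuffle is what lets the analysis recover a global consensus; a production version would cache the process groups rather than rebuild them every step.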
Related papers
- Stragglers-Aware Low-Latency Synchronous Federated Learning via Layer-Wise Model Updates [71.81037644563217]
Synchronous federated learning (FL) is a popular paradigm for collaborative edge learning.
As some of the devices may have limited computational resources and varying availability, FL latency is highly sensitive to stragglers.
We propose straggler-aware layer-wise federated learning (SALF) that leverages the optimization procedure of NNs via backpropagation to update the global model in a layer-wise fashion.
arXiv Detail & Related papers (2024-03-27T09:14:36Z)
- Shadowheart SGD: Distributed Asynchronous SGD with Optimal Time Complexity Under Arbitrary Computation and Communication Heterogeneity [85.92481138826949]
We develop a new method-Shadowheart SGD-that provably improves the time complexities of all previous centralized methods.
We also consider the bidirectional setup, where broadcasting from the server to the workers is non-negligible, and develop a corresponding method.
arXiv Detail & Related papers (2024-02-07T12:15:56Z) - Robust Fully-Asynchronous Methods for Distributed Training over General Architecture [11.480605289411807]
Perfect synchronization in distributed machine learning is inefficient and even impossible due to latency, packet losses, and stragglers.
We propose a Robust Fully-Asynchronous Stochastic Gradient Tracking method (R-FAST), where each device performs local computation and communication at its own pace, without any form of global synchronization.
arXiv Detail & Related papers (2023-07-21T14:36:40Z)
- Accelerating Distributed ML Training via Selective Synchronization [0.0]
SelSync is a practical, low-overhead method for DNN training that dynamically chooses to incur or avoid communication at each step (the sketch after this list illustrates the idea).
Our system converges to the same or better accuracy than BSP while reducing training time by up to 14$\times$.
arXiv Detail & Related papers (2023-07-16T05:28:59Z)
- Semi-Synchronous Personalized Federated Learning over Mobile Edge Networks [88.50555581186799]
We propose a semi-synchronous PFL algorithm, termed Semi-Synchronous Personalized Federated Averaging (PerFedS$^2$), over mobile edge networks.
We derive an upper bound on the convergence rate of PerFedS$^2$ in terms of the number of participants per global round and the number of rounds.
Experimental results verify the effectiveness of PerFedS$^2$ in saving training time as well as guaranteeing the convergence of the training loss.
arXiv Detail & Related papers (2022-09-27T02:12:43Z)
- Receptive Field-based Segmentation for Distributed CNN Inference Acceleration in Collaborative Edge Computing [93.67044879636093]
We study inference acceleration using distributed convolutional neural networks (CNNs) in a collaborative edge computing network.
We propose a novel collaborative edge computing scheme that uses fused-layer parallelization to partition a CNN model into multiple blocks of convolutional layers.
arXiv Detail & Related papers (2022-07-22T18:38:11Z)
- Edge Continual Learning for Dynamic Digital Twins over Wireless Networks [68.65520952712914]
Digital twins (DTs) constitute a critical link between the real world and the metaverse.
In this paper, a novel edge continual learning framework is proposed to accurately model the evolving affinity between a physical twin and its corresponding cyber twin (CT).
The proposed framework achieves a simultaneously accurate and synchronous CT model that is robust to catastrophic forgetting.
arXiv Detail & Related papers (2022-04-10T23:25:37Z)
- Locally Asynchronous Stochastic Gradient Descent for Decentralised Deep Learning [0.0]
Local Asynchronous SGD (LASGD) is an asynchronous decentralized algorithm that relies on All-reduce for model synchronization.
We empirically validate LASGD's performance on image classification tasks on the ImageNet dataset.
arXiv Detail & Related papers (2022-03-24T14:25:15Z)
- Sync-Switch: Hybrid Parameter Synchronization for Distributed Deep Learning [10.196574441542646]
Stochastic Gradient Descent (SGD) has become the de facto way to train deep neural networks in distributed clusters.
A critical factor in determining the training throughput and model accuracy is the choice of the parameter synchronization protocol.
In this paper, we design a hybrid synchronization approach that exploits the benefits of both BSP and ASP (also illustrated in the sketch after this list).
arXiv Detail & Related papers (2021-04-16T20:49:28Z)
- Accelerating Neural Network Training with Distributed Asynchronous and Selective Optimization (DASO) [0.0]
We introduce the Distributed Asynchronous and Selective Optimization (DASO) method to accelerate network training.
DASO uses a hierarchical and asynchronous communication scheme comprised of node-local and global networks.
We show that DASO yields a reduction in training time of up to 34% on classical and state-of-the-art networks.
arXiv Detail & Related papers (2021-04-12T16:02:20Z)
- Asynchronous Decentralized Learning of a Neural Network [49.15799302636519]
We exploit an asynchronous computing framework, namely ARock, to learn a deep neural network called the self-size estimating feedforward neural network (SSFN) in a decentralized scenario.
Asynchronous decentralized SSFN relaxes the communication bottleneck by allowing one node activation and one-sided communication, which reduces the communication overhead significantly.
We compare asynchronous dSSFN with traditional synchronous dSSFN, and the experimental results show the competitive performance of asynchronous dSSFN, especially when the communication network is sparse.
arXiv Detail & Related papers (2020-04-10T15:53:37Z)
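Several entries above (notably SelSync and Sync-Switch) share one core move: at each step, either pay for global BSP-style gradient averaging or apply a purely local update. The following is a minimal sketch of that decision under assumed names; the gradient-norm trigger and the `threshold` parameter are illustrative stand-ins, not the actual policies of either paper.

```python
# Illustrative sketch of selective synchronization in the spirit of
# SelSync / Sync-Switch. The norm-based trigger and `threshold` are
# assumptions for exposition, not either paper's actual policy.
import torch
import torch.distributed as dist

def selective_sync_step(model, optimizer, threshold: float):
    # Scalar summary of how "significant" this step's gradient is.
    sq = sum(p.grad.pow(2).sum() for p in model.parameters()
             if p.grad is not None)
    grad_norm = sq.sqrt()
    # All workers must make the same choice; reducing the scalar norm
    # (MAX) keeps the decision consistent across the cluster.
    dist.all_reduce(grad_norm, op=dist.ReduceOp.MAX)
    if grad_norm.item() > threshold:
        # Significant update: fall back to BSP-style global averaging.
        world = dist.get_world_size()
        for p in model.parameters():
            if p.grad is not None:
                dist.all_reduce(p.grad)  # defaults to SUM
                p.grad /= world
    # Otherwise the step stays local and communication is skipped entirely.
    optimizer.step()
```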
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.