DeAR: Accelerating Distributed Deep Learning with Fine-Grained
All-Reduce Pipelining
- URL: http://arxiv.org/abs/2302.12445v2
- Date: Thu, 15 Jun 2023 06:19:25 GMT
- Title: DeAR: Accelerating Distributed Deep Learning with Fine-Grained
All-Reduce Pipelining
- Authors: Lin Zhang, Shaohuai Shi, Xiaowen Chu, Wei Wang, Bo Li, Chengjian Liu
- Abstract summary: Communication scheduling has been shown to be effective in accelerating distributed training.
We propose a novel scheduling algorithm, DeAR, that decouples the all-reduce primitive into two continuous operations.
We show that DeAR achieves up to 83% and 15% training speedup over the state-of-the-art solutions.
- Score: 22.168137965177284
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Communication scheduling has been shown to be effective in accelerating
distributed training, which enables all-reduce communications to be overlapped
with backpropagation computations. This has been commonly adopted in popular
distributed deep learning frameworks. However, there exist two fundamental
problems: (1) excessive startup latency proportional to the number of workers
for each all-reduce operation; (2) only sub-optimal training performance can be
achieved due to the dependency and synchronization requirement of the
feed-forward computation in the next iteration. We propose a novel scheduling
algorithm, DeAR, that decouples the all-reduce primitive into two continuous
operations, which overlap with both backpropagation and feed-forward
computations without extra communication. We further design a practical tensor
fusion algorithm to improve the training performance. Experimental results with
five popular models show that DeAR achieves up to 83% and 15% training speedup
over the state-of-the-art solutions on a 64-GPU cluster with 10Gb/s Ethernet
and 100Gb/s InfiniBand interconnects, respectively.
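The abstract does not name the two operations, but the standard way to decompose an all-reduce is a reduce-scatter followed by an all-gather; under that assumption, the first half can be launched as soon as a layer's gradient is produced during backpropagation, and the second half can be deferred until that layer's parameters are needed by the next iteration's forward pass. The PyTorch sketch below illustrates this decoupling; the function names and padding scheme are illustrative, not DeAR's implementation, and the tensor fusion step is omitted.

```python
# A minimal sketch (not the authors' code) of decoupling all-reduce into
# reduce-scatter + all-gather so the two halves can be overlapped with different
# phases of training. Function names are illustrative assumptions, not DeAR's API.
# Assumes torch.distributed has already been initialized (dist.init_process_group).
import torch
import torch.distributed as dist


def start_reduce_scatter(grad: torch.Tensor, world_size: int):
    """Launch the first half right after backprop produces this gradient."""
    flat = grad.contiguous().view(-1)
    pad = (-flat.numel()) % world_size            # pad so chunks split evenly
    if pad:
        flat = torch.cat([flat, flat.new_zeros(pad)])
    chunks = list(flat.chunk(world_size))
    shard = torch.empty_like(chunks[0])           # this rank's reduced shard
    handle = dist.reduce_scatter(shard, chunks, op=dist.ReduceOp.SUM, async_op=True)
    return shard, handle


def finish_all_gather(shard: torch.Tensor, handle, grad: torch.Tensor,
                      world_size: int) -> None:
    """Launch the second half just before the parameter's next forward use."""
    handle.wait()                                    # reduce-scatter must complete
    gathered = [torch.empty_like(shard) for _ in range(world_size)]
    dist.all_gather(gathered, shard)
    full = torch.cat(gathered)[: grad.numel()]       # drop padding
    grad.copy_(full.view_as(grad).div_(world_size))  # averaged gradient in place
```

In a full training loop, start_reduce_scatter would typically be registered as a per-layer backward hook and finish_all_gather invoked from a pre-forward hook, so each half hides behind computation; DeAR's tensor fusion would additionally merge small gradients into larger buffers before the reduce-scatter to amortize the per-operation startup latency.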
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping [14.435637320909663]
The MoE technique plays a crucial role in expanding the size of DNN model parameters.
Existing methods attempt to mitigate this issue by overlapping all-to-all with expert computation.
In our study, we extend the scope of this challenge by considering overlap at the broader training graph level.
We implement these techniques in Lancet, a system using compiler-based optimization to automatically enhance MoE model training.
arXiv Detail & Related papers (2024-04-30T10:17:21Z)
- Accelerating Distributed Deep Learning using Lossless Homomorphic Compression [17.654138014999326]
We introduce a novel compression algorithm that effectively merges worker-level compression with in-network aggregation.
We show up to a 6.33x improvement in aggregation throughput and a 3.74x increase in per-iteration training speed.
arXiv Detail & Related papers (2024-02-12T09:57:47Z)
- TAMUNA: Doubly Accelerated Distributed Optimization with Local Training, Compression, and Partial Participation [53.84175614198885]
In distributed optimization and learning, several machines alternate between local computations in parallel and communication with a distant server.
We propose TAMUNA, the first distributed optimization algorithm that jointly leverages the two strategies of local training and compression while allowing for partial participation.
arXiv Detail & Related papers (2023-02-20T08:37:44Z)
- Provably Doubly Accelerated Federated Learning: The First Theoretically Successful Combination of Local Training and Compressed Communication [7.691755449724637]
We propose the first algorithm for distributed optimization and federated learning that provably benefits from combining local training with compressed communication.
Our algorithm converges linearly to an exact solution, with a doubly accelerated rate.
arXiv Detail & Related papers (2022-10-24T14:13:54Z)
- Collaborative Learning over Wireless Networks: An Introductory Overview [84.09366153693361]
We will mainly focus on collaborative training across wireless devices.
Many distributed optimization algorithms have been developed over the last decades.
They provide data locality; that is, a joint model can be trained collaboratively while the data available at each participating device remains local.
arXiv Detail & Related papers (2021-12-07T20:15:39Z)
- Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models [2.6599014990168834]
Distributed training is a solution to reduce training time by splitting the task across multiple NPUs.
Themis is a novel collective scheduling scheme that dynamically schedules collectives to balance the communication loads across all dimensions.
Our results show that on average, Themis can improve the network BW utilization of a single All-Reduce by 1.88x (2.92x max).
arXiv Detail & Related papers (2021-10-09T06:50:04Z)
- Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks [13.552262050816616]
Kronecker-Factored Approximate Curvature (KFAC) is one of the most efficient approximation algorithms for training deep models.
Yet, when leveraging GPU clusters to train models with KFAC, it incurs extensive computation and introduces extra communication during each iteration.
We propose D-KFAC with smart parallelism of computing and communication tasks to reduce the iteration time.
arXiv Detail & Related papers (2021-07-14T08:01:07Z)
- Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
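For context, the synchronous data-parallel baseline described above can be sketched in a few lines of PyTorch; this is an illustrative outline, not the paper's code, and it assumes the process group is already initialized.

```python
# Synchronous data-parallel SGD step: each worker computes local gradients,
# sums them across workers with all-reduce, averages, and applies the same
# update everywhere. Illustrative only; not the paper's implementation.
import torch
import torch.distributed as dist

def synchronous_sgd_step(model: torch.nn.Module, loss: torch.Tensor,
                         optimizer: torch.optim.Optimizer) -> None:
    optimizer.zero_grad()
    loss.backward()                                   # local gradients
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)                   # average across workers
    optimizer.step()                                  # identical update on all ranks
```

Compression schemes such as those studied in the paper would replace the dense all-reduce above with communication over sparsified or quantized gradients.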
arXiv Detail & Related papers (2020-09-19T17:28:11Z)
- Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study distributed stochastic AUC maximization at large scale with a deep neural network as the predictive model.
Our method requires a much smaller number of communication rounds in theory.
Our experiments on several datasets demonstrate its effectiveness and confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)