Communication-Efficient Distributed Deep Learning: A Comprehensive
Survey
- URL: http://arxiv.org/abs/2003.06307v2
- Date: Fri, 1 Sep 2023 11:18:38 GMT
- Title: Communication-Efficient Distributed Deep Learning: A Comprehensive
Survey
- Authors: Zhenheng Tang, Shaohuai Shi, Wei Wang, Bo Li, Xiaowen Chu
- Abstract summary: We provide a comprehensive survey of the communication-efficient distributed training algorithms.
We first propose a taxonomy of data-parallel distributed training algorithms.
We then investigate state-of-the-art studies that address problems in these four dimensions.
- Score: 22.42450750097714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Driven by ever-larger models and datasets, distributed deep learning (DL) has become prevalent in recent years as a way to reduce training time by leveraging multiple computing devices (e.g., GPUs/TPUs). However, communication has become the performance bottleneck that limits system scalability. Addressing this
communication issue has become a prominent research topic. In this paper, we
provide a comprehensive survey of the communication-efficient distributed
training algorithms, focusing on both system-level and algorithmic-level
optimizations. We first propose a taxonomy of data-parallel distributed
training algorithms that incorporates four primary dimensions: communication
synchronization, system architectures, compression techniques, and parallelism
of communication and computing tasks. We then investigate state-of-the-art
studies that address problems in these four dimensions. We also compare the
convergence rates of different algorithms to understand their convergence
speed. Additionally, we conduct extensive experiments to empirically compare
the convergence performance of various mainstream distributed training
algorithms. Based on our system-level communication cost analysis, theoretical
and experimental convergence speed comparison, we provide readers with an
understanding of which algorithms are more efficient under specific distributed
environments. Our research also extrapolates potential directions for further
optimizations.
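To make the compression dimension of the taxonomy concrete, below is a minimal, self-contained sketch (not code from the surveyed paper) of one widely used technique in that dimension: top-k gradient sparsification with error feedback across simulated data-parallel workers. All names, constants, and the toy objective are illustrative assumptions rather than details of any specific surveyed algorithm.

```python
# Illustrative sketch only (NOT the survey's code): top-k gradient
# sparsification with error feedback, one of the compression techniques
# the taxonomy covers, simulated for a few data-parallel workers on a
# toy quadratic objective. All names and constants are assumptions.
import numpy as np

NUM_WORKERS = 4   # simulated data-parallel workers
DIM = 1000        # size of the flattened parameter vector
K = 10            # gradient entries each worker actually communicates
LR = 0.1          # learning rate
STEPS = 50        # training iterations

rng = np.random.default_rng(0)
w = np.zeros(DIM)                          # shared model replica
residuals = np.zeros((NUM_WORKERS, DIM))   # per-worker error-feedback memory


def local_gradient(w):
    """Stand-in for a worker's stochastic gradient on its local data shard."""
    target = np.ones(DIM)                   # toy objective: ||w - 1||^2 / 2
    noise = 0.01 * rng.standard_normal(DIM)
    return (w - target) + noise


def top_k_compress(g, k):
    """Keep only the k largest-magnitude entries of g; zero out the rest."""
    idx = np.argpartition(np.abs(g), -k)[-k:]
    sparse = np.zeros_like(g)
    sparse[idx] = g[idx]
    return sparse


for step in range(STEPS):
    compressed = []
    for i in range(NUM_WORKERS):
        g = local_gradient(w) + residuals[i]   # add back past compression error
        sparse = top_k_compress(g, K)          # only K values would be sent
        residuals[i] = g - sparse              # remember what was dropped
        compressed.append(sparse)
    # Aggregation step: in a real system this would be an all-reduce or a
    # parameter-server exchange of only the K (index, value) pairs per worker.
    avg_grad = np.mean(compressed, axis=0)
    w -= LR * avg_grad

print("distance to optimum after training:", np.linalg.norm(w - 1.0))
```

In a real system the averaging step exchanges only the selected sparse entries, which is where the communication savings come from; the error-feedback buffers prevent the discarded gradient mass from being lost across iterations.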
Related papers
- Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey [43.57122822150023]
This article surveys the literature on algorithms and technologies aimed at achieving efficient communication in large-scale distributed deep learning.
We first introduce efficient algorithms for model synchronization and communication data compression in the context of large-scale distributed training.
Next, we introduce efficient strategies related to resource allocation and task scheduling for use in distributed training and inference.
arXiv Detail & Related papers (2024-04-09T08:35:04Z)
- Asynchronous Local Computations in Distributed Bayesian Learning [8.516532665507835]
We propose gossip-based communication to leverage fast computations and reduce communication overhead simultaneously.
We observe faster initial convergence and improved accuracy, especially in the low-data regime.
We achieve on average 78% and over 90% classification accuracy respectively on the Gamma Telescope and mHealth data sets from the UCI ML repository.
arXiv Detail & Related papers (2023-11-06T20:11:41Z)
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST).
IST is a recently proposed and highly effective technique for reducing the communication and memory costs of training large models in a distributed fashion.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- On the Convergence of Distributed Stochastic Bilevel Optimization Algorithms over a Network [55.56019538079826]
Bilevel optimization has been applied to a wide variety of machine learning models.
Most existing algorithms are restricted to the single-machine setting and are incapable of handling distributed data.
We develop novel decentralized bilevel optimization algorithms based on a gradient tracking communication mechanism and two different gradient estimators.
arXiv Detail & Related papers (2022-06-30T05:29:52Z)
- AsySQN: Faster Vertical Federated Learning Algorithms with Better Computation Resource Utilization [159.75564904944707]
We propose an asynchronous quasi-Newton (AsySQN) framework for vertical federated learning (VFL).
The proposed algorithms take descent steps scaled by approximate Hessian information without calculating the inverse Hessian matrix explicitly.
We show that the adopted asynchronous computation can make better use of the computation resource.
arXiv Detail & Related papers (2021-09-26T07:56:10Z)
- A Quantitative Survey of Communication Optimizations in Distributed Deep Learning [19.514207840069616]
Large and complex deep learning (DL) models are increasingly trained in a distributed manner across multiple worker machines.
Extensive communications between workers pose serious scaling problems.
We present a quantitative survey of communication optimization techniques for data parallel distributed DL.
arXiv Detail & Related papers (2020-05-27T09:12:48Z)
- Scaling-up Distributed Processing of Data Streams for Machine Learning [10.581140430698103]
This paper reviews recently developed methods that focus on large-scale distributed optimization in the compute- and bandwidth-limited regime.
It focuses on methods that solve: (i) distributed convex problems, and (ii) distributed principal component analysis, which is a nonconvex problem with geometric structure that permits global convergence.
arXiv Detail & Related papers (2020-05-18T16:28:54Z)
- Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study distributed algorithms for large-scale AUC maximization with a deep neural network as the predictive model.
Our method requires far fewer communication rounds than naive parallel baselines while retaining comparable theoretical guarantees.
Our experiments on several benchmark datasets demonstrate the effectiveness of our method and confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
- Understanding the Effects of Data Parallelism and Sparsity on Neural Network Training [126.49572353148262]
We study two factors in neural network training: data parallelism and sparsity.
Despite their promising benefits, understanding of their effects on neural network training remains elusive.
arXiv Detail & Related papers (2020-03-25T10:49:22Z) - Distributed Learning in the Non-Convex World: From Batch to Streaming
Data, and Beyond [73.03743482037378]
Distributed learning has become a critical direction of the massively connected world envisioned by many.
This article discusses four key elements of scalable distributed processing and real-time data computation problems.
Practical issues and future research will also be discussed.
arXiv Detail & Related papers (2020-01-14T14:11:32Z)