A Quantitative Survey of Communication Optimizations in Distributed Deep
Learning
- URL: http://arxiv.org/abs/2005.13247v2
- Date: Sat, 7 Nov 2020 07:05:33 GMT
- Title: A Quantitative Survey of Communication Optimizations in Distributed Deep
Learning
- Authors: Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Chengjian Liu, Wei Wang, Bo
Li
- Abstract summary: Large and complex deep learning (DL) models are increasingly trained in a distributed manner across multiple worker machines.
Extensive communications between workers pose serious scaling problems.
We present a quantitative survey of communication optimization techniques for data parallel distributed DL.
- Score: 19.514207840069616
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Nowadays, large and complex deep learning (DL) models are increasingly
trained in a distributed manner across multiple worker machines, in which
extensive communications between workers pose serious scaling problems. In this
article, we present a quantitative survey of communication optimization
techniques for data parallel distributed DL. We first identify the major
communication challenges and classify the existing solutions into three levels,
namely the learning algorithm, the system architecture, and the network
infrastructure. We present the state-of-the-art communication optimization
techniques and conduct a comparative study of seven common lossless distributed
DL methods on a 32-GPU cluster with 100Gbps InfiniBand (IB). We show that (1)
the DL models with low model intensity (such as BERT and BERT-Large) are
difficult to scale out even with the best available lossless algorithm over
100Gbps IB; (2) the system architecture and scheduling algorithms have a
critical impact on the scaling property. We conclude the article with
discussions on the open issues for further investigations.
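For context, the sketch below shows where the communication in synchronous data-parallel training happens: each worker computes gradients on its local shard of the mini-batch, and the gradients are then averaged with an all-reduce. It is a minimal illustrative example in PyTorch, not code from the paper, and assumes a torchrun-style multi-process launch; the model, loss, and data are placeholders.

```python
# Minimal sketch of synchronous data-parallel SGD with an explicit gradient
# all-reduce; this per-tensor all-reduce traffic is what the surveyed
# communication optimizations try to reduce, overlap, or accelerate.
import torch
import torch.distributed as dist

def train_step(model, loss_fn, batch, optimizer):
    """One synchronous data-parallel step on this worker's shard of the batch."""
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)
    loss.backward()                                  # local gradients only

    world_size = dist.get_world_size()
    for p in model.parameters():                     # communication: one all-reduce per tensor
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)                  # average gradients across workers

    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

if __name__ == "__main__":
    # Assumes a torchrun-style launch that sets RANK, WORLD_SIZE and MASTER_ADDR.
    dist.init_process_group(backend="gloo")
    model = torch.nn.Linear(32, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    batch = (torch.randn(16, 32), torch.randn(16, 1))
    print(train_step(model, torch.nn.MSELoss(), batch, opt))
```

In practice, frameworks bucket these per-tensor all-reduces and overlap them with back-propagation; how well this overlap is scheduled is one of the architectural factors the survey finds critical to scaling.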
Related papers
- Local Methods with Adaptivity via Scaling [71.11111992280566]
This paper aims to merge the local training technique with the adaptive approach to develop efficient distributed learning methods.
We consider the classical Local SGD method and enhance it with a scaling feature.
In addition to theoretical analysis, we validate the performance of our methods in practice by training a neural network.
arXiv Detail & Related papers (2024-06-02T19:50:05Z)
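For reference, here is a minimal single-process simulation of the classical Local SGD baseline on a toy least-squares problem. It is an illustrative sketch, not the paper's code, and omits the adaptive scaling feature the paper adds: each simulated worker runs several local SGD steps on its own shard, and only the periodic model averaging at the end of each round would require communication in a real deployment.

```python
# Single-process simulation of classical Local SGD (periodic model averaging).
import numpy as np

def local_sgd(workers_data, w0, lr=0.1, local_steps=8, rounds=20):
    """workers_data: list of (X, y) shards, one per simulated worker."""
    w_global = w0.copy()
    for _ in range(rounds):
        local_models = []
        for X, y in workers_data:                        # each worker trains independently
            w = w_global.copy()
            for _ in range(local_steps):                 # H local steps, no communication
                grad = 2.0 * X.T @ (X @ w - y) / len(y)  # least-squares gradient
                w -= lr * grad
            local_models.append(w)
        w_global = np.mean(local_models, axis=0)         # one communication round: model averaging
    return w_global

# toy usage: 4 simulated workers sharing a linear ground truth
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
shards = []
for _ in range(4):
    X = rng.normal(size=(64, 5))
    shards.append((X, X @ w_true + 0.01 * rng.normal(size=64)))
w_est = local_sgd(shards, w0=np.zeros(5))
print(np.linalg.norm(w_est - w_true))
```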
- Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey [43.57122822150023]
This article surveys the literature on algorithms and technologies aimed at achieving efficient communication in large-scale distributed deep learning.
We first introduce efficient algorithms for model synchronization and communication data compression in the context of large-scale distributed training.
Next, we introduce efficient strategies related to resource allocation and task scheduling for use in distributed training and inference.
arXiv Detail & Related papers (2024-04-09T08:35:04Z)
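One widely used lossy primitive behind the communication data compression mentioned above is top-k gradient sparsification: each worker transmits only the k largest-magnitude gradient entries together with their indices. The sketch below is a generic illustration under that assumption and is not code from the survey.

```python
# Hedged sketch of top-k gradient sparsification, a common compression primitive.
import math
import torch

def topk_compress(grad: torch.Tensor, ratio: float = 0.01):
    """Keep only the `ratio` fraction of largest-magnitude entries."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)     # positions of the largest entries
    return flat[indices], indices, grad.shape  # send ~ratio of the original volume

def topk_decompress(values, indices, shape):
    flat = torch.zeros(math.prod(shape), dtype=values.dtype)
    flat[indices] = values                     # scatter the received entries back
    return flat.reshape(shape)

# toy usage
g = torch.randn(1024, 256)
vals, idx, shp = topk_compress(g, ratio=0.01)
g_hat = topk_decompress(vals, idx, shp)
```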
- Multi Agent DeepRL based Joint Power and Subchannel Allocation in IAB networks [0.0]
Integrated Access and Backhaul (IAB) is a viable approach for meeting the unprecedented need for higher data rates in future generations of networks.
In this paper, we show how a Deep Q-Learning Network can handle problems with the huge action spaces associated with fractional nodes.
arXiv Detail & Related papers (2023-08-31T21:30:25Z)
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST), a recently proposed and highly effective technique for distributed training of large models.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- Multi-agent Communication with Graph Information Bottleneck under Limited Bandwidth (a position paper) [92.11330289225981]
In many real-world scenarios, communication can be expensive and the bandwidth of the multi-agent system is subject to certain constraints.
Redundant messages that occupy communication resources can block the transmission of informative messages and thus degrade performance.
We propose a novel multi-agent communication module, CommGIB, which effectively compresses the structure information and node information in the communication graph to deal with bandwidth-constrained settings.
arXiv Detail & Related papers (2021-12-20T07:53:44Z)
- Intrinsic Gradient Compression for Federated Learning [3.9215337270154995]
Federated learning enables a large number of clients to jointly train a machine learning model on privately-held data.
One of the largest barriers to wider adoption of federated learning is the communication cost of sending model updates from and to the clients.
arXiv Detail & Related papers (2021-12-05T19:16:54Z)
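A rough back-of-the-envelope sketch makes this communication cost concrete; the model size and client count below are illustrative assumptions, not figures from the paper.

```python
# Illustrative per-round upload cost of sending full fp32 model updates.
num_params = 110_000_000        # e.g. a BERT-Base-sized model (assumed)
bytes_per_param = 4             # fp32
clients_per_round = 100         # assumed number of participating clients

upload_per_client = num_params * bytes_per_param         # ~440 MB per client
total_per_round = upload_per_client * clients_per_round  # ~44 GB uplink per round
print(f"per-client upload: {upload_per_client / 1e6:.0f} MB, "
      f"per-round total: {total_per_round / 1e9:.1f} GB")
```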
- Federated Learning over Wireless IoT Networks with Optimized Communication and Resources [98.18365881575805]
Federated learning (FL), as a paradigm of collaborative learning, has attracted increasing research attention.
It is of interest to investigate fast-responding and accurate FL schemes over wireless systems.
We show that the proposed communication-efficient federated learning framework converges at a strong linear rate.
arXiv Detail & Related papers (2021-10-22T13:25:57Z)
- A Tutorial on Ultra-Reliable and Low-Latency Communications in 6G: Integrating Domain Knowledge into Deep Learning [115.75967665222635]
Ultra-reliable and low-latency communications (URLLC) will be central for the development of various emerging mission-critical applications.
Deep learning algorithms have been considered promising ways of developing enabling technologies for URLLC in future 6G networks.
This tutorial illustrates how domain knowledge can be integrated into different kinds of deep learning algorithms for URLLC.
arXiv Detail & Related papers (2020-09-13T14:53:01Z)
- Communication-Efficient Distributed Deep Learning: A Comprehensive Survey [22.42450750097714]
We provide a comprehensive survey of the communication-efficient distributed training algorithms.
We first propose a taxonomy of data-parallel distributed training algorithms along four dimensions.
We then investigate state-of-the-art studies that address problems in these four dimensions.
arXiv Detail & Related papers (2020-03-10T05:42:44Z)
- Deep Learning for Ultra-Reliable and Low-Latency Communications in 6G Networks [84.2155885234293]
We first summarize how to apply data-driven supervised deep learning and deep reinforcement learning in URLLC.
To address the open problems of these methods, we develop a multi-level architecture that enables device intelligence, edge intelligence, and cloud intelligence for URLLC.
arXiv Detail & Related papers (2020-02-22T14:38:11Z)