A Quantitative Survey of Communication Optimizations in Distributed Deep
Learning
- URL: http://arxiv.org/abs/2005.13247v2
- Date: Sat, 7 Nov 2020 07:05:33 GMT
- Title: A Quantitative Survey of Communication Optimizations in Distributed Deep
Learning
- Authors: Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Chengjian Liu, Wei Wang, Bo
Li
- Abstract summary: Large and complex deep learning (DL) models are increasingly trained in a distributed manner across multiple worker machines.
Extensive communication between workers poses serious scaling problems.
We present a quantitative survey of communication optimization techniques for data-parallel distributed DL.
- Score: 19.514207840069616
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Nowadays, large and complex deep learning (DL) models are increasingly
trained in a distributed manner across multiple worker machines, in which
extensive communications between workers pose serious scaling problems. In this
article, we present a quantitative survey of communication optimization
techniques for data-parallel distributed DL. We first identify the major
communication challenges and classify the existing solutions into three levels,
namely the learning algorithm, the system architecture, and the network
infrastructure. We present the state-of-the-art communication optimization
techniques and conduct a comparative study of seven common lossless distributed
DL methods on a 32-GPU cluster with 100Gbps InfiniBand (IB). We show that (1)
the DL models with low model intensity (such as BERT and BERT-Large) are
difficult to scale out even with the best available lossless algorithm over
100Gbps IB; (2) the system architecture and scheduling algorithms have a
critical impact on scalability. We conclude the article with a discussion of
open issues for further investigation.
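
For context on the communication pattern the survey quantifies, the following is a minimal sketch (not from the article) of synchronous data-parallel SGD: each worker computes gradients on its local batch and then all-reduces them before every update, so the communicated volume per iteration grows with model size while per-worker compute stays fixed, which is why low-model-intensity models such as BERT scale poorly. The toy model, sizes, and the assumed PyTorch/NCCL setup launched via torchrun are illustrative placeholders.

    # Minimal sketch: synchronous data-parallel SGD with lossless gradient all-reduce.
    # Assumptions (not from the paper): PyTorch + NCCL, one process per GPU,
    # launched with `torchrun --nproc_per_node=<num_gpus> this_script.py`.
    import torch
    import torch.distributed as dist

    def train_step(model, loss_fn, batch, target, optimizer, world_size):
        optimizer.zero_grad()
        loss = loss_fn(model(batch), target)
        loss.backward()                                    # local gradient computation
        for p in model.parameters():                       # lossless synchronization:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum gradients across workers
            p.grad /= world_size                           # average -> identical update on all workers
        optimizer.step()
        return loss.item()

    if __name__ == "__main__":
        dist.init_process_group(backend="nccl")            # torchrun provides rank/world-size env vars
        rank, world_size = dist.get_rank(), dist.get_world_size()
        torch.cuda.set_device(rank % torch.cuda.device_count())
        model = torch.nn.Linear(1024, 1024).cuda()         # toy stand-in for a real DNN
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        x = torch.randn(32, 1024, device="cuda")
        y = torch.randn(32, 1024, device="cuda")
        loss = train_step(model, torch.nn.MSELoss(), x, y, optimizer, world_size)
        grad_bytes = 4 * sum(p.numel() for p in model.parameters())  # fp32 gradient volume per step
        if rank == 0:
            print(f"loss={loss:.4f}, ~{grad_bytes / 1e6:.1f} MB of gradients all-reduced per iteration")
        dist.destroy_process_group()

In real systems, the per-parameter all-reduce calls are bucketed and overlapped with back-propagation; this kind of scheduling optimization is among the factors the survey's 32-GPU comparison measures.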
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup over baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Rephrase and Contrast: Fine-Tuning Language Models for Enhanced Understanding of Communication and Computer Networks [13.829525575305206]
This paper introduces the Rephrase and Contrast (RaC) framework for efficient fine-tuning.
RaC enhances LLMs' comprehension and critical thinking abilities by incorporating question reformulation and contrastive analysis.
To efficiently construct the dataset for RaC fine-tuning, we develop a GPT-assisted data mining method for generating high-quality question-answer pairs.
arXiv Detail & Related papers (2024-09-21T16:04:43Z)
- Overlay-based Decentralized Federated Learning in Bandwidth-limited Networks [3.9162099309900835]
Decentralized federated learning (DFL) has the promise of boosting the deployment of artificial intelligence (AI) by directly learning across distributed agents without centralized coordination.
Most existing solutions were based on the simplistic assumption that neighboring agents are physically adjacent in the underlying communication network.
We jointly design the communication demands and the communication schedule for overlay-based DFL in bandwidth-limited networks without requiring explicit cooperation from the underlying network.
arXiv Detail & Related papers (2024-08-08T18:05:11Z)
- Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey [43.57122822150023]
This article surveys the literature on algorithms and technologies aimed at achieving efficient communication in large-scale distributed deep learning.
We first introduce efficient algorithms for model synchronization and communication data compression in the context of large-scale distributed training (a sketch of one representative compression technique appears after this list).
Next, we introduce efficient strategies related to resource allocation and task scheduling for use in distributed training and inference.
arXiv Detail & Related papers (2024-04-09T08:35:04Z)
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST), a recently proposed and highly effective technique for reducing the computational and communication costs of training large models.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- Multi-agent Communication with Graph Information Bottleneck under Limited Bandwidth (a position paper) [92.11330289225981]
In many real-world scenarios, communication can be expensive and the bandwidth of the multi-agent system is subject to certain constraints.
Redundant messages that occupy communication resources can block the transmission of informative messages and thus degrade performance.
We propose a novel multi-agent communication module, CommGIB, which effectively compresses the structure information and node information in the communication graph to deal with bandwidth-constrained settings.
arXiv Detail & Related papers (2021-12-20T07:53:44Z)
- Federated Learning over Wireless IoT Networks with Optimized Communication and Resources [98.18365881575805]
Federated learning (FL), a paradigm of collaborative learning, has attracted increasing research attention.
It is therefore of interest to investigate fast-responding and accurate FL schemes over wireless systems.
We show that the proposed communication-efficient federated learning framework converges at a strong linear rate.
arXiv Detail & Related papers (2021-10-22T13:25:57Z)
- A Tutorial on Ultra-Reliable and Low-Latency Communications in 6G: Integrating Domain Knowledge into Deep Learning [115.75967665222635]
Ultra-reliable and low-latency communications (URLLC) will be central for the development of various emerging mission-critical applications.
Deep learning algorithms have been considered as promising ways of developing enabling technologies for URLLC in future 6G networks.
This tutorial illustrates how domain knowledge can be integrated into different kinds of deep learning algorithms for URLLC.
arXiv Detail & Related papers (2020-09-13T14:53:01Z)
- Communication-Efficient Distributed Deep Learning: A Comprehensive Survey [22.42450750097714]
We provide a comprehensive survey of communication-efficient distributed training algorithms.
We first propose a taxonomy of data-parallel distributed training algorithms along four dimensions.
We then investigate state-of-the-art studies that address problems in these four dimensions.
arXiv Detail & Related papers (2020-03-10T05:42:44Z)
- Deep Learning for Ultra-Reliable and Low-Latency Communications in 6G Networks [84.2155885234293]
We first summarize how to apply data-driven supervised deep learning and deep reinforcement learning in URLLC.
To address the open problems in applying these methods, we develop a multi-level architecture that enables device intelligence, edge intelligence, and cloud intelligence for URLLC.
arXiv Detail & Related papers (2020-02-22T14:38:11Z)
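
As referenced in the "Communication-Efficient Large-Scale Distributed Deep Learning" entry above, several of these related surveys cover communication data compression, which lies outside the lossless comparison in this article but is central to the broader literature. Below is a hedged, self-contained sketch of one representative technique, top-k gradient sparsification with error feedback; the class and parameter names and the 1% ratio are illustrative, not taken from any specific paper.

    # Hedged sketch: top-k gradient sparsification with error feedback, a
    # representative lossy compression technique from the distributed-DL literature.
    import torch

    def topk_compress(grad, ratio=0.01):
        # Keep only the largest-magnitude `ratio` fraction of gradient entries.
        flat = grad.flatten()
        k = max(1, int(flat.numel() * ratio))
        _, idx = torch.topk(flat.abs(), k)
        return flat[idx], idx                              # (values, indices) is what would be sent

    def topk_decompress(values, idx, numel, shape, device):
        out = torch.zeros(numel, dtype=values.dtype, device=device)
        out[idx] = values
        return out.view(shape)

    class ErrorFeedbackTopK:
        # Keep the truncated residual locally so dropped gradient mass is
        # re-injected into later iterations (the standard error-feedback trick).
        def __init__(self, ratio=0.01):
            self.ratio = ratio
            self.residual = {}

        def step(self, name, grad):
            resid = self.residual.get(name, torch.zeros_like(grad))
            corrected = grad + resid
            values, idx = topk_compress(corrected, self.ratio)
            dense = topk_decompress(values, idx, corrected.numel(), corrected.shape, corrected.device)
            self.residual[name] = corrected - dense        # carry the dropped entries forward
            return values, idx

    # Toy usage: compress one layer's gradient before it would be communicated.
    g = torch.randn(1024, 1024)
    compressor = ErrorFeedbackTopK(ratio=0.01)
    values, idx = compressor.step("layer1.weight", g)
    print(f"transmitting {values.numel()} of {g.numel()} gradient entries")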