Is Network the Bottleneck of Distributed Training?
- URL: http://arxiv.org/abs/2006.10103v3
- Date: Wed, 24 Jun 2020 19:23:26 GMT
- Title: Is Network the Bottleneck of Distributed Training?
- Authors: Zhen Zhang, Chaokun Chang, Haibin Lin, Yida Wang, Raman Arora, Xin Jin
- Abstract summary: We take a first-principles approach to measure and analyze the network performance of distributed training.
We find that the network is running at low utilization and that, if the network can be fully utilized, distributed training can achieve a scaling factor close to one.
- Score: 36.925680383195356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently there has been a surge of research on improving the communication
efficiency of distributed training. However, little work has been done to
systematically understand whether the network is the bottleneck and to what
extent.
In this paper, we take a first-principles approach to measure and analyze the
network performance of distributed training. As expected, our measurement
confirms that communication is the component that blocks distributed training
from linear scale-out. However, contrary to the common belief, we find that the
network is running at low utilization and that if the network can be fully
utilized, distributed training can achieve a scaling factor of close to one.
Moreover, while many recent proposals on gradient compression advocate over
100x compression ratios, we show that under full network utilization, there is
no need for gradient compression in a 100 Gbps network. On the other hand, a
lower-speed network such as 10 Gbps requires only a 2x-5x gradient compression
ratio to achieve almost linear scale-out. Compared to application-level
techniques like gradient compression, network-level optimizations do not
require changes to applications and do not hurt the performance of trained
models. As such, we advocate that the real challenge of distributed training is
for the network community to develop high-performance network transport to
fully utilize the network capacity and achieve linear scale-out.
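To make the abstract's argument concrete, below is a back-of-the-envelope sketch of how the scaling factor and the compression ratio needed to hide communication can be estimated. It assumes a ring all-reduce and treats the model size, per-iteration compute time, and worker count as illustrative values, not measurements from the paper.

```python
# Back-of-the-envelope estimate of the scaling factor and of the gradient
# compression ratio needed to approach linear scale-out.
# Model size, compute time, and worker count are illustrative assumptions.

def allreduce_time(model_bytes: float, bandwidth_gbps: float, workers: int) -> float:
    """Ring all-reduce transfer time in seconds, assuming the link is fully utilized."""
    volume_bytes = 2 * (workers - 1) / workers * model_bytes  # traffic per worker
    return volume_bytes * 8 / (bandwidth_gbps * 1e9)

def scaling_factor(t_compute: float, t_comm: float) -> float:
    """Fraction of linear speedup retained when communication is not overlapped."""
    return t_compute / (t_compute + t_comm)

def required_compression(t_compute: float, t_comm: float) -> float:
    """Compression ratio needed so that communication, once fully overlapped
    with computation, no longer dominates an iteration (t_comm <= t_compute)."""
    return max(1.0, t_comm / t_compute)

model_bytes = 100e6 * 4   # hypothetical 100M-parameter model stored in fp32
t_compute = 0.25          # hypothetical per-iteration compute time in seconds
for bw in (10, 100):      # 10 Gbps vs. 100 Gbps links
    t_comm = allreduce_time(model_bytes, bw, workers=8)
    print(f"{bw:>3} Gbps: scaling factor {scaling_factor(t_compute, t_comm):.2f}, "
          f"compression needed {required_compression(t_compute, t_comm):.1f}x")
```

Under these illustrative numbers, the 100 Gbps link needs no compression (ratio 1.0x) while the 10 Gbps link needs only about 2x, consistent with the abstract's claims once the network is fully utilized.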
Related papers
- Distributed Training of Large Graph Neural Networks with Variable Communication Rates [71.7293735221656]
Training Graph Neural Networks (GNNs) on large graphs presents unique challenges due to the large memory and computing requirements.
Distributed GNN training, where the graph is partitioned across multiple machines, is a common approach to training GNNs on large graphs.
We introduce a variable compression scheme for reducing the communication volume in distributed GNN training without compromising the accuracy of the learned model.
arXiv Detail & Related papers (2024-06-25T14:57:38Z)
- Accelerating Distributed Deep Learning using Lossless Homomorphic Compression [17.654138014999326]
We introduce a novel compression algorithm that effectively merges worker-level compression with in-network aggregation.
We show up to a 6.33x improvement in aggregation throughput and a 3.74x increase in per-iteration training speed.
arXiv Detail & Related papers (2024-02-12T09:57:47Z)
- Federated Dynamic Sparse Training: Computing Less, Communicating Less, Yet Learning Better [88.28293442298015]
Federated learning (FL) enables distribution of machine learning workloads from the cloud to resource-limited edge devices.
We develop, implement, and experimentally validate a novel FL framework termed Federated Dynamic Sparse Training (FedDST).
FedDST is a dynamic process that extracts and trains sparse sub-networks from the target full network.
arXiv Detail & Related papers (2021-12-18T02:26:38Z)
- ProgFed: Effective, Communication, and Computation Efficient Federated Learning by Progressive Training [65.68511423300812]
We propose ProgFed, a progressive training framework for efficient and effective federated learning.
ProgFed inherently reduces computation and two-way communication costs while maintaining the strong performance of the final models.
Our results show that ProgFed converges at the same rate as standard training on full models.
arXiv Detail & Related papers (2021-10-11T14:45:00Z)
- Simultaneous Training of Partially Masked Neural Networks [67.19481956584465]
We show that it is possible to train neural networks in such a way that a predefined 'core' subnetwork can be split off from the trained full network with remarkably good performance.
We show that training a Transformer with a low-rank core gives a low-rank model with better performance than training the low-rank model alone.
arXiv Detail & Related papers (2021-06-16T15:57:51Z)
- Learned Gradient Compression for Distributed Deep Learning [16.892546958602303]
Training deep neural networks on large datasets containing high-dimensional data requires a large amount of computation.
A solution to this problem is data-parallel distributed training, where a model is replicated into several computational nodes that have access to different chunks of the data.
This approach, however, entails high communication rates and latency because of the computed gradients that need to be shared among nodes at every iteration.
arXiv Detail & Related papers (2021-03-16T06:42:36Z)
- Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices [5.74369902800427]
Training deep neural networks on large datasets can often be accelerated by using multiple compute nodes.
Running these protocols at scale requires reliable high-speed networking that is only available in dedicated clusters.
We propose Moshpit All-Reduce -- an iterative averaging protocol that exponentially converges to the global average.
arXiv Detail & Related papers (2021-03-04T18:58:05Z)
- Efficient Distributed Auto-Differentiation [22.192220404846267]
Gradient-based algorithms for training large deep neural networks (DNNs) are communication-heavy.
We introduce a surprisingly simple statistic for training distributed DNNs that is more communication-friendly than the gradient.
The process provides the flexibility of averaging gradients during backpropagation, enabling novel flexible training schemas.
arXiv Detail & Related papers (2021-02-18T21:46:27Z)
- An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC); a generic sketch of this style of sparsification appears after this list.
Our evaluation shows SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Topk, and DGC compressors, respectively.
arXiv Detail & Related papers (2021-01-26T13:06:00Z)
- Activation Density driven Energy-Efficient Pruning in Training [2.222917681321253]
We propose a novel pruning method that prunes a network real-time during training.
We obtain exceedingly sparse networks with accuracy comparable to the baseline network.
arXiv Detail & Related papers (2020-02-07T18:34:31Z)
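For context on the compression schemes referenced above (Top-k, DGC, SIDCo), the following is a minimal, generic sketch of top-k (threshold-style) gradient sparsification with error feedback. It is not the implementation from any of the listed papers; PyTorch and the 1% keep ratio are assumptions chosen only for illustration.

```python
import torch

def sparsify_topk(grad: torch.Tensor, ratio: float = 0.01):
    """Keep only the largest-magnitude `ratio` fraction of gradient entries.
    Returns (indices, values) to communicate plus the locally kept residual."""
    k = max(1, int(grad.numel() * ratio))
    flat = grad.flatten()
    _, idx = torch.topk(flat.abs(), k)      # indices of the largest entries
    values = flat[idx]                      # the sparse payload that gets sent
    residual = flat.clone()
    residual[idx] = 0.0                     # dropped entries stay as local error
    return idx, values, residual.view_as(grad)

# Error feedback: fold the residual into the next gradient so that dropped
# updates are eventually applied (common to DGC-style compressors).
grad = torch.randn(1_000_000)
error = torch.zeros_like(grad)
idx, vals, error = sparsify_topk(grad + error, ratio=0.01)  # ~100x fewer values sent
```

With a 1% keep ratio this roughly corresponds to the 100x compression regime discussed in the abstract above; the listed papers differ mainly in how the threshold is chosen and how the residual is handled.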