MergeComp: A Compression Scheduler for Scalable Communication-Efficient
Distributed Training
- URL: http://arxiv.org/abs/2103.15195v1
- Date: Sun, 28 Mar 2021 18:26:55 GMT
- Title: MergeComp: A Compression Scheduler for Scalable Communication-Efficient
Distributed Training
- Authors: Zhuang Wang, Xinyu Wu, T.S. Eugene Ng
- Abstract summary: MergeComp is a compression scheduler to optimize the scalability of communication-efficient distributed training.
It can improve the performance of compression algorithms by up to 3.83x without losing accuracy.
It can even achieve a scaling factor of up to 99% for distributed training over high-speed networks.
- Score: 8.150621147942449
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large-scale distributed training is increasingly becoming communication
bound. Many gradient compression algorithms have been proposed to reduce the
communication overhead and improve scalability. However, it has been observed
that in some cases gradient compression may even harm the performance of
distributed training.
In this paper, we propose MergeComp, a compression scheduler to optimize the
scalability of communication-efficient distributed training. It automatically
schedules the compression operations to optimize the performance of compression
algorithms without the knowledge of model architectures or system parameters.
We have applied MergeComp to nine popular compression algorithms. Our
evaluations show that MergeComp can improve the performance of compression
algorithms by up to 3.83x without losing accuracy. It can even achieve a scaling
factor of up to 99% for distributed training over high-speed networks.
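The scheduling idea can be pictured with a small sketch: instead of compressing every layer's gradient separately, many small gradients are merged into larger buffers and compression is applied per buffer. This is a generic, hypothetical illustration (the bucket size, the bucket_gradients/compress_topk names, and the top-k compressor are assumptions), not MergeComp's actual scheduler, which chooses merge points automatically without knowledge of the model or system.

```python
import numpy as np

def compress_topk(flat, ratio=0.01):
    """Generic top-k sparsifier: keep only the largest-magnitude entries."""
    k = max(1, int(flat.size * ratio))
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]

def bucket_gradients(grads, bucket_bytes=4 * 1024 * 1024):
    """Greedily merge per-layer gradients into buffers of roughly bucket_bytes,
    so compression and communication act on a few large tensors instead of many small ones."""
    buckets, current, current_bytes = [], [], 0
    for g in grads:
        current.append(g)
        current_bytes += g.nbytes
        if current_bytes >= bucket_bytes:
            buckets.append(np.concatenate([x.ravel() for x in current]))
            current, current_bytes = [], 0
    if current:
        buckets.append(np.concatenate([x.ravel() for x in current]))
    return buckets

# Toy example: many small layer gradients merged into a few buckets, then compressed.
layer_grads = [np.random.randn(n).astype(np.float32) for n in (1000, 500, 250000, 3000)]
for b in bucket_gradients(layer_grads, bucket_bytes=512 * 1024):
    idx, vals = compress_topk(b)
    print(f"bucket of {b.size} floats -> {vals.size} values sent")
```

Merging amortizes the per-call overhead of compression and reduces the number of communication operations, which is the kind of effect the scheduler is tuning for.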
Related papers
- LoCo: Low-Bit Communication Adaptor for Large-scale Model Training [63.040522637816906]
Low-bit communication often degrades training quality due to compression information loss.
We propose Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, without compromising training quality.
Experimental results show that across large-scale model training frameworks like Megatron-LM and PyTorch's FSDP, LoCo significantly improves the efficiency of compressed communication.
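The compensation idea is in the spirit of classic error feedback: the error introduced by low-bit compression at one step is carried over and added to the gradient before the next compression. Below is a minimal, generic error-feedback sketch with a sign-based compressor; the names and the compressor are assumptions for illustration, not LoCo's actual adaptor.

```python
import numpy as np

def sign_compress(x):
    """1-bit style compression: keep only the sign, scaled by the mean magnitude."""
    scale = np.mean(np.abs(x))
    return np.sign(x) * scale

def compensated_step(grad, residual):
    """Add the residual left over from the previous compression to the gradient
    before compressing, then remember the new error for the next step."""
    corrected = grad + residual
    compressed = sign_compress(corrected)
    return compressed, corrected - compressed

residual = np.zeros(8, dtype=np.float32)
for step in range(3):
    grad = np.random.randn(8).astype(np.float32)
    compressed, residual = compensated_step(grad, residual)
    print(step, np.round(compressed, 3))
```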
arXiv Detail & Related papers (2024-07-05T13:01:36Z)
- Accelerating Distributed Deep Learning using Lossless Homomorphic Compression [17.654138014999326]
We introduce a novel compression algorithm that effectively merges worker-level compression with in-network aggregation.
We show up to a 6.33$\times$ improvement in aggregation throughput and a 3.74$\times$ increase in per-iteration training speed.
arXiv Detail & Related papers (2024-02-12T09:57:47Z)
- CD-SGD: Distributed Stochastic Gradient Descent with Compression and Delay Compensation [3.0786359925181315]
Communication overhead is the key challenge for distributed training.
The gradient compression technique can greatly alleviate the impact of communication overhead.
However, gradient compression brings in extra cost, which will delay the next training iteration.
arXiv Detail & Related papers (2021-06-21T01:15:12Z)
- Compressed Communication for Distributed Training: Adaptive Methods and System [13.244482588437972]
Communication overhead severely hinders the scalability of distributed machine learning systems.
Recently, there has been a growing interest in using gradient compression to reduce the communication overhead.
In this paper, we first introduce a novel adaptive gradient method with gradient compression.
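As a rough picture of combining an adaptive optimizer with gradient compression, the sketch below feeds unbiased randomly sparsified gradients into a standard Adam update. The compressor and names are assumptions for illustration; the adaptive method actually proposed in the paper is its own construction.

```python
import numpy as np

def rand_sparsify(g, keep_prob=0.1):
    """Unbiased random sparsification: keep each entry with probability keep_prob
    and rescale by 1/keep_prob so the expected value matches the full gradient."""
    mask = np.random.rand(*g.shape) < keep_prob
    return np.where(mask, g / keep_prob, 0.0).astype(g.dtype)

def adam_step(param, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Standard Adam update, here driven by a compressed gradient."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

param = np.zeros(16, dtype=np.float32)
m, v = np.zeros_like(param), np.zeros_like(param)
for t in range(1, 4):
    grad = np.random.randn(16).astype(np.float32)
    param, m, v = adam_step(param, rand_sparsify(grad), m, v, t)
```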
arXiv Detail & Related papers (2021-05-17T13:41:47Z)
- ScaleCom: Scalable Sparsified Gradient Compression for Communication-Efficient Distributed Training [74.43625662170284]
Large-scale distributed training of Deep Neural Networks (DNNs) on state-of-the-art platforms is expected to be severely communication constrained.
We propose a new compression technique that leverages similarity in the gradient distribution amongst learners to provide significantly improved scalability.
We experimentally demonstrate that ScaleCom has small overheads, directly reduces gradient traffic and provides high compression rates (65-400X) and excellent scalability (up to 64 learners and 8-12X larger batch sizes over standard training) without significant accuracy loss.
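One generic way to exploit similarity in the gradient distribution across learners is to let every learner reuse a single learner's top-k index set, so all sparse gradients share the same support and can be summed without per-learner index bookkeeping. The sketch below illustrates only that general idea under assumed names; it is not the paper's exact compression scheme.

```python
import numpy as np

def topk_indices(g, ratio=0.01):
    """Indices of the largest-magnitude entries of one learner's gradient."""
    k = max(1, int(g.size * ratio))
    return np.argpartition(np.abs(g), -k)[-k:]

# All learners communicate values at the SAME index set (taken from learner 0 here),
# so the sparse gradients align and can be reduced directly.
num_learners, dim = 4, 100000
grads = [np.random.randn(dim).astype(np.float32) for _ in range(num_learners)]
idx = topk_indices(grads[0], ratio=0.01)
reduced = np.zeros(dim, dtype=np.float32)
reduced[idx] = np.mean([g[idx] for g in grads], axis=0)
print("entries communicated per learner:", idx.size)
```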
arXiv Detail & Related papers (2021-04-21T02:22:10Z)
- An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC).
Our evaluation shows SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Topk, and DGC compressors, respectively.
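The gist of threshold-based sparsification with a statistical fit can be sketched as follows: fit a simple sparsity-inducing model to the gradient magnitudes (an exponential here, purely as an assumed example) and derive the threshold that keeps roughly the target fraction of entries. SIDCo's actual multi-stage fitting and its choice of distributions are described in the paper.

```python
import numpy as np

def fitted_threshold(grad, target_ratio=0.01):
    """Estimate a sparsification threshold from an exponential fit to |grad|:
    if |g| ~ Exp(mean mu), then P(|g| > t) = exp(-t / mu), so keeping a
    fraction r of the entries suggests t = mu * ln(1 / r)."""
    mu = np.mean(np.abs(grad))
    return mu * np.log(1.0 / target_ratio)

# Toy gradient whose magnitudes are roughly exponential, so the fit matches well.
mags = np.random.exponential(scale=0.5, size=100000)
grad = (mags * np.random.choice([-1.0, 1.0], size=mags.size)).astype(np.float32)
t = fitted_threshold(grad, target_ratio=0.01)
kept = float(np.mean(np.abs(grad) > t))
print(f"threshold {t:.3f} keeps {kept:.4%} of entries (target 1%)")
```

The appeal of a fitted threshold is that it sidesteps an exact top-k selection over the full gradient while still hitting a predictable compression ratio.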
arXiv Detail & Related papers (2021-01-26T13:06:00Z)
- Layer-Wise Data-Free CNN Compression [49.73757297936685]
We show how to generate layer-wise training data using only a pretrained network.
We present results for layer-wise compression using quantization and pruning.
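For the quantization half of that recipe, a generic symmetric per-layer weight quantizer looks roughly like the following; it is an illustration only and says nothing about how the paper generates its layer-wise training data.

```python
import numpy as np

def quantize_layer(w, num_bits=8):
    """Generic symmetric per-layer quantization: map weights to signed integers,
    dequantize, and report the worst-case rounding error for this layer."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax if np.any(w) else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax)
    deq = q * scale
    return deq, float(np.abs(w - deq).max())

w = np.random.randn(64, 64).astype(np.float32)  # stand-in for one layer's weights
deq, err = quantize_layer(w, num_bits=8)
print(f"max per-layer quantization error: {err:.5f}")
```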
arXiv Detail & Related papers (2020-11-18T03:00:05Z)
- Optimal Gradient Compression for Distributed and Federated Learning [9.711326718689492]
Communication between computing nodes in distributed learning is typically an unavoidable burden.
Recent advances in communication-efficient training algorithms have reduced this bottleneck by using compression techniques.
In this paper, we investigate the fundamental trade-off between the number of bits needed to encode compressed vectors and the compression error.
arXiv Detail & Related papers (2020-10-07T07:58:59Z)
- Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
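The setting studied here is plain synchronous data-parallel training with a compression step inserted before the gradients are averaged. A minimal simulation of that loop, with an assumed top-k compressor applied per worker, is sketched below; the specific schemes and the three key parameters the paper analyzes are in the paper itself.

```python
import numpy as np

def topk_sparsify(g, ratio=0.05):
    """Keep only the largest-magnitude entries of one worker's gradient."""
    k = max(1, int(g.size * ratio))
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -k)[-k:]
    out[idx] = g[idx]
    return out

# Simulated synchronous step: each worker sparsifies its local gradient, then all
# sparse gradients are averaged to form the shared update.
num_workers, dim = 4, 1000
worker_grads = [np.random.randn(dim).astype(np.float32) for _ in range(num_workers)]
avg_update = np.mean([topk_sparsify(g) for g in worker_grads], axis=0)
print("nonzero entries in the averaged update:", int(np.count_nonzero(avg_update)))
```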
arXiv Detail & Related papers (2020-09-19T17:28:11Z)
- PowerGossip: Practical Low-Rank Communication Compression in Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers.
Inspired by the PowerSGD algorithm for centralized deep learning, this algorithm uses power iteration steps to maximize the information transferred per bit.
arXiv Detail & Related papers (2020-08-04T09:14:52Z)
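A single power-iteration step that turns a matrix of model differences into a rank-1 pair of vectors (the only payload that has to be exchanged) can be sketched as follows. This is a generic illustration under assumed names; PowerGossip's actual protocol, including how vectors are reused across gossip rounds, is in the paper.

```python
import numpy as np

def rank1_power_step(delta, q):
    """One power-iteration step: project the difference matrix onto q, normalize,
    and project back, yielding a rank-1 factorization (p, q) to communicate."""
    p = delta @ q
    p /= np.linalg.norm(p) + 1e-12
    return p, delta.T @ p

# Toy model difference between two neighboring workers, plus a reusable query vector.
delta = np.random.randn(256, 128).astype(np.float32)
q = np.random.randn(128).astype(np.float32)
for _ in range(2):
    p, q = rank1_power_step(delta, q)
approx = np.outer(p, q)
print("relative error of the rank-1 approximation:",
      float(np.linalg.norm(delta - approx) / np.linalg.norm(delta)))
```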