Accordion: Adaptive Gradient Communication via Critical Learning Regime
Identification
- URL: http://arxiv.org/abs/2010.16248v1
- Date: Thu, 29 Oct 2020 16:41:44 GMT
- Title: Accordion: Adaptive Gradient Communication via Critical Learning Regime
Identification
- Authors: Saurabh Agarwal, Hongyi Wang, Kangwook Lee, Shivaram Venkataraman,
Dimitris Papailiopoulos
- Abstract summary: Distributed model training suffers from communication bottlenecks due to frequent model updates transmitted across compute nodes.
To alleviate these bottlenecks, practitioners use gradient compression techniques like sparsification, quantization, or low-rank updates.
In this work, we show that such performance degradation due to choosing a high compression ratio is not fundamental.
An adaptive compression strategy can reduce communication while maintaining final test accuracy.
- Score: 12.517161466778655
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distributed model training suffers from communication bottlenecks due to
frequent model updates transmitted across compute nodes. To alleviate these
bottlenecks, practitioners use gradient compression techniques like
sparsification, quantization, or low-rank updates. The techniques usually
require choosing a static compression ratio, often requiring users to balance
the trade-off between model accuracy and per-iteration speedup. In this work,
we show that such performance degradation due to choosing a high compression
ratio is not fundamental. An adaptive compression strategy can reduce
communication while maintaining final test accuracy. Inspired by recent
findings on critical learning regimes, in which small gradient errors can have
an irrecoverable impact on model performance, we propose Accordion, a simple yet
effective adaptive compression algorithm. While Accordion maintains a high
enough compression rate on average, it avoids over-compressing gradients
whenever training is in a critical learning regime, as detected by a simple gradient-norm based
criterion. Our extensive experimental study over a number of machine learning
tasks in distributed environments indicates that Accordion maintains similar
model accuracy to uncompressed training, yet achieves up to 5.5x better
compression and up to 4.1x end-to-end speedup over static approaches. We show
that Accordion also works for adjusting the batch size, another popular
strategy for alleviating communication bottlenecks.
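To make the gradient-norm criterion concrete, here is a minimal sketch of the kind of switching rule the abstract describes. It is not the authors' implementation: the two compressor objects, the relative-change rule, and the threshold value are illustrative assumptions.

```python
# Hedged sketch of a gradient-norm based switching rule in the spirit of
# Accordion (not the authors' code). The compressor objects, the
# relative-change criterion, and the threshold are illustrative assumptions.
import torch


def gradient_norm(gradients):
    """L2 norm of all gradients, flattened into a single vector."""
    return torch.sqrt(sum(g.detach().float().pow(2).sum() for g in gradients))


class AdaptiveCompression:
    """Pick a mild or an aggressive compressor based on gradient-norm changes."""

    def __init__(self, mild_compressor, aggressive_compressor, threshold=0.5):
        self.mild = mild_compressor              # e.g. a high-rank update
        self.aggressive = aggressive_compressor  # e.g. a low-rank update
        self.threshold = threshold               # illustrative value
        self.prev_norm = None

    def choose(self, gradients):
        """Return the compressor to use for the current communication round."""
        norm = gradient_norm(gradients)
        if self.prev_norm is None:
            self.prev_norm = norm
            return self.mild  # be conservative until there is history
        rel_change = (norm - self.prev_norm).abs() / (self.prev_norm + 1e-12)
        self.prev_norm = norm
        # A large change in gradient norm is treated as a critical regime:
        # compress mildly there, aggressively everywhere else.
        return self.mild if rel_change > self.threshold else self.aggressive
```

In a training loop, `choose` would be called once per epoch or every few iterations, and the returned compressor applied to the gradients before communication.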
Related papers
- Beyond Throughput and Compression Ratios: Towards High End-to-end Utility of Gradient Compression [13.255861297820326]
Gradient compression can reduce the volume of gradient data communicated during training.
In practice, gradient compression schemes fail to accelerate training end to end while also preserving accuracy.
We identify common issues in previous gradient compression systems and evaluation methodologies.
arXiv Detail & Related papers (2024-07-01T15:32:28Z)
- Differential error feedback for communication-efficient decentralized learning [48.924131251745266]
We propose a new decentralized communication-efficient learning approach that blends differential quantization with error feedback.
We show that the resulting communication-efficient strategy is stable both in terms of mean-square error and average bit rate.
The results establish that, in the small step-size regime and with a finite number of bits, it is possible to attain the performance achievable in the absence of compression.
arXiv Detail & Related papers (2024-06-26T15:11:26Z)
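The entry above pairs a quantizer with error feedback. As a generic illustration of the error-feedback mechanism (not the paper's differential scheme), the sketch below accumulates the quantization error locally and adds it back before the next compression step; the sign-based quantizer is an arbitrary stand-in for any lossy compressor.

```python
# Generic error-feedback sketch (not the paper's differential scheme).
# The sign-based quantizer is an illustrative stand-in for any compressor.
import torch


def sign_quantize(x):
    """Lossy compressor: keep only signs, scaled to preserve the mean magnitude."""
    return torch.sign(x) * x.abs().mean()


class ErrorFeedbackCompressor:
    def __init__(self):
        self.residual = None  # compression error carried over from past rounds

    def compress(self, grad):
        if self.residual is None:
            self.residual = torch.zeros_like(grad)
        corrected = grad + self.residual        # add back previously lost error
        compressed = sign_quantize(corrected)   # lossy compression step
        self.residual = corrected - compressed  # remember what was lost
        return compressed                       # this is what gets transmitted
```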
- Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
- Learning Accurate Performance Predictors for Ultrafast Automated Model Compression [86.22294249097203]
We propose an ultrafast automated model compression framework called SeerNet for flexible network deployment.
Our method achieves competitive accuracy-complexity trade-offs with significant reduction of the search cost.
arXiv Detail & Related papers (2023-04-13T10:52:49Z)
- Optimal Rate Adaption in Federated Learning with Compressed Communications [28.16239232265479]
Federated Learning incurs high communication overhead, which can be greatly alleviated by compressing model updates.
The trade-off between compression and model accuracy in the networked environment remains unclear.
We present a framework to maximize the final model accuracy by strategically adjusting the compression in each iteration.
arXiv Detail & Related papers (2021-12-13T14:26:15Z)
- Wyner-Ziv Gradient Compression for Federated Learning [4.619828919345114]
Gradient compression is an effective method to reduce communication load by transmitting compressed gradients.
This paper proposes a practical gradient compression scheme for federated learning that uses historical gradients as side information to compress the current gradients.
We also implement our gradient quantization method on real datasets, where it outperforms previous schemes.
arXiv Detail & Related papers (2021-11-16T07:55:43Z)
- Communication-Compressed Adaptive Gradient Method for Distributed Nonconvex Optimization [21.81192774458227]
One of the major bottlenecks is the large communication cost between the central server and the local workers.
Our proposed distributed learning framework features an effective gradient compression strategy.
arXiv Detail & Related papers (2021-11-01T04:54:55Z)
- ScaleCom: Scalable Sparsified Gradient Compression for Communication-Efficient Distributed Training [74.43625662170284]
Large-scale distributed training of Deep Neural Networks (DNNs) on state-of-the-art platforms is expected to be severely communication constrained.
We propose a new compression technique that leverages similarity in the gradient distribution amongst learners to provide significantly improved scalability.
We experimentally demonstrate that ScaleCom has small overheads, directly reduces gradient traffic and provides high compression rates (65-400X) and excellent scalability (up to 64 learners and 8-12X larger batch sizes over standard training) without significant accuracy loss.
arXiv Detail & Related papers (2021-04-21T02:22:10Z)
- Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z)
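As a toy illustration of the worker-averaging scheme described in the Sparse Communication entry above, the sketch below applies top-k sparsification to each worker's gradient and averages the results in a single process; real systems exchange (index, value) pairs over the network, and the choice of k here is arbitrary.

```python
# Toy, single-process sketch of workers averaging sparsified gradients
# (real systems exchange (index, value) pairs rather than dense tensors).
import torch


def top_k_sparsify(grad, k):
    """Keep only the k largest-magnitude entries of a gradient tensor."""
    flat = grad.flatten()
    _, indices = flat.abs().topk(k)
    sparse = torch.zeros_like(flat)
    sparse[indices] = flat[indices]
    return sparse.view_as(grad)


def average_sparse_gradients(worker_grads, k):
    """Each worker sparsifies locally, then all contributions are averaged."""
    sparsified = [top_k_sparsify(g, k) for g in worker_grads]
    return torch.stack(sparsified).mean(dim=0)


# Example: 4 workers, each holding a 10-element gradient, keep 3 entries each.
grads = [torch.randn(10) for _ in range(4)]
averaged = average_sparse_gradients(grads, k=3)
```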
- PowerGossip: Practical Low-Rank Communication Compression in Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers.
Inspired by PowerSGD for centralized deep learning, this algorithm uses power iteration steps to maximize the information transferred per bit.
arXiv Detail & Related papers (2020-08-04T09:14:52Z)
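To illustrate the power-step idea in the PowerGossip entry, here is a rank-1 power-iteration sketch that compresses a matrix of model differences into two vectors. The single iteration, the rank-1 choice, and the random warm start are simplifications rather than the paper's exact procedure.

```python
# Rank-1 power-iteration sketch for compressing a matrix of model differences
# between two neighbors (a simplification; not the PowerGossip implementation).
import torch


def rank1_power_compress(diff, q=None):
    """One power step: approximate `diff` (m x n) by the outer product p q^T."""
    _, n = diff.shape
    if q is None:
        q = torch.randn(n)          # a warm start would be reused in practice
    p = diff @ q
    p = p / (p.norm() + 1e-12)      # normalize the left factor
    q = diff.t() @ p                # the right factor absorbs the scale
    return p, q                     # only these two vectors need to be sent


# Example: a 256 x 128 difference matrix is summarized by 256 + 128 numbers.
diff = torch.randn(256, 128)
p, q = rank1_power_compress(diff)
reconstruction = torch.outer(p, q)  # what the receiving worker rebuilds
```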