Optimal Gradient Compression for Distributed and Federated Learning
- URL: http://arxiv.org/abs/2010.03246v1
- Date: Wed, 7 Oct 2020 07:58:59 GMT
- Title: Optimal Gradient Compression for Distributed and Federated Learning
- Authors: Alyazeed Albasyoni, Mher Safaryan, Laurent Condat, Peter Richtárik
- Abstract summary: Communication between computing nodes in distributed learning is typically an unavoidable burden.
Recent advances in communication-efficient training algorithms have reduced this bottleneck by using compression techniques.
In this paper, we investigate the fundamental trade-off between the number of bits needed to encode compressed vectors and the compression error.
- Score: 9.711326718689492
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Communicating information, like gradient vectors, between computing nodes in
distributed and federated learning is typically an unavoidable burden,
resulting in scalability issues. Indeed, communication might be slow and
costly. Recent advances in communication-efficient training algorithms have
reduced this bottleneck by using compression techniques, in the form of
sparsification, quantization, or low-rank approximation. Since compression is a
lossy, or inexact, process, the iteration complexity is typically worsened; but
the total communication complexity can improve significantly, possibly leading
to large computation time savings. In this paper, we investigate the
fundamental trade-off between the number of bits needed to encode compressed
vectors and the compression error. We perform both worst-case and average-case
analysis, providing tight lower bounds. In the worst-case analysis, we
introduce an efficient compression operator, Sparse Dithering, which is very
close to the lower bound. In the average-case analysis, we design a simple
compression operator, Spherical Compression, which naturally achieves the lower
bound. Thus, our new compression schemes significantly outperform the state of
the art. We conduct numerical experiments to illustrate this improvement.
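The abstract does not spell out the Sparse Dithering or Spherical Compression operators, so the snippet below is only a minimal sketch of a standard unbiased random-dithering quantizer (in the spirit of QSGD) to illustrate the bits-versus-error trade-off the paper studies; the function name `random_dithering` and the level parameter `s` are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def random_dithering(x, s=4, rng=None):
    """Unbiased random-dithering quantizer (QSGD-style sketch, not the paper's operator).

    Each coordinate is mapped to one of s+1 levels of |x_i| / ||x||, with the
    rounding probabilities chosen so that E[Q(x)] = x. Larger s means more
    bits per coordinate but smaller compression error.
    """
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)
    # scaled magnitudes in [0, s]
    y = s * np.abs(x) / norm
    lower = np.floor(y)
    # round up with probability equal to the fractional part (keeps the estimator unbiased)
    levels = lower + (rng.random(x.shape) < (y - lower))
    return norm * np.sign(x) * levels / s

# Quick check of the bits/error trade-off: more levels -> smaller error.
rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
for s in (1, 4, 16, 64):
    err = np.mean([np.linalg.norm(random_dithering(x, s, rng) - x) ** 2 for _ in range(20)])
    print(f"s={s:3d}  mean squared error ~ {err:.1f}")
```

With a naive encoding, each coordinate costs roughly a sign bit plus about log2(s+1) bits for its level (plus one float for the norm), so increasing s buys a smaller compression error at the price of more bits per entry.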
Related papers
- Differential error feedback for communication-efficient decentralized learning [48.924131251745266]
We propose a new decentralized communication-efficient learning approach that blends differential quantization with error feedback (a generic error-feedback sketch appears after this list).
We show that the resulting communication-efficient strategy is stable both in terms of mean-square error and average bit rate.
The results establish that, in the small step-size regime and with a finite number of bits, it is possible to attain the performance achievable in the absence of compression.
arXiv Detail & Related papers (2024-06-26T15:11:26Z)
- Compression of Structured Data with Autoencoders: Provable Benefit of Nonlinearities and Depth [83.15263499262824]
We prove that gradient descent converges to a solution that completely disregards the sparse structure of the input.
We show how to improve upon Gaussian performance for the compression of sparse data by adding a denoising function to a shallow architecture.
We validate our findings on image datasets, such as CIFAR-10 and MNIST.
arXiv Detail & Related papers (2024-02-07T16:32:29Z)
- Shifted Compression Framework: Generalizations and Improvements [2.2147691173934967]
Communication is one of the key bottlenecks in the distributed training of large-scale machine learning models.
Lossy compression of exchanged information, such as gradients or models, is one of the most effective instruments to alleviate this issue.
arXiv Detail & Related papers (2022-06-21T15:00:04Z)
- Lower Bounds and Nearly Optimal Algorithms in Distributed Learning with Communication Compression [33.217552987061474]
Recent advances in distributed optimization and learning have shown that communication compression is one of the most effective means of reducing communication.
arXiv Detail & Related papers (2022-06-08T03:36:34Z)
- Unified Multivariate Gaussian Mixture for Efficient Neural Image Compression [151.3826781154146]
Modeling latent variables with priors and hyperpriors is an essential problem in variational image compression.
We find that inter-correlations and intra-correlations exist when observing latent variables from a vectorized perspective.
Our model has better rate-distortion performance and an impressive $3.18\times$ compression speed up.
arXiv Detail & Related papers (2022-03-21T11:44:17Z)
- CD-SGD: Distributed Stochastic Gradient Descent with Compression and Delay Compensation [3.0786359925181315]
Communication overhead is the key challenge in distributed training.
Gradient compression techniques can greatly alleviate the impact of communication overhead.
However, gradient compression brings in extra cost, which will delay the next training iteration.
arXiv Detail & Related papers (2021-06-21T01:15:12Z)
- MergeComp: A Compression Scheduler for Scalable Communication-Efficient Distributed Training [8.150621147942449]
MergeComp is a compression scheduler to optimize the scalability of communication-efficient distributed training.
It can improve the performance of compression algorithms by up to 3.83x without losing accuracy.
It can even achieve a scaling factor of up to 99% for distributed training over high-speed networks.
arXiv Detail & Related papers (2021-03-28T18:26:55Z)
- A Linearly Convergent Algorithm for Decentralized Optimization: Sending Less Bits for Free! [72.31332210635524]
Decentralized optimization methods enable on-device training of machine learning models without a central coordinator.
We propose a new randomized first-order method which tackles the communication bottleneck by applying randomized compression operators.
We prove that our method can solve the problems without any increase in the number of communications compared to the baseline.
arXiv Detail & Related papers (2020-11-03T13:35:53Z)
- PowerGossip: Practical Low-Rank Communication Compression in Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers.
Inspired by the PowerSGD algorithm for centralized deep learning, this algorithm uses power iteration steps to maximize the information transferred per bit.
arXiv Detail & Related papers (2020-08-04T09:14:52Z)
- Linear Convergent Decentralized Optimization with Compression [50.44269451541387]
Existing decentralized algorithms with compression mainly focus on compressing DGD-type algorithms.
Motivated by primal-dual algorithms, this paper proposes the first LinEAr convergent Decentralized algorithm with compression, LEAD.
arXiv Detail & Related papers (2020-07-01T04:35:00Z)
- Uncertainty Principle for Communication Compression in Distributed and Federated Learning and the Search for an Optimal Compressor [5.09755285351264]
We consider an unbiased compression method inspired by the Kashin representation of vectors, which we call Kashin compression (KC).
KC enjoys a dimension-independent variance bound, for which we derive an explicit formula even in the regime when only a few bits need to be communicated per vector entry.
arXiv Detail & Related papers (2020-02-20T17:20:51Z)
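The first related paper above (differential quantization with error feedback) is summarized only briefly; as a paper-agnostic illustration of the error-feedback template, here is a minimal single-worker sketch that compresses each update with a top-k sparsifier, stores the discarded part, and re-injects it at the next step. The helpers `top_k` and `sgd_with_error_feedback` are hypothetical names for this sketch, not code from any of the papers listed.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest (a biased compressor)."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def sgd_with_error_feedback(grad_fn, x0, lr=0.1, k=10, steps=500):
    """Single-worker sketch of compressed SGD with error feedback:
    the part of the update removed by the compressor is remembered
    in a residual e and re-injected at the next step."""
    x = x0.copy()
    e = np.zeros_like(x0)          # accumulated compression error
    for _ in range(steps):
        g = grad_fn(x)
        p = lr * g + e             # correct the update with the stored residual
        c = top_k(p, k)            # only the compressed update would be transmitted
        e = p - c                  # remember what was lost
        x -= c
    return x

# Toy usage: minimize ||x||^2 / 2, whose gradient is x.
x_final = sgd_with_error_feedback(lambda x: x, x0=np.ones(100), lr=0.1, k=10, steps=500)
print(np.linalg.norm(x_final))  # should be close to 0
```

The same template applies to other (possibly biased) compressors; the residual e is what keeps the compressed iterates from drifting away from plain SGD.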
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.