Optimal Gradient Compression for Distributed and Federated Learning
- URL: http://arxiv.org/abs/2010.03246v1
- Date: Wed, 7 Oct 2020 07:58:59 GMT
- Title: Optimal Gradient Compression for Distributed and Federated Learning
- Authors: Alyazeed Albasyoni, Mher Safaryan, Laurent Condat, Peter Richtárik
- Abstract summary: Communication between computing nodes in distributed learning is typically an unavoidable burden.
Recent advances in communication-efficient training algorithms have reduced this bottleneck by using compression techniques.
In this paper, we investigate the fundamental trade-off between the number of bits needed to encode compressed vectors and the compression error.
- Score: 9.711326718689492
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Communicating information, like gradient vectors, between computing nodes in
distributed and federated learning is typically an unavoidable burden,
resulting in scalability issues. Indeed, communication might be slow and
costly. Recent advances in communication-efficient training algorithms have
reduced this bottleneck by using compression techniques, in the form of
sparsification, quantization, or low-rank approximation. Since compression is a
lossy, or inexact, process, the iteration complexity is typically worsened; but
the total communication complexity can improve significantly, possibly leading
to large computation time savings. In this paper, we investigate the
fundamental trade-off between the number of bits needed to encode compressed
vectors and the compression error. We perform both worst-case and average-case
analysis, providing tight lower bounds. In the worst-case analysis, we
introduce an efficient compression operator, Sparse Dithering, which is very
close to the lower bound. In the average-case analysis, we design a simple
compression operator, Spherical Compression, which naturally achieves the lower
bound. Thus, our new compression schemes significantly outperform the state of
the art. We conduct numerical experiments to illustrate this improvement.
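The abstract does not spell out the Sparse Dithering or Spherical Compression operators, so the snippet below is only a minimal sketch of a standard unbiased random-dithering quantizer (in the spirit of QSGD) to illustrate the bits-versus-error trade-off the paper studies; the function name `random_dithering` and the level parameter `s` are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def random_dithering(x, s=4, rng=None):
    """Unbiased random-dithering quantizer (QSGD-style sketch, not the paper's operator).

    Each coordinate is mapped to one of s+1 levels of |x_i| / ||x||, with the
    rounding probabilities chosen so that E[Q(x)] = x. Larger s means more
    bits per coordinate but smaller compression error.
    """
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)
    # scaled magnitudes in [0, s]
    y = s * np.abs(x) / norm
    lower = np.floor(y)
    # round up with probability equal to the fractional part (keeps the estimator unbiased)
    levels = lower + (rng.random(x.shape) < (y - lower))
    return norm * np.sign(x) * levels / s

# Quick check of the bits/error trade-off: more levels -> smaller error.
rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
for s in (1, 4, 16, 64):
    err = np.mean([np.linalg.norm(random_dithering(x, s, rng) - x) ** 2 for _ in range(20)])
    print(f"s={s:3d}  mean squared error ~ {err:.1f}")
```

With a naive encoding, each coordinate costs roughly a sign bit plus about log2(s+1) bits for its level (plus one float for the norm), so increasing s buys a smaller compression error at the price of more bits per entry.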
Related papers
- Differential error feedback for communication-efficient decentralized learning [48.924131251745266]
We propose a new decentralized communication-efficient learning approach that blends differential quantization with error feedback (a generic error-feedback sketch appears after this list).
We show that the resulting communication-efficient strategy is stable both in terms of mean-square error and average bit rate.
The results establish that, in the small step-size regime and with a finite number of bits, it is possible to attain the performance achievable in the absence of compression.
arXiv Detail & Related papers (2024-06-26T15:11:26Z)
- Compression of Structured Data with Autoencoders: Provable Benefit of Nonlinearities and Depth [83.15263499262824]
We prove that gradient descent converges to a solution that completely disregards the sparse structure of the input.
We show how to improve upon Gaussian performance for the compression of sparse data by adding a denoising function to a shallow architecture.
We validate our findings on image datasets, such as CIFAR-10 and MNIST.
arXiv Detail & Related papers (2024-02-07T16:32:29Z)
- Shifted Compression Framework: Generalizations and Improvements [2.2147691173934967]
Communication is one of the key bottlenecks in the distributed training of large-scale machine learning models.
Lossy compression of exchanged information, such as gradients or models, is one of the most effective instruments to alleviate this issue.
arXiv Detail & Related papers (2022-06-21T15:00:04Z)
- Lower Bounds and Nearly Optimal Algorithms in Distributed Learning with Communication Compression [33.217552987061474]
Recent advances in distributed optimization and learning have shown that communication compression is one of the most effective means of reducing communication.
arXiv Detail & Related papers (2022-06-08T03:36:34Z)
- Unified Multivariate Gaussian Mixture for Efficient Neural Image Compression [151.3826781154146]
Modeling latent variables with priors and hyperpriors is an essential problem in variational image compression.
We find that inter-correlations and intra-correlations exist when observing latent variables from a vectorized perspective.
Our model has better rate-distortion performance and an impressive $3.18\times$ compression speed up.
arXiv Detail & Related papers (2022-03-21T11:44:17Z)
- CD-SGD: Distributed Stochastic Gradient Descent with Compression and Delay Compensation [3.0786359925181315]
Communication overhead is the key challenge in distributed training.
Gradient compression techniques can greatly alleviate the impact of communication overhead.
However, gradient compression brings in extra cost, which will delay the next training iteration.
arXiv Detail & Related papers (2021-06-21T01:15:12Z)
- MergeComp: A Compression Scheduler for Scalable Communication-Efficient Distributed Training [8.150621147942449]
MergeComp is a compression scheduler to optimize the scalability of communication-efficient distributed training.
It can improve the performance of compression algorithms by up to 3.83x without losing accuracy.
It can even achieve a scaling factor of up to 99% for distributed training over high-speed networks.
arXiv Detail & Related papers (2021-03-28T18:26:55Z)
- A Linearly Convergent Algorithm for Decentralized Optimization: Sending Less Bits for Free! [72.31332210635524]
Decentralized optimization methods enable on-device training of machine learning models without a central coordinator.
We propose a new randomized first-order method which tackles the communication bottleneck by applying randomized compression operators.
We prove that our method can solve the problems without any increase in the number of communications compared to the baseline.
arXiv Detail & Related papers (2020-11-03T13:35:53Z)
- PowerGossip: Practical Low-Rank Communication Compression in Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers.
Inspired by the PowerSGD algorithm for centralized deep learning, this algorithm uses power iteration steps to maximize the information transferred per bit.
arXiv Detail & Related papers (2020-08-04T09:14:52Z)
- Linear Convergent Decentralized Optimization with Compression [50.44269451541387]
Existing decentralized algorithms with compression mainly focus on compressing DGD-type algorithms.
Motivated by primal-dual algorithms, this paper proposes the first LinEAr convergent Decentralized algorithm with compression, LEAD.
arXiv Detail & Related papers (2020-07-01T04:35:00Z)
- Uncertainty Principle for Communication Compression in Distributed and Federated Learning and the Search for an Optimal Compressor [5.09755285351264]
We consider an unbiased compression method inspired by the Kashin representation of vectors, which we call Kashin compression (KC).
KC enjoys a dimension-independent variance bound, for which we derive an explicit formula even in the regime when only a few bits need to be communicated per vector entry.
arXiv Detail & Related papers (2020-02-20T17:20:51Z)
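The first related paper above (differential quantization with error feedback) is summarized only briefly; as a paper-agnostic illustration of the error-feedback template, here is a minimal single-worker sketch that compresses each update with a top-k sparsifier, stores the discarded part, and re-injects it at the next step. The helpers `top_k` and `sgd_with_error_feedback` are hypothetical names for this sketch, not code from any of the papers listed.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest (a biased compressor)."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def sgd_with_error_feedback(grad_fn, x0, lr=0.1, k=10, steps=500):
    """Single-worker sketch of compressed SGD with error feedback:
    the part of the update removed by the compressor is remembered
    in a residual e and re-injected at the next step."""
    x = x0.copy()
    e = np.zeros_like(x0)          # accumulated compression error
    for _ in range(steps):
        g = grad_fn(x)
        p = lr * g + e             # correct the update with the stored residual
        c = top_k(p, k)            # only the compressed update would be transmitted
        e = p - c                  # remember what was lost
        x -= c
    return x

# Toy usage: minimize ||x||^2 / 2, whose gradient is x.
x_final = sgd_with_error_feedback(lambda x: x, x0=np.ones(100), lr=0.1, k=10, steps=500)
print(np.linalg.norm(x_final))  # should be close to 0
```

The same template applies to other (possibly biased) compressors; the residual e is what keeps the compressed iterates from drifting away from plain SGD.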
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.