Shifted Compression Framework: Generalizations and Improvements
- URL: http://arxiv.org/abs/2206.10452v1
- Date: Tue, 21 Jun 2022 15:00:04 GMT
- Title: Shifted Compression Framework: Generalizations and Improvements
- Authors: Egor Shulgin and Peter Richtárik
- Abstract summary: Communication is one of the key bottlenecks in the distributed training of large-scale machine learning models.
Lossy compression of exchanged information, such as gradients or models, is one of the most effective instruments to alleviate this issue.
- Score: 2.2147691173934967
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Communication is one of the key bottlenecks in the distributed training of
large-scale machine learning models, and lossy compression of exchanged
information, such as stochastic gradients or models, is one of the most
effective instruments to alleviate this issue. Among the most studied
compression techniques is the class of unbiased compression operators with
variance bounded by a multiple of the square norm of the vector we wish to
compress. By design, this variance may remain high, and only diminishes if the
input vector approaches zero. However, unless the model being trained is
overparameterized, there is no a priori reason for the vectors we wish to
compress to approach zero during the iterations of classical methods such as
distributed compressed SGD, which has adverse effects on the convergence
speed. Due to this issue, several more elaborate and seemingly very different
algorithms have been proposed recently, with the goal of circumventing this
issue. These methods are based on the idea of compressing the difference
between the vector we would normally wish to compress and some auxiliary vector
which changes throughout the iterative process. In this work we take a step
back, and develop a unified framework for studying such methods, conceptually,
and theoretically. Our framework incorporates methods compressing both
gradients and models, using unbiased and biased compressors, and sheds light on
the construction of the auxiliary vectors. Furthermore, our general framework
can lead to the improvement of several existing algorithms, and can produce new
algorithms. Finally, we performed several numerical experiments which
illustrate and support our theoretical findings.
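As a concrete illustration of the shifted-compression idea described in the abstract, the sketch below applies an unbiased random-k sparsifier to the difference x − h rather than to x itself, and drifts the shift h toward x so that future compression variance shrinks. This is a minimal DIANA-style sketch; the function names and the step size `alpha` are illustrative choices, not taken from the paper.

```python
import numpy as np

def rand_k(x, k, rng):
    """Unbiased random-k sparsification: keep k random coordinates, scale by d/k.
    Satisfies E[C(x)] = x with variance E||C(x) - x||^2 = (d/k - 1) * ||x||^2."""
    d = x.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(x)
    out[idx] = x[idx] * (d / k)
    return out

def shifted_compress(x, h, k, rng, alpha=0.5):
    """Compress the difference x - h instead of x itself. The receiver
    reconstructs x_hat = h + C(x - h); both sides update the shift h
    identically, so no extra communication is needed for the shift."""
    delta = rand_k(x - h, k, rng)
    x_hat = h + delta          # unbiased estimate of x, since rand_k is unbiased
    h_new = h + alpha * delta  # shift drifts toward x, shrinking future variance
    return x_hat, h_new
```

Because the compression error scales with ||x − h||² rather than ||x||², the estimate improves as the shift tracks the iterates, even when the gradients themselves do not vanish.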
Related papers
- Problem-dependent convergence bounds for randomized linear gradient compression [4.656302602746228]
In distributed optimization, the communication of model updates can be a performance bottleneck.
Gradient compression has been proposed as a means of increasing optimization throughput.
We study how the impact of compression on convergence can be characterized in terms of the norm of the Hessian of the objective.
arXiv Detail & Related papers (2024-11-19T22:26:42Z)
- Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance [55.01966743652196]
We propose a novel algorithm for distributed stochastic gradient descent (SGD) with compressed gradient communication in the parameter-server framework.
Our gradient compression technique, named flattened one-bit gradient descent (FO-SGD), relies on two simple algorithmic ideas.
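The summary above does not spell out FO-SGD's two algorithmic ideas, so the following is only a generic illustration of one-bit gradient encoding (a scaled sign compressor), not the FO-SGD scheme itself:

```python
import numpy as np

def sign_compress(g):
    """Generic one-bit compression sketch: transmit the sign of each
    coordinate (one bit each) plus a single float scale, here the mean
    absolute value. This is a biased, contractive compressor and is NOT
    the FO-SGD scheme, just an illustration of one-bit encoding."""
    scale = np.mean(np.abs(g))
    return scale * np.sign(g)
```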
arXiv Detail & Related papers (2024-05-17T21:17:27Z)
- Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
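The TopK compressor mentioned in this entry keeps only the k largest-magnitude coordinates of a vector; a minimal sketch:

```python
import numpy as np

def top_k(x, k):
    """Biased top-k sparsification: keep the k largest-magnitude coordinates,
    zero out the rest. A contractive compressor satisfying
    ||top_k(x) - x||^2 <= (1 - k/d) * ||x||^2."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of k largest |x_i|
    out[idx] = x[idx]
    return out
```

Unlike random-k, this operator is deterministic and biased, which is why its analysis relies on the contraction property rather than unbiasedness.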
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
- Optimal Compression of Unit Norm Vectors in the High Distortion Regime [30.6205706348233]
We investigate methods for compressing a unit-norm vector into the minimum number of bits while still allowing an acceptable level of distortion in recovery.
Our study considers both biased and unbiased compression methods and determines the optimal compression rates.
While the results are a mix of new and known, they are compiled in this paper for completeness.
arXiv Detail & Related papers (2023-07-16T04:23:57Z)
- EF-BV: A Unified Theory of Error Feedback and Variance Reduction Mechanisms for Biased and Unbiased Compression in Distributed Optimization [7.691755449724637]
In distributed or federated optimization and learning, communication between the different computing units is often the bottleneck.
There are two classes of compression operators and separate algorithms making use of them.
We propose a new algorithm, recovering DIANA and EF21 as particular cases.
arXiv Detail & Related papers (2022-05-09T10:44:23Z)
- Unified Multivariate Gaussian Mixture for Efficient Neural Image Compression [151.3826781154146]
Modeling latent variables with priors and hyperpriors is an essential problem in variational image compression.
We find inter-correlations and intra-correlations exist when observing latent variables in a vectorized perspective.
Our model has better rate-distortion performance and an impressive $3.18\times$ compression speed-up.
arXiv Detail & Related papers (2022-03-21T11:44:17Z)
- Distributed Methods with Absolute Compression and Error Compensation [1.52292571922932]
Communication compression is a powerful approach to alleviating the communication bottleneck in distributed learning.
In this paper, we generalize the analysis of EC-SGD with absolute compression to the arbitrary sampling strategy.
Our rates improve upon the previously known ones in this setting.
arXiv Detail & Related papers (2022-03-04T15:41:14Z)
- Optimal Gradient Compression for Distributed and Federated Learning [9.711326718689492]
Communication between computing nodes in distributed learning is typically an unavoidable burden.
Recent advances in communication-efficient training algorithms have reduced this bottleneck by using compression techniques.
In this paper, we investigate the fundamental trade-off between the number of bits needed to encode compressed vectors and the compression error.
arXiv Detail & Related papers (2020-10-07T07:58:59Z)
- PowerGossip: Practical Low-Rank Communication Compression in Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers.
Inspired by PowerSGD for centralized deep learning, this algorithm uses power iteration steps to maximize the information transferred per bit.
arXiv Detail & Related papers (2020-08-04T09:14:52Z)
- Linear Convergent Decentralized Optimization with Compression [50.44269451541387]
Existing decentralized algorithms with compression mainly focus on compressing DGD-type algorithms.
Motivated by primal-dual algorithms, this paper proposes the first LinEAr convergent Decentralized algorithm with compression, LEAD.
arXiv Detail & Related papers (2020-07-01T04:35:00Z)
- On Biased Compression for Distributed Learning [55.89300593805943]
We show for the first time that biased compressors can lead to linear convergence rates both in the single node and distributed settings.
We propose several new biased compressors with promising theoretical guarantees and practical performance.
arXiv Detail & Related papers (2020-02-27T19:52:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.