Shifted Compression Framework: Generalizations and Improvements
- URL: http://arxiv.org/abs/2206.10452v1
- Date: Tue, 21 Jun 2022 15:00:04 GMT
- Title: Shifted Compression Framework: Generalizations and Improvements
- Authors: Egor Shulgin and Peter Richtárik
- Abstract summary: Communication is one of the key bottlenecks in the distributed training of large-scale machine learning models.
Lossy compression of exchanged information, such as gradients or models, is one of the most effective instruments to alleviate this issue.
- Score: 2.2147691173934967
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Communication is one of the key bottlenecks in the distributed training of
large-scale machine learning models, and lossy compression of exchanged
information, such as stochastic gradients or models, is one of the most
effective instruments to alleviate this issue. Among the most studied
compression techniques is the class of unbiased compression operators with
variance bounded by a multiple of the square norm of the vector we wish to
compress. By design, this variance may remain high, and only diminishes if the
input vector approaches zero. However, unless the model being trained is
overparameterized, there is no a priori reason for the vectors we wish to
compress to approach zero during the iterations of classical methods such as
distributed compressed SGD, which has adverse effects on the convergence
speed. Due to this issue, several more elaborate and seemingly very different
algorithms have been proposed recently, with the goal of circumventing this
issue. These methods are based on the idea of compressing the difference
between the vector we would normally wish to compress and some auxiliary vector
which changes throughout the iterative process. In this work we take a step
back, and develop a unified framework for studying such methods, conceptually,
and theoretically. Our framework incorporates methods compressing both
gradients and models, using unbiased and biased compressors, and sheds light on
the construction of the auxiliary vectors. Furthermore, our general framework
can lead to the improvement of several existing algorithms, and can produce new
algorithms. Finally, we performed several numerical experiments which
illustrate and support our theoretical findings.
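As a concrete illustration of the shifted-compression idea described in the abstract, the sketch below applies an unbiased random-k sparsifier to the difference x − h rather than to x itself, and drifts the shift h toward x so that future compression variance shrinks. This is a minimal DIANA-style sketch; the function names and the step size `alpha` are illustrative choices, not taken from the paper.

```python
import numpy as np

def rand_k(x, k, rng):
    """Unbiased random-k sparsification: keep k random coordinates, scale by d/k.
    Satisfies E[C(x)] = x with variance E||C(x) - x||^2 = (d/k - 1) * ||x||^2."""
    d = x.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(x)
    out[idx] = x[idx] * (d / k)
    return out

def shifted_compress(x, h, k, rng, alpha=0.5):
    """Compress the difference x - h instead of x itself. The receiver
    reconstructs x_hat = h + C(x - h); both sides update the shift h
    identically, so no extra communication is needed for the shift."""
    delta = rand_k(x - h, k, rng)
    x_hat = h + delta          # unbiased estimate of x, since rand_k is unbiased
    h_new = h + alpha * delta  # shift drifts toward x, shrinking future variance
    return x_hat, h_new
```

Because the compression error scales with ||x − h||² rather than ||x||², the estimate improves as the shift tracks the iterates, even when the gradients themselves do not vanish.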
Related papers
- Problem-dependent convergence bounds for randomized linear gradient compression [4.656302602746228]
In distributed optimization, the communication of model updates can be a performance bottleneck.
Gradient compression has been proposed as a means of increasing optimization throughput.
We study how the impact of compression on convergence can be characterized in terms of the norm of the Hessian of the objective.
arXiv Detail & Related papers (2024-11-19T22:26:42Z)
- Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance [55.01966743652196]
We propose a novel algorithm for distributed stochastic gradient descent (SGD) with compressed gradient communication in the parameter-server framework.
Our gradient compression technique, named flattened one-bit gradient descent (FO-SGD), relies on two simple algorithmic ideas.
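The summary above does not spell out FO-SGD's two algorithmic ideas, so the following is only a generic illustration of one-bit gradient encoding (a scaled sign compressor), not the FO-SGD scheme itself:

```python
import numpy as np

def sign_compress(g):
    """Generic one-bit compression sketch: transmit the sign of each
    coordinate (one bit each) plus a single float scale, here the mean
    absolute value. This is a biased, contractive compressor and is NOT
    the FO-SGD scheme, just an illustration of one-bit encoding."""
    scale = np.mean(np.abs(g))
    return scale * np.sign(g)
```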
arXiv Detail & Related papers (2024-05-17T21:17:27Z)
- Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
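The TopK compressor mentioned in this entry keeps only the k largest-magnitude coordinates of a vector; a minimal sketch:

```python
import numpy as np

def top_k(x, k):
    """Biased top-k sparsification: keep the k largest-magnitude coordinates,
    zero out the rest. A contractive compressor satisfying
    ||top_k(x) - x||^2 <= (1 - k/d) * ||x||^2."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of k largest |x_i|
    out[idx] = x[idx]
    return out
```

Unlike random-k, this operator is deterministic and biased, which is why its analysis relies on the contraction property rather than unbiasedness.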
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
- Optimal Compression of Unit Norm Vectors in the High Distortion Regime [30.6205706348233]
We investigate methods for compressing a unit-norm vector into the minimum number of bits while still allowing an acceptable level of distortion in recovery.
Our study considers both biased and unbiased compression methods and determines the optimal compression rates.
While the results are a mix of new and known, they are compiled in this paper for completeness.
arXiv Detail & Related papers (2023-07-16T04:23:57Z)
- EF-BV: A Unified Theory of Error Feedback and Variance Reduction Mechanisms for Biased and Unbiased Compression in Distributed Optimization [7.691755449724637]
In distributed or federated optimization and learning, communication between the different computing units is often the bottleneck.
There are two classes of compression operators and separate algorithms making use of them.
We propose a new algorithm, recovering DIANA and EF21 as particular cases.
arXiv Detail & Related papers (2022-05-09T10:44:23Z)
- Unified Multivariate Gaussian Mixture for Efficient Neural Image Compression [151.3826781154146]
Modeling latent variables with priors and hyperpriors is an essential problem in variational image compression.
We find inter-correlations and intra-correlations exist when observing latent variables in a vectorized perspective.
Our model has better rate-distortion performance and an impressive $3.18\times$ compression speed-up.
arXiv Detail & Related papers (2022-03-21T11:44:17Z)
- Distributed Methods with Absolute Compression and Error Compensation [1.52292571922932]
Communication compression is a powerful approach to alleviating the communication bottleneck in distributed learning.
In this paper, we generalize the analysis of EC-SGD with absolute compression to the arbitrary sampling strategy.
Our rates improve upon the previously known ones in this setting.
arXiv Detail & Related papers (2022-03-04T15:41:14Z)
- Optimal Gradient Compression for Distributed and Federated Learning [9.711326718689492]
Communication between computing nodes in distributed learning is typically an unavoidable burden.
Recent advances in communication-efficient training algorithms have reduced this bottleneck by using compression techniques.
In this paper, we investigate the fundamental trade-off between the number of bits needed to encode compressed vectors and the compression error.
arXiv Detail & Related papers (2020-10-07T07:58:59Z)
- PowerGossip: Practical Low-Rank Communication Compression in Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers.
Inspired by PowerSGD for centralized deep learning, this algorithm uses power iteration steps to maximize the information transferred per bit.
arXiv Detail & Related papers (2020-08-04T09:14:52Z)
- Linear Convergent Decentralized Optimization with Compression [50.44269451541387]
Existing decentralized algorithms with compression mainly focus on compressing DGD-type algorithms.
Motivated by primal-dual algorithms, this paper proposes the first LinEAr convergent Decentralized algorithm with compression, LEAD.
arXiv Detail & Related papers (2020-07-01T04:35:00Z)
- On Biased Compression for Distributed Learning [55.89300593805943]
We show for the first time that biased compressors can lead to linear convergence rates both in the single node and distributed settings.
We propose several new biased compressors with promising theoretical guarantees and practical performance.
arXiv Detail & Related papers (2020-02-27T19:52:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.