Federated Random Reshuffling with Compression and Variance Reduction
- URL: http://arxiv.org/abs/2205.03914v2
- Date: Tue, 10 May 2022 08:40:32 GMT
- Title: Federated Random Reshuffling with Compression and Variance Reduction
- Authors: Grigory Malinovsky, Peter Richt\'arik
- Abstract summary: Random Reshuffling (RR) is an immensely popular method for training supervised machine learning models via empirical risk minimization.
It is embedded and often set as default in standard machine learning software.
We introduce three new algorithms to improve FedRR further: compressed FedRR and two variance-reduced extensions, one taming the variance coming from shuffling and the other taming the variance due to compression.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Random Reshuffling (RR), which is a variant of Stochastic Gradient Descent
(SGD) employing sampling without replacement, is an immensely popular method
for training supervised machine learning models via empirical risk
minimization. Due to its superior practical performance, it is embedded and
often set as default in standard machine learning software. Under the name
FedRR, this method was recently shown to be applicable to federated learning
(Mishchenko et al., 2021), with superior performance when compared to common
baselines such as Local SGD. Inspired by this development, we design three new
algorithms to improve FedRR further: compressed FedRR and two variance reduced
extensions: one for taming the variance coming from shuffling and the other for
taming the variance due to compression. The variance reduction mechanism for
compression allows us to eliminate dependence on the compression parameter, and
applying additional controlled linear perturbations for Random Reshuffling,
introduced by Malinovsky et al. (2021), helps to eliminate variance at the
optimum. We provide the first analysis of compressed local methods under
standard assumptions without bounded gradient assumptions and for heterogeneous
data, overcoming the limitations of the compression operator. We corroborate
our theoretical results with experiments on synthetic and real data sets.
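
To make the mechanism concrete, below is a minimal sketch, not the authors' reference implementation, of the idea the abstract describes: each client runs one epoch of Random Reshuffling (RR) on its local data, the resulting update is sent through a compression operator, and a DIANA-style control iterate is used to tame the variance introduced by compression. Every concrete choice here (Python/NumPy, rand-k sparsification, least-squares client objectives, the step size and other parameters) is an illustrative assumption, not the paper's algorithm or API.

import numpy as np

rng = np.random.default_rng(0)

def rand_k(v, k, rng):
    # Unbiased rand-k sparsification: keep k random coordinates, rescale by d/k.
    d = v.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(v)
    out[idx] = v[idx] * (d / k)
    return out

def rr_epoch(x, A, b, lr, rng):
    # One epoch of Random Reshuffling on a least-squares client:
    # f_i(x) = 0.5 * (a_i^T x - b_i)^2, data visited in a fresh random order.
    for i in rng.permutation(A.shape[0]):
        x = x - lr * A[i] * (A[i] @ x - b[i])
    return x

def fedrr_compressed(clients, x0, lr=0.05, rounds=300, k=2, alpha=None):
    # Compressed FedRR with DIANA-style control iterates h_m (sketch only).
    d = x0.size
    if alpha is None:
        alpha = k / d  # safe choice for rand-k: alpha <= 1/(1 + omega), omega = d/k - 1
    x = x0.copy()
    h = [np.zeros(d) for _ in clients]  # client-side control iterates
    h_avg = np.zeros(d)                 # server keeps their running average
    for _ in range(rounds):
        msgs = []
        for m, (A, b) in enumerate(clients):
            x_local = rr_epoch(x.copy(), A, b, lr, rng)
            g = (x - x_local) / lr          # direction produced by the local RR epoch
            msg = rand_k(g - h[m], k, rng)  # only this compressed difference is transmitted
            h[m] = h[m] + alpha * msg       # client updates its control iterate
            msgs.append(msg)
        mean_msg = np.mean(msgs, axis=0)
        x = x - lr * (h_avg + mean_msg)     # server step with the reconstructed direction
        h_avg = h_avg + alpha * mean_msg    # keep h_avg consistent with the clients
    return x

# Toy heterogeneous problem: each client holds its own least-squares data.
d = 5
clients = [(rng.normal(size=(20, d)), rng.normal(size=20)) for _ in range(4)]
x_out = fedrr_compressed(clients, np.zeros(d))

# Compare against the closed-form minimizer of the average objective.
AtA = sum(A.T @ A for A, _ in clients)
Atb = sum(A.T @ b for A, b in clients)
print("distance to optimum:", np.linalg.norm(x_out - np.linalg.solve(AtA, Atb)))

With the control iterates switched off (alpha = 0) this reduces to plain compressed FedRR, and with compression switched off (k = d) it reduces to FedRR, i.e., averaging the endpoints of the clients' local RR epochs. The residual error coming from shuffling is what the paper's second variance-reduction mechanism (the controlled linear perturbations) targets.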
Related papers
- Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance [55.01966743652196]
We propose a novel algorithm for distributed stochastic gradient descent (SGD) with compressed gradient communication in the parameter-server framework.
Our gradient compression technique, named flattened one-bit stochastic gradient descent (FO-SGD), relies on two simple algorithmic ideas.
arXiv Detail & Related papers (2024-05-17T21:17:27Z) - Compression with Exact Error Distribution for Federated Learning [33.74795273515338]
We present and analyze different aggregation schemes based on layered quantizers achieving exact error distribution.
We provide different methods to leverage the proposed compression schemes to obtain compression-for-free in differential privacy applications.
arXiv Detail & Related papers (2023-10-31T17:48:22Z) - Implicit Compressibility of Overparametrized Neural Networks Trained
with Heavy-Tailed SGD [31.61477313262589]
We consider a one-hidden-layer neural network trained with stochastic gradient descent (SGD).
We show that if we inject additive heavy-tailed noise into the iterates at each iteration, then for any compression rate there exists a level of overparametrization such that the output of the algorithm will be compressible with high probability.
arXiv Detail & Related papers (2023-06-13T20:37:02Z) - Federated Optimization Algorithms with Random Reshuffling and Gradient
Compression [2.7554288121906296]
We provide the first analysis of methods with gradient compression and without-replacement sampling.
We show how to reduce the variance coming from gradient quantization through the use of control iterates.
We outline several settings in which they improve upon existing algorithms.
arXiv Detail & Related papers (2022-06-14T17:36:47Z) - EF-BV: A Unified Theory of Error Feedback and Variance Reduction
Mechanisms for Biased and Unbiased Compression in Distributed Optimization [7.691755449724637]
In distributed or federated optimization and learning, communication between the different computing units is often the bottleneck.
There are two classes of compression operators, unbiased and biased (contractive), with separate algorithms making use of them; a minimal illustration of the two classes is sketched after this list.
We propose a new algorithm, recovering DIANA and EF21 as particular cases.
arXiv Detail & Related papers (2022-05-09T10:44:23Z) - Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z) - Unified Multivariate Gaussian Mixture for Efficient Neural Image
Compression [151.3826781154146]
Modeling latent variables with priors and hyperpriors is an essential problem in variational image compression.
We find inter-correlations and intra-correlations exist when observing latent variables in a vectorized perspective.
Our model has better rate-distortion performance and an impressive $3.18\times$ compression speed up.
arXiv Detail & Related papers (2022-03-21T11:44:17Z) - Distributed Methods with Absolute Compression and Error Compensation [1.52292571922932]
Communication compression is a powerful approach to alleviating this issue.
In this paper, we generalize the analysis of EC-SGD with absolute compression to the arbitrary sampling strategy.
Our rates improve upon the previously known ones in this setting.
arXiv Detail & Related papers (2022-03-04T15:41:14Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - An Efficient Statistical-based Gradient Compression Technique for
Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC).
Our evaluation shows SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Topk, and DGC compressors, respectively.
arXiv Detail & Related papers (2021-01-26T13:06:00Z) - On Biased Compression for Distributed Learning [55.89300593805943]
We show for the first time that biased compressors can lead to linear convergence rates both in the single node and distributed settings.
We propose several new biased compressors with promising theoretical guarantees and practical performance.
arXiv Detail & Related papers (2020-02-27T19:52:24Z)
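
Several entries above (EF-BV and "On Biased Compression for Distributed Learning") hinge on the distinction between unbiased compressors, which satisfy E[C(v)] = v with bounded relative variance, and biased (contractive) compressors, which satisfy E||C(v) - v||^2 <= (1 - delta)||v||^2 for some delta in (0, 1]. The self-contained snippet below illustrates both with the textbook rand-k and top-k sparsifiers; the function names and the empirical check are illustrative assumptions, not code from any of the papers.

import numpy as np

rng = np.random.default_rng(1)

def rand_k(v, k, rng):
    # Unbiased: keep k uniformly random coordinates, rescale by d/k so E[C(v)] = v.
    d = v.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(v)
    out[idx] = v[idx] * (d / k)
    return out

def top_k(v, k):
    # Biased but contractive: keep the k largest-magnitude coordinates, no rescaling.
    idx = np.argsort(np.abs(v))[-k:]
    out = np.zeros_like(v)
    out[idx] = v[idx]
    return out

d, k = 10, 3
v = rng.normal(size=d)

# Unbiasedness of rand-k, checked empirically: the bias shrinks as the number of draws grows.
avg = np.mean([rand_k(v, k, rng) for _ in range(20000)], axis=0)
print("rand-k empirical bias (approaches 0):", np.linalg.norm(avg - v))

# Contraction of top-k: ||top_k(v) - v||^2 <= (1 - k/d) ||v||^2.
err = np.linalg.norm(top_k(v, k) - v) ** 2
print("top-k error ratio:", err / np.linalg.norm(v) ** 2, "<= bound", 1 - k / d)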