M22: A Communication-Efficient Algorithm for Federated Learning Inspired
by Rate-Distortion
- URL: http://arxiv.org/abs/2301.09269v1
- Date: Mon, 23 Jan 2023 04:40:01 GMT
- Title: M22: A Communication-Efficient Algorithm for Federated Learning Inspired
by Rate-Distortion
- Authors: Yangyi Liu, Stefano Rini, Sadaf Salehkalaibar, Jun Chen
- Abstract summary: In federated learning, model updates must be compressed so as to minimize the loss in accuracy resulting from a communication constraint.
This paper proposes the ``$M$-magnitude weighted $L_2$ distortion + 2 degrees of freedom'' (M22) algorithm, a rate-distortion inspired approach to gradient compression.
- Score: 19.862336286338564
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In federated learning (FL), the communication constraint between the remote
learners and the Parameter Server (PS) is a crucial bottleneck. For this
reason, model updates must be compressed so as to minimize the loss in accuracy
resulting from the communication constraint. This paper proposes the ``\emph{${\bf
M}$-magnitude weighted $L_{\bf 2}$ distortion + $\bf 2$ degrees of freedom}''
(M22) algorithm, a rate-distortion inspired approach to gradient compression
for federated training of deep neural networks (DNNs). In particular, we
propose a family of distortion measures between the original gradient and its
reconstruction, which we refer to as the ``$M$-magnitude weighted $L_2$'' distortion,
and we assume that gradient updates follow an i.i.d. distribution --
generalized normal or Weibull, which have two degrees of freedom. In both the
distortion measure and the gradient distribution, there is one free parameter that
can be fitted as a function of the iteration number. Given a choice of gradient
distribution and distortion measure, we design the quantizer minimizing the
expected distortion in gradient reconstruction. To measure the gradient
compression performance under a communication constraint, we define the
\emph{per-bit accuracy} as the optimal improvement in accuracy that one bit of
communication brings to the centralized model over the training period. Using
this performance measure, we systematically benchmark the choice of gradient
distribution and distortion measure. We provide substantial insights on the
role of these choices and argue that significant performance improvements can
be attained using such a rate-distortion inspired compressor.
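As a concrete illustration of the pipeline described above, the following is a minimal sketch of a magnitude-weighted $L_2$ distortion and a Lloyd-style scalar quantizer fitted to a generalized normal gradient model. The exponent M, the sample sizes, and all function names are illustrative assumptions; this is not the authors' M22 implementation.
```python
"""Sketch: magnitude-weighted L2 distortion and a Lloyd-style quantizer
fitted to a generalized normal gradient model.  Illustrative only: the
exponent M, sample sizes, and function names are assumptions, not the
paper's M22 implementation."""
import numpy as np
from scipy import stats

def m_weighted_l2(g, g_hat, M=1.0):
    """Average of |g|^M * (g - g_hat)^2 over all gradient entries."""
    return np.mean(np.abs(g) ** M * (g - g_hat) ** 2)

def fit_gennorm(grad):
    """Fit a zero-mean generalized normal (shape + scale = two degrees of freedom)."""
    beta, _, scale = stats.gennorm.fit(grad, floc=0.0)
    return beta, scale

def design_quantizer(model_samples, n_levels, M=1.0, iters=50):
    """Weighted Lloyd iterations: nearest-centroid cells, centroids are
    |g|^M-weighted cell means, minimizing the expected weighted distortion."""
    g = model_samples
    w = np.abs(g) ** M
    centroids = np.quantile(g, np.linspace(0.05, 0.95, n_levels))  # init
    for _ in range(iters):
        cells = np.argmin(np.abs(g[:, None] - centroids[None, :]), axis=1)
        for k in range(n_levels):
            mask = cells == k
            if mask.any():
                centroids[k] = np.sum(w[mask] * g[mask]) / np.sum(w[mask])
    return np.sort(centroids)

def quantize(grad, centroids):
    idx = np.argmin(np.abs(grad[:, None] - centroids[None, :]), axis=1)
    return centroids[idx], idx   # reconstruction and symbols to transmit

# toy usage: heavy-tailed synthetic "gradient"
rng = np.random.default_rng(0)
grad = rng.standard_t(df=3, size=50_000) * 1e-2
beta, scale = fit_gennorm(grad)
model_samples = stats.gennorm.rvs(beta, scale=scale, size=100_000, random_state=rng)
centroids = design_quantizer(model_samples, n_levels=16, M=1.0)
g_hat, symbols = quantize(grad, centroids)
print("weighted distortion:", m_weighted_l2(grad, g_hat, M=1.0))
```
Note that the centroid update is the |g|^M-weighted cell mean, which minimizes the expected weighted squared error within each cell, while the cell assignment stays nearest-neighbour because the weight depends only on the original gradient value.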
Related papers
- Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance [55.01966743652196]
We propose a novel algorithm for distributed stochastic gradient descent (SGD) with compressed gradient communication in the parameter-server framework.
Our gradient compression technique, named flattened one-bit gradient descent (FO-SGD), relies on two simple algorithmic ideas.
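For orientation, here is a minimal sketch of a generic dithered one-bit gradient compressor with an unbiased reconstruction; it is not claimed to reproduce FO-SGD's two algorithmic ideas (such as its flattening step), which the summary does not spell out.
```python
"""Sketch: a generic dithered one-bit gradient compressor in the
parameter-server setting.  An unbiased sign-based baseline for illustration
only, not the FO-SGD scheme from the cited paper."""
import numpy as np

rng = np.random.default_rng(1)

def compress_one_bit(grad):
    """Dithered sign quantization: with s >= max|g| and u ~ U(-s, s),
    E[s * sign(g + u)] = g, so the reconstruction below is unbiased."""
    s = np.max(np.abs(grad))
    u = rng.uniform(-s, s, size=grad.shape)
    return np.sign(grad + u).astype(np.int8), s   # 1 bit/entry plus one scalar

def decompress_one_bit(signs, s):
    return s * signs.astype(np.float64)

# check unbiasedness on a toy gradient
g = rng.normal(size=10_000) * 0.01
est = np.mean([decompress_one_bit(*compress_one_bit(g)) for _ in range(200)], axis=0)
print("mean abs bias:", np.mean(np.abs(est - g)))
```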
arXiv Detail & Related papers (2024-05-17T21:17:27Z) - Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}(\ln(T) / T^{1 - \frac{1}{\alpha}})$.
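As a rough illustration of the server-side adaptive step (ignoring the over-the-air aggregation and channel modelling that the cited paper analyses), a toy federated AdaGrad loop might look like the following; the quadratic client objectives and all names are assumptions.
```python
"""Sketch: a federated AdaGrad-style update in which the server applies an
adaptive step to the average of client gradients.  Over-the-air aggregation
and channel effects are omitted; the toy objectives are assumptions."""
import numpy as np

rng = np.random.default_rng(2)
dim, n_clients = 10, 5
# each client holds a quadratic loss f_k(w) = 0.5 * ||w - c_k||^2
client_targets = rng.normal(size=(n_clients, dim))

w = np.zeros(dim)
accum = np.zeros(dim)           # running sum of squared gradients
lr, eps = 0.5, 1e-8

for t in range(200):
    grads = w[None, :] - client_targets          # local gradients at the global model
    g_avg = grads.mean(axis=0)                   # (ideal, noiseless) aggregation
    accum += g_avg ** 2                          # AdaGrad accumulator
    w -= lr * g_avg / (np.sqrt(accum) + eps)     # server-side adaptive step

print("distance to optimum:", np.linalg.norm(w - client_targets.mean(axis=0)))
```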
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
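A minimal sketch of the activity-perturbation idea on a single linear layer with squared loss is shown below; the directional derivative is written in closed form for this toy loss (in general it would come from a forward-mode JVP), and all sizes and names are illustrative.
```python
"""Sketch: an activity-perturbation forward-gradient estimator on a linear
layer with squared loss.  No backward pass is used; the directional
derivative is closed-form for this toy loss."""
import numpy as np

rng = np.random.default_rng(3)
d_in, d_out = 20, 5
W = rng.normal(size=(d_out, d_in)) * 0.1
x = rng.normal(size=d_in)
y = rng.normal(size=d_out)

a = W @ x                          # activations of the layer
true_grad_W = np.outer(a - y, x)   # exact gradient of 0.5 * ||a - y||^2

def forward_grad_activity(n_samples=1):
    """Perturb activations (not weights): sample u in activation space,
    estimate grad wrt activations as (dL/da . u) u, then map to weights."""
    est = np.zeros_like(W)
    for _ in range(n_samples):
        u = rng.normal(size=d_out)          # perturbation in activation space
        dir_deriv = (a - y) @ u             # = JVP of the loss along u
        g_a = dir_deriv * u                 # unbiased estimate of dL/da
        est += np.outer(g_a, x)             # chain rule through a = W x
    return est / n_samples

est = forward_grad_activity(n_samples=500)
print("relative error:", np.linalg.norm(est - true_grad_W) / np.linalg.norm(true_grad_W))
```
Here the perturbation lives in the 5-dimensional activation space rather than the 100-dimensional weight space, which is the intuition behind the variance reduction mentioned above.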
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - Lossy Gradient Compression: How Much Accuracy Can One Bit Buy? [17.907068248604755]
We propose a class of distortion measures for the design of quantizers for the compression of model updates.
In this paper, we take a rate-distortion approach to answer this question for the distributed training of a deep neural network (DNN).
arXiv Detail & Related papers (2022-02-06T16:29:00Z) - Error-Correcting Neural Networks for Two-Dimensional Curvature
Computation in the Level-Set Method [0.0]
We present an error-neural-modeling-based strategy for approximating two-dimensional curvature in the level-set method.
Our main contribution is a redesigned hybrid solver that relies on numerical schemes to enable machine-learning operations on demand.
arXiv Detail & Related papers (2022-01-22T05:14:40Z) - Optimizing the Communication-Accuracy Trade-off in Federated Learning
with Rate-Distortion Theory [1.5771347525430772]
A significant bottleneck in federated learning is the network communication cost of sending model updates from client devices to the central server.
Our method encodes quantized updates with an appropriate universal code, taking into account their empirical distribution.
Because quantization introduces error, we select quantization levels by optimizing for the desired trade-off in average total gradient and distortion.
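A minimal sketch of this quantize-then-entropy-code idea, with the empirical symbol entropy standing in for the universal code's rate, could look as follows; the step sizes and names are assumptions, not the cited method.
```python
"""Sketch: uniform quantization of a model update plus a bit-cost estimate
from the empirical distribution of the quantized symbols, standing in for a
universal entropy code.  Step sizes and names are illustrative assumptions."""
import numpy as np

def quantize_uniform(update, step):
    symbols = np.round(update / step).astype(np.int64)
    return symbols, symbols * step            # transmitted symbols, reconstruction

def empirical_bits_per_entry(symbols):
    """Entropy of the empirical symbol distribution: the (idealised) rate a
    code tuned to that distribution would approach."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(4)
update = rng.laplace(scale=1e-2, size=100_000)   # toy heavy-tailed model update
for step in (1e-2, 5e-3, 1e-3):
    symbols, recon = quantize_uniform(update, step)
    rate = empirical_bits_per_entry(symbols)
    mse = np.mean((update - recon) ** 2)
    print(f"step={step:.0e}  bits/entry={rate:.2f}  distortion={mse:.2e}")
```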
arXiv Detail & Related papers (2022-01-07T20:17:33Z) - Communication-Efficient Federated Learning via Quantized Compressed
Sensing [82.10695943017907]
The presented framework consists of gradient compression for wireless devices and gradient reconstruction for a parameter server.
Thanks to gradient sparsification and quantization, our strategy can achieve a higher compression ratio than one-bit gradient compression.
We demonstrate that the framework achieves almost identical performance with the case that performs no compression.
arXiv Detail & Related papers (2021-11-30T02:13:54Z) - A Cramér Distance perspective on Non-crossing Quantile Regression in
Distributional Reinforcement Learning [2.28438857884398]
Quantile-based methods like QR-DQN project arbitrary distributions onto a parametric subset of staircase distributions.
Monotonicity constraints on the quantiles have been shown to improve the performance of QR-DQN for uncertainty-based exploration strategies.
We propose a novel non-crossing neural architecture that allows good training performance using a novel algorithm to compute the Cramér distance.
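One simple way to enforce non-crossing quantiles is to parametrize them as a base value plus cumulative softplus increments, sketched below; this illustrates the monotonicity constraint only and is not the cited architecture or its Cramér-distance training algorithm.
```python
"""Sketch: a monotone quantile parametrization that cannot cross, built from
a base value plus cumulative softplus increments.  Illustrative only."""
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)       # numerically stable log(1 + e^z)

def non_crossing_quantiles(params):
    """params[0] is the lowest quantile value; softplus of the remaining
    params gives strictly positive gaps, so outputs are non-decreasing."""
    base, raw_gaps = params[0], params[1:]
    return base + np.concatenate(([0.0], np.cumsum(softplus(raw_gaps))))

rng = np.random.default_rng(5)
raw = rng.normal(size=32)             # e.g. the raw head outputs of a network
q = non_crossing_quantiles(raw)
assert np.all(np.diff(q) >= 0), "quantiles never cross by construction"
print(q[:5])
```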
arXiv Detail & Related papers (2021-10-01T17:00:25Z) - Large Scale Private Learning via Low-rank Reparametrization [77.38947817228656]
We propose a reparametrization scheme to address the challenges of applying differentially private SGD on large neural networks.
We are the first to apply differential privacy to the BERT model, achieving an average accuracy of $83.9\%$ on four downstream tasks.
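A minimal sketch of private updates on a low-rank reparametrization $W = LR$ of a linear layer is given below; rank, clipping norm, noise level, and names are assumptions, and this is not the cited paper's exact reparametrized gradient perturbation.
```python
"""Sketch: differentially-private updates on a low-rank reparametrization
W = L @ R of a linear layer, so per-example clipping and Gaussian noise act
on the small factors rather than on the full weight matrix."""
import numpy as np

rng = np.random.default_rng(6)
d_out, d_in, rank = 16, 64, 4
L = rng.normal(size=(d_out, rank)) * 0.1
R = rng.normal(size=(rank, d_in)) * 0.1
clip, sigma, lr = 1.0, 0.8, 0.1

X = rng.normal(size=(32, d_in))                  # one batch of examples
Y = rng.normal(size=(32, d_out))

# per-example gradients of 0.5 * ||W x - y||^2 wrt the factors (W = L @ R)
gL_sum, gR_sum = np.zeros_like(L), np.zeros_like(R)
for x, y in zip(X, Y):
    err = L @ (R @ x) - y                        # residual for this example
    gW = np.outer(err, x)                        # grad wrt the full W for this example
    gL, gR = gW @ R.T, L.T @ gW                  # chain rule to the factors
    norm = np.sqrt(np.sum(gL**2) + np.sum(gR**2))
    scale = min(1.0, clip / (norm + 1e-12))      # per-example clipping
    gL_sum += gL * scale
    gR_sum += gR * scale

# add Gaussian noise calibrated to the clip norm, then average and step
gL_noisy = (gL_sum + sigma * clip * rng.normal(size=L.shape)) / len(X)
gR_noisy = (gR_sum + sigma * clip * rng.normal(size=R.shape)) / len(X)
L -= lr * gL_noisy
R -= lr * gR_noisy
```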
arXiv Detail & Related papers (2021-06-17T10:14:43Z) - Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems where one variable is subject to a sparsity constraint.
arXiv Detail & Related papers (2020-06-16T13:41:54Z)