Bidirectional compression in heterogeneous settings for distributed or
federated learning with partial participation: tight convergence guarantees
- URL: http://arxiv.org/abs/2006.14591v4
- Date: Sun, 19 Jun 2022 15:40:37 GMT
- Title: Bidirectional compression in heterogeneous settings for distributed or
federated learning with partial participation: tight convergence guarantees
- Authors: Constantin Philippenko and Aymeric Dieuleveut
- Abstract summary: Artemis is a framework to tackle the problem of learning in a distributed setting with communication constraints and device partial participation.
It improves on existing algorithms that only consider unidirectional compression (to the server), use very strong assumptions on the compression operator, or do not take device partial participation into account.
- Score: 9.31522898261934
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a framework - Artemis - to tackle the problem of learning in a
distributed or federated setting with communication constraints and device
partial participation. Several workers (randomly sampled) perform the
optimization process using a central server to aggregate their computations. To
alleviate the communication cost, Artemis allows compression of the
information sent in both directions (from the workers to the server and
conversely), combined with a memory mechanism. It improves on existing
algorithms that only consider unidirectional compression (to the server), use
very strong assumptions on the compression operator, or do not take device
partial participation into account. We provide fast rates of convergence
(linear up to a threshold) under weak assumptions on the stochastic gradients
(noise variance bounded only at the optimal point) in a non-i.i.d. setting,
highlight the impact of memory for unidirectional and bidirectional
compression, and analyze Polyak-Ruppert averaging. We use convergence in
distribution to obtain a lower bound on the asymptotic variance that
highlights the practical limits of compression. We propose two approaches to
tackle the challenging case of device partial participation and provide
experimental results to demonstrate the validity of our analysis.
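The memory-plus-bidirectional-compression scheme described in the abstract can be sketched as follows. This is a minimal illustration using an unbiased random-sparsification compressor; the function names, the `alpha` memory rate, and the update layout are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def rand_sparsify(v, k, rng):
    """Unbiased rand-k sparsification: keep k random coordinates, rescale by d/k."""
    d = v.size
    out = np.zeros_like(v)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = v[idx] * (d / k)
    return out

def artemis_style_step(w, grads, memories, lr, alpha, k, rng):
    """One step with compression in both directions and a worker-side memory.

    Each worker compresses the difference between its gradient and its memory
    term; the server averages the reconstructed messages and broadcasts a
    compressed model update."""
    msgs = []
    for i, g in enumerate(grads):
        delta = rand_sparsify(g - memories[i], k, rng)  # uplink compression
        msgs.append(memories[i] + delta)                # server-side reconstruction
        memories[i] = memories[i] + alpha * delta       # memory update
    avg = np.mean(msgs, axis=0)
    return w + rand_sparsify(-lr * avg, k, rng)         # downlink compression
```

With `k = d` the compressor is the identity and the step reduces to plain gradient descent on the average objective; with `k < d` both links transmit only `k` coordinates per round.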
Related papers
- Boosting the Performance of Decentralized Federated Learning via Catalyst Acceleration [66.43954501171292]
We introduce Catalyst Acceleration and propose an accelerated Decentralized Federated Learning algorithm called DFedCata.
DFedCata consists of two main components: the Moreau envelope function, which addresses parameter inconsistencies, and Nesterov's extrapolation step, which accelerates the aggregation phase.
Empirically, we demonstrate the advantages of the proposed algorithm in both convergence speed and generalization performance on CIFAR10/100 with various non-iid data distributions.
arXiv Detail & Related papers (2024-10-09T06:17:16Z)
- Differential error feedback for communication-efficient decentralized learning [48.924131251745266]
We propose a new decentralized communication-efficient learning approach that blends differential quantization with error feedback.
We show that the resulting communication-efficient strategy is stable both in terms of mean-square error and average bit rate.
The results establish that, in the small step-size regime and with a finite number of bits, it is possible to attain the performance achievable in the absence of compression.
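The error-feedback idea summarized above can be sketched in a few lines. This is an illustrative sketch with a scaled-sign quantizer standing in for the differential quantization scheme; the function names and the compressor choice are assumptions, not the paper's exact method.

```python
import numpy as np

def sign_quantize(v):
    """1-bit compressor: sign of each coordinate, scaled by the mean magnitude."""
    return np.mean(np.abs(v)) * np.sign(v)

def ef_step(grad, error, lr):
    """Error feedback: quantize the error-corrected update and carry the
    residual (what the compressor lost) into the next round."""
    corrected = lr * grad + error
    msg = sign_quantize(corrected)
    return msg, corrected - msg
```

The residual accumulator is what lets a biased 1-bit compressor still drive the iterates toward the uncompressed solution over time.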
arXiv Detail & Related papers (2024-06-26T15:11:26Z)
- Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance [55.01966743652196]
We propose a novel algorithm for distributed stochastic gradient descent (SGD) with compressed gradient communication in the parameter-server framework.
Our gradient compression technique, named flattened one-bit gradient descent (FO-SGD), relies on two simple algorithmic ideas.
arXiv Detail & Related papers (2024-05-17T21:17:27Z)
- Lower Bounds and Accelerated Algorithms in Distributed Stochastic Optimization with Communication Compression [31.107056382542417]
Communication compression is an essential strategy for alleviating communication overhead.
We propose NEOLITHIC, a nearly optimal algorithm for compression under mild conditions.
arXiv Detail & Related papers (2023-05-12T17:02:43Z)
- Distributed Newton-Type Methods with Communication Compression and Bernoulli Aggregation [11.870393751095083]
We study communication compression and aggregation mechanisms for curvature information.
New 3PC mechanisms, such as adaptive thresholding and Bernoulli aggregation, require reduced communication and occasional Hessian computations.
For all our methods, we derive fast condition-number-independent local linear and/or superlinear convergence rates.
arXiv Detail & Related papers (2022-06-07T21:12:21Z)
- EF-BV: A Unified Theory of Error Feedback and Variance Reduction Mechanisms for Biased and Unbiased Compression in Distributed Optimization [7.691755449724637]
In distributed or federated optimization and learning, communication between the different computing units is often the bottleneck.
There are two classes of compression operators and separate algorithms making use of them.
We propose a new algorithm, recovering DIANA and EF21 as particular cases.
arXiv Detail & Related papers (2022-05-09T10:44:23Z)
- Unified Multivariate Gaussian Mixture for Efficient Neural Image Compression [151.3826781154146]
Modeling latent variables with priors and hyperpriors is an essential problem in variational image compression.
We find inter-correlations and intra-correlations exist when observing latent variables in a vectorized perspective.
Our model has better rate-distortion performance and an impressive $3.18\times$ compression speed up.
arXiv Detail & Related papers (2022-03-21T11:44:17Z)
- Compressing gradients by exploiting temporal correlation in momentum-SGD [17.995905582226463]
We analyze compression methods that exploit temporal correlation in systems with and without error-feedback.
Experiments with the ImageNet dataset demonstrate that our proposed methods offer significant reduction in the rate of communication.
We prove the convergence of SGD under an expected error assumption by establishing a bound for the minimum gradient norm.
arXiv Detail & Related papers (2021-08-17T18:04:06Z)
- A Linearly Convergent Algorithm for Decentralized Optimization: Sending Less Bits for Free! [72.31332210635524]
Decentralized optimization methods enable on-device training of machine learning models without a central coordinator.
We propose a new randomized first-order method which tackles the communication bottleneck by applying randomized compression operators.
We prove that our method can solve the problems without any increase in the number of communications compared to the baseline.
arXiv Detail & Related papers (2020-11-03T13:35:53Z)
- PowerGossip: Practical Low-Rank Communication Compression in Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers.
Inspired by the PowerSGD for centralized deep learning, this algorithm uses power steps to maximize the information transferred per bit.
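The power-step idea can be sketched as a single power-iteration round on a parameter-difference matrix. This is an illustrative rank-1 sketch with assumed names; PowerGossip itself compresses model differences between neighboring workers in a decentralized topology.

```python
import numpy as np

def power_step(M, q):
    """One power-iteration step: returns vectors p, q_new such that
    np.outer(p, q_new) is a rank-1 approximation of M. Only the two
    vectors need to be communicated, not the full matrix."""
    p = M @ q
    p = p / (np.linalg.norm(p) + 1e-12)  # normalize the left factor
    q_new = M.T @ p
    return p, q_new
```

For an exactly rank-1 difference `M = u v^T`, a single step recovers `M` exactly; for higher-rank differences, reusing `q` across rounds (warm starting) improves the approximation, which is what makes repeated power steps transfer more information per bit.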
arXiv Detail & Related papers (2020-08-04T09:14:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.