Communication Compression for Distributed Learning without Control Variates
- URL: http://arxiv.org/abs/2412.04538v1
- Date: Thu, 05 Dec 2024 18:46:20 GMT
- Title: Communication Compression for Distributed Learning without Control Variates
- Authors: Tomas Ortega, Chun-Yin Huang, Xiaoxiao Li, Hamid Jafarkhani
- Abstract summary: Compressed Aggregate Feedback (CAFe) is a novel distributed learning framework that allows highly compressible client updates.
We show that CAFe consistently outperforms distributed learning with direct compression and highlight the compressibility of the client updates with CAFe.
- Score: 38.18168130109547
- Abstract: Distributed learning algorithms, such as the ones employed in Federated Learning (FL), require communication compression to reduce the cost of client uploads. The compression methods used in practice are often biased, which requires error feedback to achieve convergence when the compression is aggressive. In turn, error feedback requires client-specific control variates, which directly contradicts privacy-preserving principles and requires stateful clients. In this paper, we propose Compressed Aggregate Feedback (CAFe), a novel distributed learning framework that allows highly compressible client updates by exploiting past aggregated updates, and does not require control variates. We consider Distributed Gradient Descent (DGD) as a representative algorithm and provide a theoretical proof of CAFe's superiority to Distributed Compressed Gradient Descent (DCGD) with biased compression in the non-smooth regime with bounded gradient dissimilarity. Experimental results confirm that CAFe consistently outperforms distributed learning with direct compression and highlight the compressibility of the client updates with CAFe.
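The abstract describes the mechanism only at a high level. The snippet below is a minimal sketch of the stated idea, assuming the server broadcasts the previous aggregated update so each client can compress only the residual against it; the names (`top_k`, `cafe_round`) and every detail are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def top_k(v: np.ndarray, k: int) -> np.ndarray:
    """Biased Top-K compressor: keep only the k largest-magnitude entries."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def cafe_round(client_grads, model, prev_aggregate, lr=0.1, k=100):
    """One CAFe-style round (sketch). The previous aggregate is shared
    knowledge, so clients keep no private control variates; they compress
    only the (typically small) residual against that common reference."""
    residuals = [top_k(-lr * g - prev_aggregate, k) for g in client_grads]
    aggregate = prev_aggregate + np.mean(residuals, axis=0)  # server side
    return model + aggregate, aggregate
```

Because the reference point is the same broadcast value for all clients, the server stores no per-client state, which is consistent with the abstract's claim that control variates are unnecessary.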
Related papers
- Communication-Efficient Distributed Learning with Local Immediate Error Compensation [95.6828475028581]
We propose the Local Immediate Error Compensated SGD (LIEC-SGD) optimization algorithm.
LIEC-SGD improves on prior work in either convergence rate or communication cost; a sketch of the underlying error-feedback pattern follows below.
arXiv Detail & Related papers (2024-02-19T05:59:09Z)
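For contrast with CAFe, here is the classic stateful error-feedback pattern that compensation schemes such as LIEC-SGD build on; this sketch shows only the generic pattern, not LIEC-SGD's specific immediate-compensation schedule.

```python
import numpy as np

class ErrorFeedbackClient:
    """Generic error-feedback compression (sketch). The per-client
    `memory` is exactly the stateful control variate that CAFe avoids."""

    def __init__(self, dim: int, compress):
        self.memory = np.zeros(dim)      # client-local state across rounds
        self.compress = compress         # any biased compressor, e.g. Top-K

    def step(self, grad: np.ndarray) -> np.ndarray:
        corrected = grad + self.memory   # re-inject previously dropped error
        sent = self.compress(corrected)  # lossy, biased compression
        self.memory = corrected - sent   # remember what was lost this round
        return sent                      # message actually transmitted
```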
- Fed-CVLC: Compressing Federated Learning Communications with Variable-Length Codes [54.18186259484828]
In Federated Learning (FL) paradigm, a parameter server (PS) concurrently communicates with distributed participating clients for model collection, update aggregation, and model distribution over multiple rounds.
We show strong evidence that variable-length coding is beneficial for compression in FL.
We present Fed-CVLC (Federated Learning Compression with Variable-Length Codes), which fine-tunes the code length in response to the dynamics of model updates (see the quantization sketch below).
arXiv Detail & Related papers (2024-02-06T07:25:21Z)
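Fed-CVLC's actual codes are not given in the summary above; the sketch below only illustrates the general idea of spending a variable number of bits per round, using plain stochastic uniform quantization, and the bit-selection rule `pick_bits` is a hypothetical stand-in.

```python
import numpy as np

def quantize(v: np.ndarray, bits: int, rng: np.random.Generator) -> np.ndarray:
    """Stochastic uniform quantization to 2**bits levels (a generic
    stand-in; Fed-CVLC's variable-length codes differ)."""
    scale = np.max(np.abs(v)) + 1e-12
    levels = 2**bits - 1
    x = (v / scale + 1.0) / 2.0 * levels           # map to [0, levels]
    low = np.floor(x)
    q = low + (rng.random(v.shape) < (x - low))    # unbiased rounding
    return (q / levels * 2.0 - 1.0) * scale

def pick_bits(update: np.ndarray, prev_update: np.ndarray) -> int:
    """Hypothetical rule: spend more bits while updates still change fast."""
    change = np.linalg.norm(update - prev_update) / (np.linalg.norm(prev_update) + 1e-12)
    return 8 if change > 0.5 else 4 if change > 0.1 else 2
```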
- Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how the simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
- Data-Aware Gradient Compression for FL in Communication-Constrained Mobile Computing [20.70238092277094]
Federated Learning (FL) in mobile environments faces significant communication bottlenecks.
A one-size-fits-all compression approach does not account for the varying data volumes across workers.
We propose assigning varying compression ratios to workers with distinct data distributions and volumes.
arXiv Detail & Related papers (2023-11-13T13:24:09Z)
- Learning Accurate Performance Predictors for Ultrafast Automated Model Compression [86.22294249097203]
We propose an ultrafast automated model compression framework called SeerNet for flexible network deployment.
Our method achieves competitive accuracy-complexity trade-offs with a significant reduction in search cost.
arXiv Detail & Related papers (2023-04-13T10:52:49Z)
- Optimal Rate Adaption in Federated Learning with Compressed Communications [28.16239232265479]
Federated Learning incurs high communication overhead, which can be greatly alleviated by compression for model updates.
The tradeoff between compression and model accuracy in the networked environment remains unclear.
We present a framework that maximizes the final model accuracy by strategically adjusting the compression ratio in each iteration.
arXiv Detail & Related papers (2021-12-13T14:26:15Z)
- Communication-Efficient Federated Learning via Robust Distributed Mean Estimation [16.41391088542669]
Federated learning relies on algorithms such as distributed (mini-batch) SGD, where multiple clients compute their gradients and send them to a central coordinator for averaging and updating the model.
DRIVE is a recent state-of-the-art algorithm that compresses gradients using one bit per coordinate (with some lower-order overhead); a sketch of the one-bit idea follows below.
In this technical report, we generalize DRIVE to support any bandwidth constraint, extend it to heterogeneous client resources, and make it robust to packet loss.
arXiv Detail & Related papers (2021-08-19T17:59:21Z)
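A minimal sketch of the one-bit idea DRIVE refines: the real algorithm applies a structured random rotation before taking signs and derives its scale to minimize error, whereas here a shared random sign flip and a simple mean-magnitude scale stand in, so treat every detail as an assumption.

```python
import numpy as np

def one_bit_compress(v: np.ndarray, seed: int):
    """Sign-based one-bit-per-coordinate compression (sketch). The seed is
    shared with the server, so the pseudo-rotation costs no bandwidth."""
    rng = np.random.default_rng(seed)
    flip = rng.choice([-1.0, 1.0], size=v.shape)  # stand-in for DRIVE's rotation
    rotated = flip * v
    bits = np.where(rotated >= 0, 1.0, -1.0)      # one bit per coordinate
    scale = np.abs(rotated).mean()                # the lower-order overhead
    return bits, scale

def one_bit_decompress(bits: np.ndarray, scale: float, seed: int) -> np.ndarray:
    rng = np.random.default_rng(seed)
    flip = rng.choice([-1.0, 1.0], size=bits.shape)
    return flip * (scale * bits)
```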
- Accordion: Adaptive Gradient Communication via Critical Learning Regime Identification [12.517161466778655]
Distributed model training suffers from communication bottlenecks due to frequent model updates transmitted across compute nodes.
To alleviate these bottlenecks, practitioners use gradient compression techniques like sparsification, quantization, or low-rank updates.
In this work, we show that the performance degradation caused by choosing a high compression ratio is not fundamental.
An adaptive compression strategy can reduce communication while maintaining final test accuracy (see the sketch below).
arXiv Detail & Related papers (2020-10-29T16:41:44Z)
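The summary above does not spell out how critical regimes are detected; the rule below, keyed to rapid change in the gradient norm, is a sketch in the spirit of Accordion with hypothetical thresholds.

```python
def choose_keep_fraction(grad_norms: list, window: int = 5, tol: float = 0.2,
                         dense: float = 0.5, sparse: float = 0.01) -> float:
    """Adaptive compression schedule (sketch): compress lightly while the
    gradient norm changes quickly (a critical regime), aggressively once
    training stabilizes. Returns the fraction of coordinates a Top-K
    compressor should keep this round."""
    if len(grad_norms) <= window:
        return dense                    # be conservative early in training
    prev, cur = grad_norms[-1 - window], grad_norms[-1]
    rel_change = abs(cur - prev) / (abs(prev) + 1e-12)
    return dense if rel_change > tol else sparse
```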
- PowerGossip: Practical Low-Rank Communication Compression in Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers.
Inspired by PowerSGD for centralized deep learning, this algorithm uses power iteration steps to maximize the information transferred per bit (see the sketch below).
arXiv Detail & Related papers (2020-08-04T09:14:52Z)
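A minimal sketch of the power-step idea under the stated description: one power iteration turns a dense weight-difference matrix between neighbors into a rank-1 pair of vectors, and reusing `q` across rounds (warm starting) is how PowerSGD-style methods keep a single step accurate. Function names are illustrative.

```python
import numpy as np

def power_step(delta: np.ndarray, q: np.ndarray):
    """One power-iteration step (sketch). Transmitting p (m floats) and
    q_new (n floats) instead of delta (m*n floats) is the compression."""
    p = delta @ q
    p /= np.linalg.norm(p) + 1e-12     # normalize the left factor
    q_new = delta.T @ p                # scale is absorbed into q_new
    return p, q_new

def reconstruct(p: np.ndarray, q_new: np.ndarray) -> np.ndarray:
    """Rank-1 approximation of the original difference matrix."""
    return np.outer(p, q_new)
```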
- Domain Adaptation Regularization for Spectral Pruning [44.060724281001775]
Domain Adaptation (DA) allows knowledge learned on a labeled source distribution to be transferred to a target distribution, possibly unlabeled.
We show that our method outperforms an existing compression method studied in the DA setting by a large margin for high compression rates.
Although our work is based on one specific compression method, we also outline general guidelines for improving compression in the DA setting.
arXiv Detail & Related papers (2019-12-26T12:38:13Z)