Fundamental Limits of Communication Efficiency for Model Aggregation in
Distributed Learning: A Rate-Distortion Approach
- URL: http://arxiv.org/abs/2206.13984v1
- Date: Tue, 28 Jun 2022 13:10:40 GMT
- Title: Fundamental Limits of Communication Efficiency for Model Aggregation in
Distributed Learning: A Rate-Distortion Approach
- Authors: Naifu Zhang, Meixia Tao, Jia Wang and Fan Xu
- Abstract summary: We study the limit of communication cost of model aggregation in distributed learning from a rate-distortion perspective.
It is found that the communication gain by exploiting the correlation between worker nodes is significant for SignSGD.
- Score: 54.311495894129585
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the main focuses in distributed learning is communication efficiency,
since model aggregation at each round of training can consist of millions to
billions of parameters. Several model compression methods, such as gradient
quantization and sparsification, have been proposed to improve the
communication efficiency of model aggregation. However, the
information-theoretic minimum communication cost for a given distortion of
gradient estimators is still unknown. In this paper, we study the fundamental
limit of communication cost of model aggregation in distributed learning from a
rate-distortion perspective. By formulating the model aggregation as a vector
Gaussian CEO problem, we derive the rate region bound and sum-rate-distortion
function for the model aggregation problem, which reveals the minimum
communication rate at a particular gradient distortion upper bound. We also
analyze the communication cost at each iteration and total communication cost
based on the sum-rate-distortion function with the gradient statistics of
real-world datasets. It is found that the communication gain by exploiting the
correlation between worker nodes is significant for SignSGD, and a high
distortion of gradient estimator can achieve low total communication cost in
gradient compression.
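To make the rate-distortion quantities concrete, below is a minimal numerical sketch of the classical symmetric scalar quadratic Gaussian CEO sum-rate-distortion function, which the paper's vector Gaussian CEO formulation generalizes; the values sigma_x2, sigma_n2, and L are illustrative assumptions, not the gradient statistics analyzed in the paper.

    import numpy as np

    def ceo_sum_rate(D, sigma_x2=1.0, sigma_n2=0.5, L=10):
        """Sum-rate-distortion function (bits per parameter) of the symmetric
        scalar quadratic Gaussian CEO problem: L workers observe X + noise,
        X ~ N(0, sigma_x2), noise ~ N(0, sigma_n2) i.i.d., target MSE D on X."""
        d_min = 1.0 / (1.0 / sigma_x2 + L / sigma_n2)   # smallest reachable MSE
        if D >= sigma_x2:
            return 0.0                                   # coarser than the prior: no bits needed
        if D <= d_min:
            return float("inf")                          # unreachable distortion
        gap = 1.0 - (sigma_n2 / L) * (1.0 / D - 1.0 / sigma_x2)
        return 0.5 * np.log2(sigma_x2 / D) - (L / 2.0) * np.log2(gap)

    for D in (0.8, 0.5, 0.3, 0.2):
        print(f"D = {D:.2f} -> R_sum = {ceo_sum_rate(D):.2f} bits/parameter")

Evaluating such a function at the distortion the optimizer can tolerate is the kind of per-iteration communication-cost analysis the abstract describes, here shown only for the scalar special case.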
Related papers
- Communication-Efficient Federated Learning through Adaptive Weight
Clustering and Server-Side Distillation [10.541541376305245]
Federated Learning (FL) is a promising technique for the collaborative training of deep neural networks across multiple devices.
FL is hindered by excessive communication costs due to repeated server-client communication during training.
We propose FedCompress, a novel approach that combines dynamic weight clustering and server-side knowledge distillation.
arXiv Detail & Related papers (2024-01-25T14:49:15Z) - Compressed and Sparse Models for Non-Convex Decentralized Learning [6.14375469212514]
Frequent model communication is a significant bottleneck to the efficiency of decentralized machine learning.
We present Malcom-PSGD, a novel decentralized ML algorithm that combines model sparsification with gradient compression.
Our method reduces communication costs by approximately 75% compared to the state-of-the-art.
arXiv Detail & Related papers (2023-11-09T21:55:53Z) - Over-the-Air Federated Learning and Optimization [52.5188988624998]
We focus on federated learning (FL) via over-the-air computation (AirComp).
We describe the convergence of AirComp-based FedAvg (AirFedAvg) algorithms under both convex and non-convex settings.
For different types of local updates that can be transmitted by edge devices (i.e., model, gradient, model difference), we reveal that transmitting them over the air in AirFedAvg may cause an aggregation error.
In addition, we consider more practical signal processing schemes to improve the communication efficiency and extend the convergence analysis to different forms of model aggregation error caused by these signal processing schemes.
arXiv Detail & Related papers (2023-10-16T05:49:28Z) - Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST).
IST is a recently proposed and highly effective technique for alleviating communication and memory bottlenecks in distributed training.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z) - Adaptive Top-K in SGD for Communication-Efficient Distributed Learning [14.867068493072885]
This paper proposes a novel adaptive Top-K in SGD framework that enables an adaptive degree of sparsification for each gradient descent step to optimize the convergence performance.
Numerical results on the MNIST and CIFAR-10 datasets demonstrate that the proposed adaptive Top-K algorithm in SGD achieves a significantly better convergence rate compared to state-of-the-art methods.
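For reference, here is a minimal sketch of Top-K gradient sparsification with a per-step budget; the k_schedule below is a placeholder heuristic, not the adaptive policy proposed in the paper.

    import torch

    def topk_sparsify(grad: torch.Tensor, k: int):
        """Keep the k largest-magnitude entries of the gradient; return the
        sparse tensor, the kept indices, and the residual (often carried to
        the next step as error feedback)."""
        flat = grad.flatten()
        k = max(1, min(k, flat.numel()))
        _, idx = torch.topk(flat.abs(), k)
        sparse = torch.zeros_like(flat)
        sparse[idx] = flat[idx]
        return sparse.view_as(grad), idx, (flat - sparse).view_as(grad)

    def k_schedule(step: int, dim: int, k_max_frac=0.01, k_min_frac=0.001):
        """Placeholder schedule that anneals k over training; the paper instead
        chooses the degree of sparsification adaptively for each step."""
        frac = max(k_min_frac, k_max_frac / (1.0 + 0.01 * step))
        return max(1, int(frac * dim))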
arXiv Detail & Related papers (2022-10-24T18:33:35Z) - Quantized Adaptive Subgradient Algorithms and Their Applications [39.103587572626026]
We propose quantized composite mirror descent adaptive subgradient (QCMD adagrad) and quantized regularized dual average adaptive subgradient (QRDA adagrad) for distributed training.
A quantized gradient-based adaptive learning rate matrix is constructed to achieve a balance between communication costs, accuracy, and model sparsity.
arXiv Detail & Related papers (2022-08-11T04:04:03Z) - Data-heterogeneity-aware Mixing for Decentralized Learning [63.83913592085953]
We characterize the dependence of convergence on the relationship between the mixing weights of the graph and the data heterogeneity across nodes.
We propose a metric that quantifies the ability of a graph to mix the current gradients.
Motivated by our analysis, we propose an approach that periodically and efficiently optimizes the metric.
arXiv Detail & Related papers (2022-04-13T15:54:35Z) - Adaptive Quantization of Model Updates for Communication-Efficient
Federated Learning [75.45968495410047]
Communication of model updates between client nodes and the central aggregating server is a major bottleneck in federated learning.
Gradient quantization is an effective way of reducing the number of bits required to communicate each model update.
We propose an adaptive quantization strategy called AdaFL that aims to achieve communication efficiency as well as a low error floor.
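As background for such schemes, a minimal sketch of unbiased stochastic uniform quantization of a model update (QSGD-style) is given below; the round-by-round bit-width adaptation that AdaFL performs is not reproduced here, and num_bits is an assumed parameter.

    import torch

    def stochastic_quantize(update: torch.Tensor, num_bits: int = 4):
        """Unbiased stochastic uniform quantization of a model update to
        2**num_bits levels (per-tensor max scaling); returns the dequantized
        update the server would reconstruct."""
        levels = 2 ** num_bits - 1
        scale = update.abs().max().clamp_min(1e-12)
        x = (update / scale + 1.0) / 2.0 * levels      # map to [0, levels]
        lower = x.floor()
        q = lower + torch.bernoulli(x - lower)         # stochastic rounding
        return (q / levels * 2.0 - 1.0) * scale        # unbiased reconstruction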
arXiv Detail & Related papers (2021-02-08T19:14:21Z) - CosSGD: Nonlinear Quantization for Communication-efficient Federated
Learning [62.65937719264881]
Federated learning facilitates learning across clients without transferring local data on these clients to a central server.
We propose a nonlinear quantization for compressed gradient descent, which can be easily utilized in federated learning.
Our system significantly reduces the communication cost by up to three orders of magnitude, while maintaining convergence and accuracy of the training process.
arXiv Detail & Related papers (2020-12-15T12:20:28Z) - rTop-k: A Statistical Estimation Approach to Distributed SGD [5.197307534263253]
We propose a simple statistical estimation model for gradients that captures their sparsity, and derive the statistically optimal communication scheme under this model.
We show through extensive experiments on both image and language domains with the CIFAR-10, ImageNet, and Penn Treebank datasets that the resulting skewed application of top-k and random-k sparsification consistently and significantly outperforms either method applied alone.
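Below is a minimal sketch of one such skewed combination, assuming the scheme first restricts to the r*k largest-magnitude coordinates and then samples k of them uniformly at random; the parameter r and any debiasing details are illustrative assumptions rather than the paper's exact specification.

    import torch

    def rtopk_sparsify(grad: torch.Tensor, k: int, r: int = 4):
        """Restrict to the r*k largest-magnitude coordinates, then transmit k
        of them chosen uniformly at random (a skewed mix of top-k and
        random-k sparsification)."""
        flat = grad.flatten()
        rk = min(r * k, flat.numel())
        _, top_idx = torch.topk(flat.abs(), rk)             # top-(r*k) preselection
        chosen = top_idx[torch.randperm(rk)[:min(k, rk)]]   # random-k among them
        sparse = torch.zeros_like(flat)
        sparse[chosen] = flat[chosen]
        return sparse.view_as(grad), chosen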
arXiv Detail & Related papers (2020-05-21T16:27:46Z)