MiCRO: Near-Zero Cost Gradient Sparsification for Scaling and
Accelerating Distributed DNN Training
- URL: http://arxiv.org/abs/2310.00967v3
- Date: Tue, 20 Feb 2024 10:37:11 GMT
- Title: MiCRO: Near-Zero Cost Gradient Sparsification for Scaling and
Accelerating Distributed DNN Training
- Authors: Daegun Yoon, Sangyoon Oh
- Abstract summary: gradient sparsification is a technique for scaling and accelerating distributed deep neural network (DNN) training.
Existing sparsifiers have poor scalability because of the high computational cost of gradient selection.
We propose a novel gradient sparsification method called MiCRO.
In our experiments, MiCRO outperformed state-of-the-art sparsifiers with an outstanding convergence rate.
- Score: 0.32634122554914
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gradient sparsification is a communication optimisation technique for scaling
and accelerating distributed deep neural network (DNN) training. It reduces the
increasing communication traffic for gradient aggregation. However, existing
sparsifiers have poor scalability because of the high computational cost of
gradient selection and/or an increase in communication traffic. In particular, the
increase in communication traffic is caused by gradient build-up and an
inappropriate threshold for gradient selection.
To address these challenges, we propose a novel gradient sparsification
method called MiCRO. In MiCRO, the gradient vector is partitioned, and each
partition is assigned to the corresponding worker. Each worker then selects
gradients from its partition, and the aggregated gradients are free from
gradient build-up. Moreover, MiCRO estimates an accurate threshold that keeps the
communication traffic at the user-requested level by minimising the compression
ratio error. MiCRO enables near-zero cost gradient sparsification by solving
existing problems that hinder the scalability and acceleration of distributed
DNN training. In our extensive experiments, MiCRO outperformed state-of-the-art
sparsifiers with an outstanding convergence rate.
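To make the mechanism concrete, here is a minimal sketch of the partitioned selection and threshold adjustment the abstract describes, assuming a PyTorch setting; the function names (partition_for_worker, select_with_threshold, update_threshold), the proportional threshold-update rule, and all constants are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: partitioned gradient selection with a self-adjusting
# threshold, as described in the abstract. Function names and the proportional
# threshold-update rule are assumptions for illustration, not the authors' code.
import torch

def partition_for_worker(grad: torch.Tensor, worker_id: int, num_workers: int) -> torch.Tensor:
    """Give each worker a disjoint slice of the flattened gradient, so selections
    made by different workers can never overlap (no gradient build-up)."""
    flat = grad.flatten()
    chunk = (flat.numel() + num_workers - 1) // num_workers
    return flat[worker_id * chunk : (worker_id + 1) * chunk]

def select_with_threshold(partition: torch.Tensor, threshold: float):
    """Select the gradients in this partition whose magnitude exceeds the threshold."""
    indices = (partition.abs() > threshold).nonzero(as_tuple=True)[0]
    return indices, partition[indices]

def update_threshold(threshold: float, num_selected: int, partition_size: int,
                     target_density: float, gain: float = 0.5) -> float:
    """Nudge the threshold so the realised density tracks the user-requested one,
    i.e. drive the compression ratio error towards zero (toy proportional rule)."""
    actual_density = num_selected / max(partition_size, 1)
    return threshold * (1.0 + gain * (actual_density - target_density) / target_density)

# Toy usage: 4 workers, aiming to keep roughly 1% of the gradient per step.
grad = torch.randn(1_000_000)
threshold, target_density = 1e-3, 0.01
for worker_id in range(4):
    part = partition_for_worker(grad, worker_id, num_workers=4)
    idx, vals = select_with_threshold(part, threshold)   # (index, value) pairs to send
    threshold = update_threshold(threshold, idx.numel(), part.numel(), target_density)
```

In a data-parallel run, each worker would communicate only its selected (index, value) pairs, so both the per-worker selection cost and the aggregated traffic stay bounded as the number of workers grows.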
Related papers
- Preserving Near-Optimal Gradient Sparsification Cost for Scalable
Distributed Deep Learning [0.32634122554914]
Gradient sparsification is a potential optimization approach to reduce the communication volume without significant loss of model fidelity.
Existing gradient sparsification methods have low scalability owing to the inefficient design of their algorithms.
We propose a novel gradient sparsification scheme called ExDyna to address these challenges.
In experiments, ExDyna outperformed state-of-the-art sparsifiers in terms of training speed and sparsification performance.
arXiv Detail & Related papers (2024-02-21T13:00:44Z) - RS-DGC: Exploring Neighborhood Statistics for Dynamic Gradient
Compression on Remote Sensing Image Interpretation [23.649838489244917]
Gradient sparsification has been validated as an effective gradient compression (GC) technique for reducing communication costs.
We propose a simple yet effective dynamic gradient compression scheme leveraging a neighborhood statistics indicator for RS image interpretation, RS-DGC.
We achieve an accuracy improvement of 0.51% with more than 50 times communication compression on the NWPU-RESISC45 dataset.
arXiv Detail & Related papers (2023-12-29T09:24:26Z) - GIFD: A Generative Gradient Inversion Method with Feature Domain
Optimization [52.55628139825667]
Federated Learning (FL) has emerged as a promising distributed machine learning framework to preserve clients' privacy.
Recent studies find that an attacker can invert the shared gradients and recover sensitive data against an FL system by leveraging pre-trained generative adversarial networks (GAN) as prior knowledge.
We propose Gradient Inversion over Feature Domains (GIFD), which disassembles the GAN model and searches the feature domains of the intermediate layers.
arXiv Detail & Related papers (2023-08-09T04:34:21Z) - DEFT: Exploiting Gradient Norm Difference between Model Layers for
Scalable Gradient Sparsification [0.6091702876917281]
Gradient sparsification is a widely adopted solution for reducing the excessive communication traffic in distributed deep learning.
We propose a novel gradient sparsification scheme, DEFT, that partitions the gradient selection task into subtasks and distributes them to workers.
We show that DEFT shows a significant improvement in training performance in terms of speed in gradient selection over existing sparsifiers.
arXiv Detail & Related papers (2023-07-07T10:29:25Z) - Magnitude Matters: Fixing SIGNSGD Through Magnitude-Aware Sparsification
in the Presence of Data Heterogeneity [60.791736094073]
Communication overhead has become one of the major bottlenecks in the distributed training of deep neural networks.
We propose a magnitude-driven sparsification scheme, which addresses the non-convergence issue of SIGNSGD.
The proposed scheme is validated through experiments on Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets.
arXiv Detail & Related papers (2023-02-19T17:42:35Z) - Adaptive Top-K in SGD for Communication-Efficient Distributed Learning [14.867068493072885]
This paper proposes a novel adaptive Top-K in SGD framework that enables an adaptive degree of sparsification for each gradient descent step to optimize the convergence performance.
Numerical results on the MNIST and CIFAR-10 datasets demonstrate that the proposed adaptive Top-K algorithm in SGD achieves a significantly better convergence rate compared to state-of-the-art methods; a plain Top-K baseline is sketched after this list.
arXiv Detail & Related papers (2022-10-24T18:33:35Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - Fundamental Limits of Communication Efficiency for Model Aggregation in
Distributed Learning: A Rate-Distortion Approach [54.311495894129585]
We study the limit of communication cost of model aggregation in distributed learning from a rate-distortion perspective.
It is found that the communication gain by exploiting the correlation between worker nodes is significant for SignSGD.
arXiv Detail & Related papers (2022-06-28T13:10:40Z) - Gradient Correction beyond Gradient Descent [63.33439072360198]
Gradient correction is apparently the most crucial aspect for the training of a neural network.
We introduce a framework (GCGD) to perform gradient correction.
Experiment results show that our gradient correction framework can effectively improve the gradient quality to reduce training epochs by roughly 20% and also improve the network performance.
arXiv Detail & Related papers (2022-03-16T01:42:25Z) - Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z)
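For reference, below is a minimal sketch of the plain Top-K sparsification primitive that several of the sparsifiers listed above refine; it is written against PyTorch and the helper names are hypothetical. It deliberately omits the per-step adaptation of k that the adaptive Top-K paper proposes.

```python
# Illustrative sketch only: the plain Top-K sparsification primitive that the
# sparsifiers listed above build on. It does not reproduce any paper's adaptive
# choice of k; the density is fixed by the caller.
import torch

def topk_sparsify(grad: torch.Tensor, density: float):
    """Keep only the k largest-magnitude entries, with k = density * numel."""
    flat = grad.flatten()
    k = max(1, int(density * flat.numel()))
    _, indices = torch.topk(flat.abs(), k)
    return indices, flat[indices]                      # what a worker would communicate

def densify(indices: torch.Tensor, values: torch.Tensor, numel: int) -> torch.Tensor:
    """Rebuild a dense gradient from the sparse (index, value) pairs after aggregation."""
    dense = torch.zeros(numel)
    dense[indices] = values
    return dense

grad = torch.randn(1_000_000)
idx, vals = topk_sparsify(grad, density=0.01)          # send ~1% of the entries
restored = densify(idx, vals, grad.numel())
```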