MiCRO: Near-Zero Cost Gradient Sparsification for Scaling and
Accelerating Distributed DNN Training
- URL: http://arxiv.org/abs/2310.00967v3
- Date: Tue, 20 Feb 2024 10:37:11 GMT
- Title: MiCRO: Near-Zero Cost Gradient Sparsification for Scaling and
Accelerating Distributed DNN Training
- Authors: Daegun Yoon, Sangyoon Oh
- Abstract summary: gradient sparsification is a technique for scaling and accelerating distributed deep neural network (DNN) training.
Existing sparsifiers have poor scalability because of the high computational cost of gradient selection.
We propose a novel gradient sparsification method called MiCRO.
In our experiments, MiCRO outperformed state-of-the-art sparsifiers with an outstanding convergence rate.
- Score: 0.32634122554914
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gradient sparsification is a communication optimisation technique for scaling
and accelerating distributed deep neural network (DNN) training. It reduces the
increasing communication traffic for gradient aggregation. However, existing
sparsifiers have poor scalability because of the high computational cost of
gradient selection and/or an increase in communication traffic. In particular, the
increase in communication traffic is caused by gradient build-up and an
inappropriate threshold for gradient selection.
To address these challenges, we propose a novel gradient sparsification
method called MiCRO. In MiCRO, the gradient vector is partitioned, and each
partition is assigned to the corresponding worker. Each worker then selects
gradients from its partition, and the aggregated gradients are free from
gradient build-up. Moreover, MiCRO estimates an accurate threshold that keeps the
communication traffic at the user-requested level by minimising the compression
ratio error. MiCRO enables near-zero cost gradient sparsification by solving
existing problems that hinder the scalability and acceleration of distributed
DNN training. In our extensive experiments, MiCRO outperformed state-of-the-art
sparsifiers with an outstanding convergence rate.
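To make the mechanism concrete, here is a minimal sketch of the partitioned selection and threshold adjustment the abstract describes, assuming a PyTorch setting; the function names (partition_for_worker, select_with_threshold, update_threshold), the proportional threshold-update rule, and all constants are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: partitioned gradient selection with a self-adjusting
# threshold, as described in the abstract. Function names and the proportional
# threshold-update rule are assumptions for illustration, not the authors' code.
import torch

def partition_for_worker(grad: torch.Tensor, worker_id: int, num_workers: int) -> torch.Tensor:
    """Give each worker a disjoint slice of the flattened gradient, so selections
    made by different workers can never overlap (no gradient build-up)."""
    flat = grad.flatten()
    chunk = (flat.numel() + num_workers - 1) // num_workers
    return flat[worker_id * chunk : (worker_id + 1) * chunk]

def select_with_threshold(partition: torch.Tensor, threshold: float):
    """Select the gradients in this partition whose magnitude exceeds the threshold."""
    indices = (partition.abs() > threshold).nonzero(as_tuple=True)[0]
    return indices, partition[indices]

def update_threshold(threshold: float, num_selected: int, partition_size: int,
                     target_density: float, gain: float = 0.5) -> float:
    """Nudge the threshold so the realised density tracks the user-requested one,
    i.e. drive the compression ratio error towards zero (toy proportional rule)."""
    actual_density = num_selected / max(partition_size, 1)
    return threshold * (1.0 + gain * (actual_density - target_density) / target_density)

# Toy usage: 4 workers, aiming to keep roughly 1% of the gradient per step.
grad = torch.randn(1_000_000)
threshold, target_density = 1e-3, 0.01
for worker_id in range(4):
    part = partition_for_worker(grad, worker_id, num_workers=4)
    idx, vals = select_with_threshold(part, threshold)   # (index, value) pairs to send
    threshold = update_threshold(threshold, idx.numel(), part.numel(), target_density)
```

In a data-parallel run, each worker would communicate only its selected (index, value) pairs, so both the per-worker selection cost and the aggregated traffic stay bounded as the number of workers grows.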
Related papers
- Preserving Near-Optimal Gradient Sparsification Cost for Scalable
Distributed Deep Learning [0.32634122554914]
Gradient sparsification is a potential optimization approach to reduce the communication volume without significant loss of model fidelity.
Existing gradient sparsification methods have low scalability owing to the inefficient design of their algorithms.
We propose a novel gradient sparsification scheme called ExDyna to address these challenges.
In experiments, ExDyna outperformed state-of-the-art sparsifiers in terms of training speed and sparsification performance.
arXiv Detail & Related papers (2024-02-21T13:00:44Z) - RS-DGC: Exploring Neighborhood Statistics for Dynamic Gradient
Compression on Remote Sensing Image Interpretation [23.649838489244917]
Gradient sparsification has been validated as an effective gradient compression (GC) technique for reducing communication costs.
We propose a simple yet effective dynamic gradient compression scheme leveraging a neighborhood statistics indicator for RS image interpretation, RS-DGC.
We achieve an accuracy improvement of 0.51% with more than 50 times communication compression on the NWPU-RESISC45 dataset.
arXiv Detail & Related papers (2023-12-29T09:24:26Z) - GIFD: A Generative Gradient Inversion Method with Feature Domain
Optimization [52.55628139825667]
Federated Learning (FL) has emerged as a promising distributed machine learning framework to preserve clients' privacy.
Recent studies find that an attacker can invert the shared gradients and recover sensitive data against an FL system by leveraging pre-trained generative adversarial networks (GAN) as prior knowledge.
We propose Gradient Inversion over Feature Domains (GIFD), which disassembles the GAN model and searches the feature domains of the intermediate layers.
arXiv Detail & Related papers (2023-08-09T04:34:21Z) - DEFT: Exploiting Gradient Norm Difference between Model Layers for
Scalable Gradient Sparsification [0.6091702876917281]
Gradient sparsification is a widely adopted solution for reducing the excessive communication traffic in distributed deep learning.
We propose a novel gradient sparsification scheme, DEFT, that partitions the gradient selection task into subtasks and distributes them to workers.
We show that DEFT shows a significant improvement in training performance in terms of speed in gradient selection over existing sparsifiers.
arXiv Detail & Related papers (2023-07-07T10:29:25Z) - Magnitude Matters: Fixing SIGNSGD Through Magnitude-Aware Sparsification
in the Presence of Data Heterogeneity [60.791736094073]
Communication overhead has become one of the major bottlenecks in the distributed training of deep neural networks.
We propose a magnitude-driven sparsification scheme, which addresses the non-convergence issue of SIGNSGD.
The proposed scheme is validated through experiments on Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets.
arXiv Detail & Related papers (2023-02-19T17:42:35Z) - Adaptive Top-K in SGD for Communication-Efficient Distributed Learning [14.867068493072885]
This paper proposes a novel adaptive Top-K in SGD framework that enables an adaptive degree of sparsification for each gradient descent step to optimize the convergence performance.
Numerical results on the MNIST and CIFAR-10 datasets demonstrate that the proposed adaptive Top-K algorithm in SGD achieves a significantly better convergence rate compared to state-of-the-art methods; a plain Top-K baseline is sketched after this list.
arXiv Detail & Related papers (2022-10-24T18:33:35Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - Fundamental Limits of Communication Efficiency for Model Aggregation in
Distributed Learning: A Rate-Distortion Approach [54.311495894129585]
We study the limit of communication cost of model aggregation in distributed learning from a rate-distortion perspective.
It is found that the communication gain by exploiting the correlation between worker nodes is significant for SignSGD.
arXiv Detail & Related papers (2022-06-28T13:10:40Z) - Gradient Correction beyond Gradient Descent [63.33439072360198]
Gradient correction is apparently the most crucial aspect for the training of a neural network.
We introduce a framework (GCGD) to perform gradient correction.
Experiment results show that our gradient correction framework can effectively improve the gradient quality to reduce training epochs by roughly 20% and also improve the network performance.
arXiv Detail & Related papers (2022-03-16T01:42:25Z) - Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z)
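For reference, below is a minimal sketch of the plain Top-K sparsification primitive that several of the sparsifiers listed above refine; it is written against PyTorch and the helper names are hypothetical. It deliberately omits the per-step adaptation of k that the adaptive Top-K paper proposes.

```python
# Illustrative sketch only: the plain Top-K sparsification primitive that the
# sparsifiers listed above build on. It does not reproduce any paper's adaptive
# choice of k; the density is fixed by the caller.
import torch

def topk_sparsify(grad: torch.Tensor, density: float):
    """Keep only the k largest-magnitude entries, with k = density * numel."""
    flat = grad.flatten()
    k = max(1, int(density * flat.numel()))
    _, indices = torch.topk(flat.abs(), k)
    return indices, flat[indices]                      # what a worker would communicate

def densify(indices: torch.Tensor, values: torch.Tensor, numel: int) -> torch.Tensor:
    """Rebuild a dense gradient from the sparse (index, value) pairs after aggregation."""
    dense = torch.zeros(numel)
    dense[indices] = values
    return dense

grad = torch.randn(1_000_000)
idx, vals = topk_sparsify(grad, density=0.01)          # send ~1% of the entries
restored = densify(idx, vals, grad.numel())
```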