Preserving Near-Optimal Gradient Sparsification Cost for Scalable
Distributed Deep Learning
- URL: http://arxiv.org/abs/2402.13781v1
- Date: Wed, 21 Feb 2024 13:00:44 GMT
- Title: Preserving Near-Optimal Gradient Sparsification Cost for Scalable
Distributed Deep Learning
- Authors: Daegun Yoon, Sangyoon Oh
- Abstract summary: Gradient sparsification is a potential optimization approach to reduce the communication volume without significant loss of model fidelity.
Existing gradient sparsification methods have low scalability owing to inefficient design of their algorithms.
We propose a novel gradient sparsification scheme called ExDyna to address these challenges.
In experiments, ExDyna outperformed state-of-the-art sparsifiers in terms of training speed and sparsification performance.
- Score: 0.32634122554914
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Communication overhead is a major obstacle to scaling distributed training
systems. Gradient sparsification is a potential optimization approach to reduce
the communication volume without significant loss of model fidelity. However,
existing gradient sparsification methods have low scalability owing to
inefficient design of their algorithms, which raises the communication overhead
significantly. In particular, gradient build-up and inadequate sparsity control
methods degrade the sparsification performance considerably. Moreover,
communication traffic increases drastically owing to workload imbalance of
gradient selection between workers.
To address these challenges, we propose a novel gradient sparsification
scheme called ExDyna. In ExDyna, the gradient tensor of the model comprises
fine-grained blocks, and contiguous blocks are grouped into non-overlapping
partitions. Each worker selects gradients in its exclusively allocated
partition so that gradient build-up never occurs. To balance the workload of
gradient selection between workers, ExDyna adjusts the topology of partitions
by comparing the workloads of adjacent partitions. In addition, ExDyna supports
online threshold scaling, which estimates the accurate threshold of gradient
selection on-the-fly. Accordingly, ExDyna can satisfy the user-required
sparsity level during a training period regardless of models and datasets.
Therefore, ExDyna can enhance the scalability of distributed training systems
by preserving near-optimal gradient sparsification cost. In experiments, ExDyna
outperformed state-of-the-art sparsifiers in terms of training speed and
sparsification performance while achieving high accuracy.
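To make the mechanism described in the abstract concrete, the following is a minimal, illustrative sketch of the two ideas that can be read off directly: each worker selects gradients only inside its exclusively allocated, non-overlapping partition (so gradient build-up cannot occur), and the selection threshold is scaled online so that the achieved density tracks the user-required sparsity. This is not the authors' implementation; the scaling rule, the clamping constants, and all names are assumptions for illustration, and the partition-topology adjustment for workload balancing is omitted.

```python
import numpy as np

def select_in_partition(grad, part, threshold):
    """A worker selects gradients only inside its exclusive partition, so no
    index can be picked by two workers (i.e., no gradient build-up)."""
    local = grad[part]
    idx = np.flatnonzero(np.abs(local) >= threshold) + part.start
    return idx, grad[idx]

def scale_threshold(threshold, observed_density, target_density):
    """Assumed online threshold-scaling rule (not the paper's exact formula):
    multiply the threshold by a damped, clamped ratio of observed to target
    density so the achieved sparsity drifts toward the requested one."""
    ratio = max(observed_density, 1e-6) / target_density
    return threshold * float(np.clip(np.sqrt(ratio), 0.5, 2.0))

# Toy run: 4 workers, a flat 40k-element gradient per step, 1% target density.
rng = np.random.default_rng(0)
num_workers, size, target_density = 4, 40_000, 0.01
bounds = np.linspace(0, size, num_workers + 1, dtype=int)  # non-overlapping partitions
threshold = 1.0

for step in range(8):
    grad = rng.standard_normal(size).astype(np.float32)    # stand-in for a real gradient
    selected = 0
    for w in range(num_workers):
        part = slice(int(bounds[w]), int(bounds[w + 1]))
        idx, vals = select_in_partition(grad, part, threshold)
        selected += idx.size    # in a real run, (idx, vals) is what gets communicated
    density = selected / size
    threshold = scale_threshold(threshold, density, target_density)
    print(f"step {step}: density={density:.4f}  next threshold={threshold:.2f}")
```

In a real system, only the selected indices and values inside each worker's partition would be communicated, which is where the bandwidth saving comes from.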
Related papers
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, using a minimal number of late pre-trained layers can alleviate the peak memory demand.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- MiCRO: Near-Zero Cost Gradient Sparsification for Scaling and Accelerating Distributed DNN Training [0.32634122554914]
Gradient sparsification is a technique for scaling and accelerating distributed deep neural network (DNN) training.
Existing sparsifiers have poor scalability because of the high computational cost of gradient selection.
We propose a novel gradient sparsification method called MiCRO.
In our experiments, MiCRO outperformed state-of-the-art sparsifiers with an outstanding convergence rate.
arXiv Detail & Related papers (2023-10-02T08:15:35Z)
- GIFD: A Generative Gradient Inversion Method with Feature Domain Optimization [52.55628139825667]
Federated Learning (FL) has emerged as a promising distributed machine learning framework to preserve clients' privacy.
Recent studies find that an attacker can invert the shared gradients of an FL system and recover sensitive data by leveraging pre-trained generative adversarial networks (GANs) as prior knowledge.
We propose Gradient Inversion over Feature Domains (GIFD), which disassembles the GAN model and searches the feature domains of the intermediate layers.
arXiv Detail & Related papers (2023-08-09T04:34:21Z)
- DEFT: Exploiting Gradient Norm Difference between Model Layers for Scalable Gradient Sparsification [0.6091702876917281]
Gradient sparsification is a widely adopted solution for reducing the excessive communication traffic in distributed deep learning.
We propose a novel gradient sparsification scheme, DEFT, that partitions the gradient selection task into sub-tasks and distributes them to workers.
We show that DEFT achieves a significant improvement in training performance, in terms of gradient selection speed, over existing sparsifiers.
arXiv Detail & Related papers (2023-07-07T10:29:25Z)
- End-to-End Diffusion Latent Optimization Improves Classifier Guidance [81.27364542975235]
Direct Optimization of Diffusion Latents (DOODL) is a novel guidance method.
It enables plug-and-play guidance by optimizing diffusion latents.
It outperforms one-step classifier guidance on computational and human evaluation metrics.
arXiv Detail & Related papers (2023-03-23T22:43:52Z) - Adaptive Top-K in SGD for Communication-Efficient Distributed Learning [14.867068493072885]
This paper proposes a novel adaptive Top-K in SGD framework that enables an adaptive degree of sparsification for each gradient descent step to optimize the convergence performance.
Numerical results on the MNIST and CIFAR-10 datasets demonstrate that the proposed adaptive Top-K algorithm in SGD achieves a significantly better convergence rate compared to state-of-the-art methods; an illustrative Top-K sketch follows this entry.
arXiv Detail & Related papers (2022-10-24T18:33:35Z)
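Since the adaptive Top-K entry above describes the idea only at summary level, here is a minimal, hypothetical sketch of Top-K gradient sparsification with a per-step adaptive K. The geometric annealing schedule and all names are illustrative assumptions, not the paper's actual rule (which adapts the sparsification degree to optimize convergence performance).

```python
import numpy as np

def topk_sparsify(grad, k):
    """Keep only the k largest-magnitude entries of a flat gradient; a worker
    would communicate just these (index, value) pairs."""
    k = max(1, min(k, grad.size))
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

def adaptive_k(step, total_steps, size, start_ratio=0.1, end_ratio=0.001):
    """Hypothetical schedule: keep more gradients early and anneal K
    geometrically toward a much sparser selection later in training."""
    frac = step / max(total_steps - 1, 1)
    return int(size * start_ratio * (end_ratio / start_ratio) ** frac)

rng = np.random.default_rng(1)
size, total_steps = 10_000, 5
for step in range(total_steps):
    grad = rng.standard_normal(size)    # stand-in for the step's gradient
    k = adaptive_k(step, total_steps, size)
    idx, vals = topk_sparsify(grad, k)
    print(f"step {step}: k={k} ({k / size:.2%} of entries sent)")
```

In practice the unselected entries are typically accumulated locally (error feedback) and added back in later steps, which the sketch omits.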
- Wyner-Ziv Gradient Compression for Federated Learning [4.619828919345114]
Gradient compression is an effective method to reduce communication load by transmitting compressed gradients.
This paper proposes a practical gradient compression scheme for federated learning, which uses historical gradients to compress gradients.
We also implement our gradient quantization method on a real dataset, and its performance is better than that of previous schemes.
arXiv Detail & Related papers (2021-11-16T07:55:43Z)
- Efficient Distributed Auto-Differentiation [22.192220404846267]
Gradient-based algorithms for training large deep neural networks (DNNs) are communication-heavy.
We introduce a surprisingly simple statistic for training distributed DNNs that is more communication-friendly than the gradient.
The process provides the flexibility of averaging gradients during backpropagation, enabling novel flexible training schemas.
arXiv Detail & Related papers (2021-02-18T21:46:27Z)
- Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with the others and updates the parameters using the averaged gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance; a minimal sketch of this synchronous averaging pattern follows this entry.
arXiv Detail & Related papers (2020-09-19T17:28:11Z)
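For context on the baseline that the sparsifiers above compress, here is a minimal single-process simulation of the synchronous data-parallel SGD pattern described in the entry above: each simulated worker computes a gradient on its own mini-batch, the gradients are averaged (an all-reduce in a real system), and every worker applies the same update. The toy least-squares model, worker count, and learning rate are assumptions for illustration.

```python
import numpy as np

# Toy least-squares model shared by all simulated workers.
rng = np.random.default_rng(2)
num_workers, dim, lr = 4, 16, 0.1
w = np.zeros(dim)                      # replicated model parameters
true_w = rng.standard_normal(dim)      # ground truth the workers try to recover

def local_gradient(w, rng):
    """One worker's gradient on its own mini-batch (mean squared error)."""
    X = rng.standard_normal((32, dim))
    y = X @ true_w
    return 2.0 * X.T @ (X @ w - y) / len(y)

for step in range(50):
    # Each worker computes a gradient on its local shard of data ...
    grads = [local_gradient(w, rng) for _ in range(num_workers)]
    # ... shares it (an all-reduce in a real system), and everyone averages.
    avg = np.mean(grads, axis=0)
    w -= lr * avg                      # identical update keeps all replicas in sync

print("parameter error after training:", float(np.linalg.norm(w - true_w)))
```

A sparsifier such as those surveyed above would replace the dense gradient exchange with only the top-k indices and values of each worker's gradient.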
- Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy [71.25689267025244]
We show how the transition is controlled by the relationship between the scale and how accurately we minimize the training loss.
Our results indicate that some limit behaviors of gradient descent only kick in at ridiculous training accuracies.
arXiv Detail & Related papers (2020-07-13T23:49:53Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)