DEFT: Exploiting Gradient Norm Difference between Model Layers for
Scalable Gradient Sparsification
- URL: http://arxiv.org/abs/2307.03500v3
- Date: Thu, 13 Jul 2023 11:30:59 GMT
- Title: DEFT: Exploiting Gradient Norm Difference between Model Layers for
Scalable Gradient Sparsification
- Authors: Daegun Yoon, Sangyoon Oh
- Abstract summary: Gradient sparsification is a widely adopted solution for reducing the excessive communication traffic in distributed deep learning.
We propose a novel gradient sparsification scheme, DEFT, that partitions the gradient selection task into subtasks and distributes them to workers.
We show that DEFT significantly improves training performance, in terms of gradient-selection speed, over existing sparsifiers.
- Score: 0.6091702876917281
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gradient sparsification is a widely adopted solution for reducing the
excessive communication traffic in distributed deep learning. However, most
existing gradient sparsifiers have relatively poor scalability because of the
considerable computational cost of gradient selection and/or the increased
communication traffic caused by gradient build-up. To address these challenges,
we propose a novel gradient sparsification scheme, DEFT, that partitions the
gradient selection task into subtasks and distributes them to workers. DEFT
thus differs from existing sparsifiers, in which every worker selects from
among all gradients. Consequently, the computational cost of selection
decreases as the number of workers increases. Moreover, gradient build-up is
eliminated because DEFT has workers select gradients in partitions that do not
intersect between workers. Therefore, even as the number of workers increases,
the communication traffic stays at the level the user specifies. To avoid
degrading the quality of gradient selection, DEFT selects more gradients in
layers with a larger gradient norm than in other layers. Because every layer
imposes a different computational load, DEFT allocates layers to workers with a
bin-packing algorithm to keep the gradient-selection load balanced across
workers. In our empirical evaluation, DEFT achieves a significant improvement
in gradient-selection speed over existing sparsifiers while maintaining high
convergence performance.
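As a rough illustration of the scheme described in the abstract (a minimal sketch, not the authors' implementation), the snippet below assumes flattened per-layer gradients are available on every worker: per-layer selection budgets are made proportional to layer gradient norms, layers are assigned to workers by a greedy load-balancing heuristic that stands in for the paper's bin-packing step, and each worker runs top-k only inside its own, non-intersecting set of layers. All function and variable names are illustrative.

```python
import numpy as np

def deft_allocate(layer_grads, num_workers, density=0.01):
    """Illustrative sketch (not the official DEFT implementation).

    layer_grads : list of flattened per-layer gradients (1-D numpy arrays)
    num_workers : number of workers that share the selection work
    density     : fraction of all gradient elements to keep globally
    """
    norms = np.array([np.linalg.norm(g) for g in layer_grads])
    sizes = np.array([g.size for g in layer_grads])
    total_k = max(1, int(density * sizes.sum()))

    # Layers with a larger gradient norm receive a larger selection budget
    # (rounding makes the budgets only approximately sum to total_k;
    # assumes at least one layer has a nonzero gradient norm).
    budgets = np.maximum(1, (total_k * norms / norms.sum()).astype(int))
    budgets = np.minimum(budgets, sizes)

    # Greedy load balancing: a simple stand-in for the paper's bin-packing
    # step, using layer size as a proxy for per-layer selection cost.
    assignment = {w: [] for w in range(num_workers)}
    loads = np.zeros(num_workers)
    for layer_id in np.argsort(-sizes):          # heaviest layers first
        w = int(np.argmin(loads))                # least-loaded worker so far
        assignment[w].append(int(layer_id))
        loads[w] += sizes[layer_id]
    return assignment, budgets

def select_local(layer_grads, my_layers, budgets):
    """Each worker runs top-k only on its own layers, so the index sets
    selected by different workers never intersect (no gradient build-up)."""
    selected = {}
    for layer_id in my_layers:
        g, k = layer_grads[layer_id], int(budgets[layer_id])
        idx = np.argpartition(np.abs(g), -k)[-k:]  # top-k by magnitude
        selected[layer_id] = (idx, g[idx])
    return selected
```

The greedy assignment and the size-based cost model are only placeholders; the paper's actual bin-packing algorithm and load estimate may differ.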
Related papers
- Preserving Near-Optimal Gradient Sparsification Cost for Scalable
Distributed Deep Learning [0.32634122554914]
Gradient sparsification is a potential optimization approach for reducing the communication volume without significant loss of model fidelity.
Existing gradient sparsification methods have low scalability owing to the inefficient design of their algorithms.
We propose a novel gradient sparsification scheme called ExDyna to address these challenges.
In experiments, ExDyna outperformed state-of-the-art sparsifiers in terms of training speed and sparsification performance.
arXiv Detail & Related papers (2024-02-21T13:00:44Z)
- How to guess a gradient [68.98681202222664]
We show that gradients are more structured than previously thought.
Exploiting this structure can significantly improve gradient-free optimization schemes.
We highlight new challenges in overcoming the large gap between optimizing with exact gradients and guessing the gradients.
arXiv Detail & Related papers (2023-12-07T21:40:44Z)
- MiCRO: Near-Zero Cost Gradient Sparsification for Scaling and
Accelerating Distributed DNN Training [0.32634122554914]
Gradient sparsification is a technique for scaling and accelerating distributed deep neural network (DNN) training.
Existing sparsifiers have poor scalability because of the high computational cost of gradient selection.
We propose a novel gradient sparsification method called MiCRO.
In our experiments, MiCRO outperformed state-of-the-art sparsifiers with an outstanding convergence rate.
arXiv Detail & Related papers (2023-10-02T08:15:35Z)
- GIFD: A Generative Gradient Inversion Method with Feature Domain
Optimization [52.55628139825667]
Federated Learning (FL) has emerged as a promising distributed machine learning framework to preserve clients' privacy.
Recent studies find that an attacker can invert the shared gradients and recover sensitive data against an FL system by leveraging pre-trained generative adversarial networks (GAN) as prior knowledge.
We propose Gradient Inversion over Feature Domains (GIFD), which disassembles the GAN model and searches the feature domains of the intermediate layers.
arXiv Detail & Related papers (2023-08-09T04:34:21Z)
- Nested Gradient Codes for Straggler Mitigation in Distributed Machine
Learning [21.319460501659666]
Gradient codes are designed to tolerate a fixed number of stragglers.
We propose a gradient coding scheme that can tolerate a flexible number of stragglers.
By proper task scheduling and small additional signaling, our scheme adapts the load of the workers to the actual number of stragglers.
arXiv Detail & Related papers (2022-12-16T16:56:51Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
- Empirical Analysis on Top-k Gradient Sparsification for Distributed Deep
Learning in a Supercomputing Environment [0.6091702876917281]
Gradient sparsification has been proposed to reduce the communication traffic significantly.
Top-k gradient sparsification (Top-k SGD) has a limited ability to speed up overall training performance (a minimal sketch of Top-k selection appears after this list).
We conduct experiments that show the inefficiency of Top-k SGD and provide insight into its low performance.
arXiv Detail & Related papers (2022-09-18T07:42:31Z)
- Layerwise Optimization by Gradient Decomposition for Continual Learning [78.58714373218118]
Deep neural networks achieve state-of-the-art and sometimes super-human performance across various domains.
When learning tasks sequentially, the networks easily forget the knowledge of previous tasks, a phenomenon known as "catastrophic forgetting".
arXiv Detail & Related papers (2021-05-17T01:15:57Z)
- Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z)
- Variance Reduction with Sparse Gradients [82.41780420431205]
Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients.
We introduce a new sparsity operator: the random-top-k operator (a hedged sketch appears after this list).
Our algorithm consistently outperforms SpiderBoost on various tasks including image classification, natural language processing, and sparse matrix factorization.
arXiv Detail & Related papers (2020-01-27T08:23:58Z)
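For context on the Top-k SGD baseline analyzed in the supercomputing-environment entry above, here is a minimal, illustrative numpy sketch of per-worker Top-k sparsification; the function name, default density, and the error-feedback remark are assumptions, not the paper's code.

```python
import numpy as np

def topk_sparsify(grad, density=0.01):
    """Keep only the k largest-magnitude entries of the flattened gradient;
    dropped entries are typically accumulated locally as an error-feedback
    residual in practical systems (assumed convention, not from the paper)."""
    flat = grad.ravel()
    k = max(1, int(density * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of the top-k entries
    return idx, flat[idx]                          # sparse (index, value) pairs
```

Because each worker may pick different indices, aggregating the per-worker (index, value) pairs can grow with the number of workers; this is the gradient build-up that DEFT's non-intersecting partitions are designed to avoid.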
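The random-top-k operator mentioned in the variance-reduction entry is sketched below under an explicit assumption about its definition: keep the k1 largest-magnitude coordinates exactly and add a rescaled uniform sample of k2 of the remaining ones. The exact operator and scaling in the paper may differ, and all names here are illustrative.

```python
import numpy as np

def random_top_k(x, k1, k2, rng=None):
    """Hedged sketch of a random-top-k style operator on a 1-D float vector x:
    the top-k1 coordinates are kept exactly, and k2 of the remaining
    coordinates are sampled uniformly and rescaled so the sampled part
    estimates the dropped mass without bias (assumed definition)."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.zeros_like(x)
    top = np.argpartition(np.abs(x), -k1)[-k1:]    # exact top-k1 coordinates
    out[top] = x[top]
    rest = np.setdiff1d(np.arange(x.size), top)    # everything not in the top-k1
    pick = rng.choice(rest, size=k2, replace=False)
    out[pick] = x[pick] * (rest.size / k2)         # rescale the random sample
    return out
```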
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.