DNN gradient lossless compression: Can GenNorm be the answer?
- URL: http://arxiv.org/abs/2111.07599v1
- Date: Mon, 15 Nov 2021 08:33:10 GMT
- Title: DNN gradient lossless compression: Can GenNorm be the answer?
- Authors: Zhong-Jing Chen, Eduin E. Hernandez, Yu-Chih Huang, Stefano Rini
- Abstract summary: Gradient compression is relevant in many distributed Deep Neural Network (DNN) training scenarios.
For some networks of practical interest, the gradient entries can be well modelled as having a generalized normal (GenNorm) distribution.
- Score: 17.37160669785566
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, the problem of optimal gradient lossless compression in Deep
Neural Network (DNN) training is considered. Gradient compression is relevant
in many distributed DNN training scenarios, including the recently popular
federated learning (FL) scenario in which each remote user is connected to
the parameter server (PS) through a noiseless but rate-limited channel. In
distributed DNN training, if the underlying gradient distribution is available,
classical lossless compression approaches can be used to reduce the number of
bits required for communicating the gradient entries. Mean field analysis has
suggested that gradient updates can be considered as independent random
variables, while the Laplace approximation can be used to argue that the gradient has a
distribution approximating the normal (Norm) distribution in some regimes. In
this paper we argue that, for some networks of practical interest, the gradient
entries can be well modelled as having a generalized normal (GenNorm)
distribution. We provide numerical evaluations to validate the hypothesis that
GenNorm modelling provides a more accurate prediction of the DNN gradient tail
distribution. Additionally, this modeling choice provides concrete improvement
in terms of lossless compression of the gradients when applying classical
fixed-to-variable lossless coding algorithms, such as Huffman coding, to the
quantized gradient updates. This latter result provides an effective
compression strategy with low memory and computational complexity that has
great practical relevance in distributed DNN training scenarios.
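As a rough illustration of the pipeline the abstract describes (fit a parametric model to the gradient entries, quantize them, then apply a fixed-to-variable code), the sketch below fits GenNorm and Norm models with SciPy and reports the Huffman code length of the uniformly quantized entries. This is a minimal sketch, not the authors' implementation: the gradient tensor is simulated, and the shape parameter, scale, and quantization step are arbitrary placeholder values.

```python
# Minimal sketch, not the authors' code: compare GenNorm vs. Norm fits on
# (simulated) gradient entries, then Huffman-code a uniformly quantized copy.
import heapq
from collections import Counter

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in for a flattened DNN gradient; a real gradient tensor would go here.
grad = stats.gennorm.rvs(beta=1.2, scale=1e-2, size=100_000, random_state=rng)

# Fit both candidate models (location fixed at zero) and compare likelihoods;
# a higher GenNorm log-likelihood suggests it captures the tails better.
beta_hat, loc_g, scale_g = stats.gennorm.fit(grad, floc=0.0)
loc_n, scale_n = stats.norm.fit(grad)
print("GenNorm shape beta:", round(beta_hat, 3))
print("log-lik GenNorm:", stats.gennorm.logpdf(grad, beta_hat, loc_g, scale_g).sum())
print("log-lik Norm   :", stats.norm.logpdf(grad, loc_n, scale_n).sum())

# Uniform scalar quantization of the gradient entries (step is illustrative).
step = 1e-3
symbols = np.round(grad / step).astype(int).tolist()

def huffman_code_lengths(counts):
    """Return {symbol: Huffman code length in bits} for the given counts."""
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, next_id, merged))
        next_id += 1
    return heap[0][2]

counts = Counter(symbols)
lengths = huffman_code_lengths(counts)
bits = sum(counts[s] * lengths[s] for s in counts)
print("Huffman rate:", round(bits / len(symbols), 3), "bits per quantized entry")
```

Replacing the simulated draw with an actual flattened gradient from a training step would give the empirical comparison the paper is about; the resulting Huffman rate can then be contrasted with the entropy of the quantized symbols under the fitted GenNorm and Norm models.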
Related papers
- Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance [55.01966743652196]
We propose a novel algorithm for distributed stochastic gradient descent (SGD) with compressed gradient communication in the parameter-server framework.
Our gradient compression technique, named flattened one-bit gradient descent (FO-SGD), relies on two simple algorithmic ideas.
arXiv Detail & Related papers (2024-05-17T21:17:27Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
- A Low-Complexity Approach to Rate-Distortion Optimized Variable Bit-Rate Compression for Split DNN Computing [5.3221129103999125]
Split computing has emerged as a recent paradigm for implementation of DNN-based AI workloads.
We present an approach that addresses the challenge of optimizing the rate-accuracy-complexity trade-off.
Our approach is remarkably lightweight, both during training and inference, highly effective, and achieves excellent rate-distortion performance.
arXiv Detail & Related papers (2022-08-24T15:02:11Z)
- Communication-Efficient Federated Learning via Quantized Compressed Sensing [82.10695943017907]
The presented framework consists of gradient compression for wireless devices and gradient reconstruction for a parameter server.
Thanks to gradient sparsification and quantization, our strategy can achieve a higher compression ratio than one-bit gradient compression.
We demonstrate that the framework achieves almost identical performance to the case with no compression.
arXiv Detail & Related papers (2021-11-30T02:13:54Z)
- Wyner-Ziv Gradient Compression for Federated Learning [4.619828919345114]
Gradient compression is an effective method to reduce communication load by transmitting compressed gradients.
This paper proposes a practical gradient compression scheme for federated learning, which uses historical gradients to compress gradients.
We also implement our gradient quantization method on a real dataset, and its performance is better than that of previous schemes.
arXiv Detail & Related papers (2021-11-16T07:55:43Z)
- A Biased Graph Neural Network Sampler with Near-Optimal Regret [57.70126763759996]
Graph neural networks (GNN) have emerged as a vehicle for applying deep network architectures to graph and relational data.
In this paper, we build upon existing work and treat GNN neighbor sampling as a multi-armed bandit problem.
We introduce a newly-designed reward function that introduces some degree of bias designed to reduce variance and avoid unstable, possibly-unbounded payouts.
arXiv Detail & Related papers (2021-03-01T15:55:58Z)
- Efficient Distributed Auto-Differentiation [22.192220404846267]
Gradient-based algorithms for training large deep neural networks (DNNs) are communication-heavy.
We introduce a surprisingly simple statistic for training distributed DNNs that is more communication-friendly than the gradient.
The process provides the flexibility of averaging gradients during backpropagation, enabling novel flexible training schemas.
arXiv Detail & Related papers (2021-02-18T21:46:27Z)
- An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys similar threshold-estimation quality to deep gradient compression (DGC).
Our evaluation shows SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Topk, and DGC compressors, respectively.
arXiv Detail & Related papers (2021-01-26T13:06:00Z)
- Quantizing data for distributed learning [24.46948464551684]
We consider machine learning applications that train a model by leveraging data over a network, where communication constraints can create a performance bottleneck.
A number of recent approaches propose to overcome this bottleneck through compression of updates, but as models become larger, so does the size of the dataset.
In this paper, we propose an approach that quantizes data instead of gradient updates and can support such learning applications.
arXiv Detail & Related papers (2020-12-14T19:54:41Z)
- Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.