Learned Gradient Compression for Distributed Deep Learning
- URL: http://arxiv.org/abs/2103.08870v2
- Date: Wed, 17 Mar 2021 05:55:33 GMT
- Title: Learned Gradient Compression for Distributed Deep Learning
- Authors: Lusine Abrahamyan, Yiming Chen, Giannis Bekoulis and Nikos Deligiannis
- Abstract summary: Training deep neural networks on large datasets containing high-dimensional data requires a large amount of computation.
A solution to this problem is data-parallel distributed training, where the model is replicated across several computational nodes that have access to different chunks of the data.
This approach, however, entails high communication rates and latency because the computed gradients need to be shared among the nodes at every iteration.
- Score: 16.892546958602303
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training deep neural networks on large datasets containing high-dimensional
data requires a large amount of computation. A solution to this problem is
data-parallel distributed training, where the model is replicated across several
computational nodes that have access to different chunks of the data. This
approach, however, entails high communication rates and latency because the
computed gradients need to be shared among the nodes at every iteration. The
problem becomes more pronounced when the nodes communicate over a wireless
network (i.e., due to its limited bandwidth). To
address this problem, various compression methods have been proposed including
sparsification, quantization, and entropy encoding of the gradients. Existing
methods leverage the intra-node information redundancy, that is, they compress
gradients at each node independently. In contrast, we advocate that the
gradients across the nodes are correlated and propose methods to leverage this
inter-node redundancy to improve compression efficiency. Depending on the node
communication protocol (parameter server or ring-allreduce), we propose two
instances of our approach, which we coin Learned Gradient Compression (LGC).
Our methods exploit an autoencoder (trained during the first stages of the
distributed training) to capture the common information that exists in the
gradients of the distributed nodes. We have tested our LGC methods on the image
classification and semantic segmentation tasks using different convolutional
neural networks (ResNet50, ResNet101, PSPNet) and multiple datasets (ImageNet,
Cifar10, CamVid). The ResNet101 model trained for image classification on
Cifar10 achieved an accuracy of 93.57%, only 0.18% lower than the baseline
distributed training with uncompressed gradients.
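To make the idea above concrete, the following is a minimal, hypothetical PyTorch sketch of autoencoder-based gradient compression: an encoder maps a flattened gradient chunk to a short code that a worker would transmit, and a decoder reconstructs the chunk on the receiving side, with the autoencoder fitted on gradients collected during the first training iterations. The chunk and code sizes, the two-layer architecture, and the collected_gradient_chunks buffer are illustrative placeholders; this is not the exact LGC architecture or training procedure from the paper.
```python
import torch
import torch.nn as nn

class GradientAutoencoder(nn.Module):
    """Toy autoencoder for gradient chunks: the encoder output is the compact
    code a worker would transmit; the decoder reconstructs the chunk on the
    receiving side. Layer sizes are illustrative placeholders."""

    def __init__(self, chunk_size=4096, code_size=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(chunk_size, 1024), nn.ReLU(),
            nn.Linear(1024, code_size),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_size, 1024), nn.ReLU(),
            nn.Linear(1024, chunk_size),
        )

    def forward(self, grad_chunk):
        code = self.encoder(grad_chunk)   # what would be sent over the network
        recon = self.decoder(code)        # what the receiver reconstructs
        return code, recon

# Stand-in for gradient chunks gathered during the first stages of training;
# in a real run these would come from the workers' actual backward passes.
collected_gradient_chunks = [torch.randn(32, 4096) for _ in range(8)]

ae = GradientAutoencoder()
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
for grad_chunk in collected_gradient_chunks:
    code, recon = ae(grad_chunk)
    loss = nn.functional.mse_loss(recon, grad_chunk)
    opt.zero_grad()
    loss.backward()
    opt.step()
```
In the paper's setting, the autoencoder is meant to capture the redundancy shared across nodes, so in practice it would be fitted on gradients gathered from all workers rather than from a single one.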
Related papers
- CDFGNN: a Systematic Design of Cache-based Distributed Full-Batch Graph Neural Network Training with Communication Reduction [7.048300785744331]
Graph neural network training is mainly categorized into mini-batch and full-batch training methods.
In the distributed cluster, frequent remote accesses of features and gradients lead to huge communication overhead.
We introduce the cache-based distributed full-batch graph neural network training framework (CDFGNN).
Our results indicate that CDFGNN has great potential in accelerating distributed full-batch GNN training tasks.
arXiv Detail & Related papers (2024-08-01T01:57:09Z) - Distributed Training of Large Graph Neural Networks with Variable Communication Rates [71.7293735221656]
Training Graph Neural Networks (GNNs) on large graphs presents unique challenges due to the large memory and computing requirements.
Distributed GNN training, where the graph is partitioned across multiple machines, is a common approach to training GNNs on large graphs.
We introduce a variable compression scheme for reducing the communication volume in distributed GNN training without compromising the accuracy of the learned model.
arXiv Detail & Related papers (2024-06-25T14:57:38Z) - Dataset Quantization [72.61936019738076]
We present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets.
DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio.
arXiv Detail & Related papers (2023-08-21T07:24:29Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with a communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - Quantization for Distributed Optimization [0.0]
We present a set of all-reduce compatible gradient compression schemes which significantly reduce the communication overhead while maintaining the performance of vanilla SGD.
Our compression methods perform better than the in-built methods currently offered by the deep learning frameworks.
arXiv Detail & Related papers (2021-09-26T05:16:12Z) - An Efficient Statistical-based Gradient Compression Technique for
Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys a threshold estimation quality similar to that of deep gradient compression (DGC).
Our evaluation shows that SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Topk, and DGC compressors, respectively.
arXiv Detail & Related papers (2021-01-26T13:06:00Z) - Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with the others and updates the parameters using the average of all workers' gradients (a minimal sketch of this pattern appears after this list).
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z) - PowerGossip: Practical Low-Rank Communication Compression in
Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers.
Inspired by the PowerSGD algorithm for centralized deep learning, this algorithm uses power iteration steps to maximize the information transferred per bit (see the sketch after this list).
arXiv Detail & Related papers (2020-08-04T09:14:52Z) - Is Network the Bottleneck of Distributed Training? [36.925680383195356]
We take a first-principles approach to measure and analyze the network performance of distributed training.
We find that the network is running at low utilization and that if the network can be fully utilized, distributed training can achieve a scaling factor of close to one.
arXiv Detail & Related papers (2020-06-17T19:00:31Z) - Cross-filter compression for CNN inference acceleration [4.324080238456531]
We propose a new cross-filter compression method that can provide $\sim 32\times$ memory savings and a $122\times$ speed-up in convolution operations.
Our method, based on Binary-Weight and XNOR-Net separately, is evaluated on the CIFAR-10 and ImageNet datasets.
arXiv Detail & Related papers (2020-05-18T19:06:14Z)
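Several entries above (notably Sparse Communication for Training Deep Networks and the Topk baseline in the SIDCo comparison) revolve around the same synchronous data-parallel pattern: each worker computes local gradients, optionally sparsifies them, and the workers then average the result. Below is a minimal sketch of that pattern with torch.distributed and top-k sparsification; the compression ratio is an arbitrary placeholder, and the error feedback, momentum correction, and index/value encoding used by real systems are omitted.
```python
import torch
import torch.distributed as dist

def topk_sparsify(grad, ratio=0.01):
    """Keep only the largest-magnitude entries of the gradient (a common
    sparsification baseline such as Topk); all other entries are zeroed."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, idx = torch.topk(flat.abs(), k)
    sparse = torch.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.view_as(grad)

def synchronize_gradients(model, ratio=0.01):
    """Synchronous data-parallel step: sparsify each local gradient, then
    average it across workers with all-reduce. Assumes the default process
    group has been initialized (e.g. via dist.init_process_group)."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is None:
            continue
        p.grad.data = topk_sparsify(p.grad.data, ratio)
        dist.all_reduce(p.grad.data, op=dist.ReduceOp.SUM)
        p.grad.data /= world_size
```
Note that all-reducing the dense, zeroed-out tensor does not by itself reduce traffic; practical schemes exchange (index, value) pairs or learned codes, which is exactly where the compression methods surveyed here differ.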
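For the PowerGossip entry, the underlying primitive (shared with PowerSGD) is low-rank compression of a gradient or model-difference matrix via power iteration, so that only two thin factors need to be communicated. The following is a minimal single-step, rank-r sketch under assumed shapes, without the warm starting and iteration schedules of the actual algorithms.
```python
import torch

def power_compress(M, rank=4, q=None):
    """One power-iteration step producing thin factors P (m x rank) and
    Q (n x rank) with P @ Q.T approximating M; only P and Q would be sent."""
    if q is None:
        # In the real algorithms, q is warm-started from the previous iteration.
        q = torch.randn(M.shape[1], rank)
    p = M @ q
    p, _ = torch.linalg.qr(p)   # orthonormalize the left factor
    q = M.T @ p
    return p, q

# Stand-in for one layer's gradient or model-difference matrix.
M = torch.randn(256, 128)
P, Q = power_compress(M)
approx = P @ Q.T                # receiver-side low-rank reconstruction
# Communication drops from 256*128 values to (256 + 128) * rank values.
```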
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.