L-GreCo: Layerwise-Adaptive Gradient Compression for Efficient and
Accurate Deep Learning
- URL: http://arxiv.org/abs/2210.17357v2
- Date: Fri, 9 Jun 2023 17:11:26 GMT
- Title: L-GreCo: Layerwise-Adaptive Gradient Compression for Efficient and
Accurate Deep Learning
- Authors: Mohammadreza Alimohammadi, Ilia Markov, Elias Frantar, Dan Alistarh
- Abstract summary: We provide a framework for adapting the degree of compression across the model's layers dynamically during training.
Our framework, called L-GreCo, is based on an adaptive algorithm, which automatically picks the optimal compression parameters for model layers.
- Score: 24.712888488317816
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data-parallel distributed training of deep neural networks (DNN) has gained
very widespread adoption, but can still experience communication bottlenecks.
To address this issue, entire families of compression mechanisms have been
developed, including quantization, sparsification, and low-rank approximation,
some of which are seeing significant practical adoption. Despite this progress,
almost all known compression schemes apply compression uniformly across DNN
layers, although layers are heterogeneous in terms of parameter count and their
impact on model accuracy. In this work, we provide a general framework for
adapting the degree of compression across the model's layers dynamically during
training, improving the overall compression, while leading to substantial
speedups, without sacrificing accuracy. Our framework, called L-GreCo, is based
on an adaptive algorithm, which automatically picks the optimal compression
parameters for model layers guaranteeing the best compression ratio while
satisfying an error constraint. Extensive experiments over image classification
and language modeling tasks show that L-GreCo is effective across all existing
families of compression methods, and achieves up to 2.5$\times$ training
speedup and up to 5$\times$ compression improvement over efficient
implementations of existing approaches, while recovering full accuracy.
Moreover, L-GreCo is complementary to existing adaptive algorithms, improving
their compression ratio by 50% and practical throughput by 66%.
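The abstract frames L-GreCo as a constrained selection problem: pick per-layer compression parameters that maximize overall compression while an aggregate error stays within a budget. The algorithm itself is not spelled out in this summary, so the following is only a minimal illustrative sketch of such a selection step, cast as a multiple-choice knapsack and solved by dynamic programming over a discretized error budget; the candidate levels, the per-layer error estimates, and the select_layer_levels helper are assumptions made for this example, not the authors' implementation.

    # Illustrative sketch (not the paper's code): choose one compression level
    # per layer so that total compressed size is minimized while the summed
    # per-layer error estimate stays within a global budget.
    import math
    from typing import List, Sequence, Tuple

    def select_layer_levels(
        layers: Sequence[Sequence[Tuple[float, float]]],  # per layer: [(size, error), ...]
        error_budget: float,
        grid: int = 200,  # resolution of the discretized error axis
    ) -> List[int]:
        """Return one candidate index per layer, minimizing total size subject to total error <= budget."""
        INF = float("inf")
        step = error_budget / grid
        # dp[b] = (min total size, chosen indices) using b error buckets so far
        dp = [(INF, None)] * (grid + 1)
        dp[0] = (0.0, [])
        for candidates in layers:
            nxt = [(INF, None)] * (grid + 1)
            for used, (size_so_far, picks) in enumerate(dp):
                if picks is None:
                    continue
                for idx, (size, err) in enumerate(candidates):
                    buckets = used + math.ceil(err / step)
                    if buckets <= grid and size_so_far + size < nxt[buckets][0]:
                        nxt[buckets] = (size_so_far + size, picks + [idx])
            dp = nxt
        feasible = [entry for entry in dp if entry[1] is not None]
        if not feasible:
            raise ValueError("error budget too tight for the given candidates")
        return min(feasible, key=lambda e: e[0])[1]

    # Toy usage: two layers, three hypothetical compression levels each,
    # given as (compressed size in MB, estimated error of that level).
    layers = [
        [(4.0, 0.00), (1.0, 0.03), (0.25, 0.10)],
        [(8.0, 0.00), (2.0, 0.02), (0.50, 0.08)],
    ]
    print(select_layer_levels(layers, error_budget=0.1))  # -> [1, 1]

In practice the per-layer error estimates would have to come from the compression scheme itself (e.g., how much a given quantization level or sparsity distorts that layer's gradients), which is the kind of per-layer signal the abstract says L-GreCo adapts dynamically during training.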
Related papers
- Communication-Efficient Distributed Learning with Local Immediate Error
Compensation [95.6828475028581]
We propose the Local Immediate Error Compensated SGD (LIEC-SGD) optimization algorithm.
LIEC-SGD is superior to previous works in either the convergence rate or the communication cost.
arXiv Detail & Related papers (2024-02-19T05:59:09Z)
- Accelerating Distributed Deep Learning using Lossless Homomorphic
Compression [17.654138014999326]
We introduce a novel compression algorithm that effectively merges worker-level compression with in-network aggregation.
We show up to a 6.33$\times$ improvement in aggregation throughput and a 3.74$\times$ increase in per-iteration training speed.
arXiv Detail & Related papers (2024-02-12T09:57:47Z)
- Learning Accurate Performance Predictors for Ultrafast Automated Model
Compression [86.22294249097203]
We propose an ultrafast automated model compression framework called SeerNet for flexible network deployment.
Our method achieves competitive accuracy-complexity trade-offs with significant reduction of the search cost.
arXiv Detail & Related papers (2023-04-13T10:52:49Z)
- Adaptive Step-Size Methods for Compressed SGD [15.32764898836189]
Compressed decentralized Stochastic Gradient Descent (SGD) algorithms have been recently proposed to address the communication bottleneck in distributed and decentralized networks.
We introduce a scaling step, which we use to establish order-optimal convergence rates for compressed SGD.
We present experimental results on real-world datasets.
arXiv Detail & Related papers (2022-07-20T17:20:58Z)
- Optimal Rate Adaption in Federated Learning with Compressed
Communications [28.16239232265479]
Federated Learning incurs high communication overhead, which can be greatly alleviated by compressing model updates.
The tradeoff between compression and model accuracy in the networked environment remains unclear.
We present a framework to maximize the final model accuracy by strategically adjusting the compression in each iteration.
arXiv Detail & Related papers (2021-12-13T14:26:15Z)
- Towards Compact CNNs via Collaborative Compression [166.86915086497433]
We propose a Collaborative Compression scheme, which combines channel pruning and tensor decomposition to compress CNN models.
We achieve 52.9% FLOPs reduction by removing 48.4% parameters on ResNet-50 with only a Top-1 accuracy drop of 0.56% on ImageNet 2012.
arXiv Detail & Related papers (2021-05-24T12:07:38Z)
- An Efficient Statistical-based Gradient Compression Technique for
Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC).
Our evaluation shows SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Topk, and DGC compressors, respectively (a minimal top-k sparsification sketch is given after this list).
arXiv Detail & Related papers (2021-01-26T13:06:00Z)
- Neural Network Compression Via Sparse Optimization [23.184290795230897]
We propose a model compression framework based on the recent progress on sparse optimization.
We achieve up to 7.2 and 2.9 times FLOPs reduction with the same level of accuracy on VGG16 for CIFAR10 and ResNet50 for ImageNet.
arXiv Detail & Related papers (2020-11-10T03:03:55Z)
- PowerGossip: Practical Low-Rank Communication Compression in
Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers.
Inspired by the PowerSGD algorithm for centralized deep learning, this algorithm uses power iteration steps to maximize the information transferred per bit.
arXiv Detail & Related papers (2020-08-04T09:14:52Z)
- Structured Sparsification with Joint Optimization of Group Convolution
and Channel Shuffle [117.95823660228537]
We propose a novel structured sparsification method for efficient network compression.
The proposed method automatically induces structured sparsity on the convolutional weights.
We also address the problem of inter-group communication with a learnable channel shuffle mechanism.
arXiv Detail & Related papers (2020-02-19T12:03:10Z)
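Several of the entries above (SIDCo, DGC, and the Topk baseline they are compared against) revolve around threshold- or top-k-based gradient sparsification. For reference, here is a minimal, generic top-k sparsifier with local error feedback; it is a common baseline pattern rather than the method of any specific paper listed here, and the names topk_compress, topk_decompress, and ErrorFeedback are illustrative.

    # Generic top-k gradient sparsification with local error feedback.
    # A standard baseline pattern, not the implementation of any particular
    # paper above; names and defaults are illustrative.
    import math
    import torch

    def topk_compress(grad: torch.Tensor, k: int):
        """Keep the k largest-magnitude entries; return (values, indices, original shape)."""
        flat = grad.reshape(-1)
        _, idx = torch.topk(flat.abs(), k)
        return flat[idx], idx, grad.shape

    def topk_decompress(values: torch.Tensor, idx: torch.Tensor, shape) -> torch.Tensor:
        """Scatter the kept entries back into a dense zero tensor of the original shape."""
        flat = torch.zeros(math.prod(shape), dtype=values.dtype, device=values.device)
        flat[idx] = values
        return flat.reshape(shape)

    class ErrorFeedback:
        """Accumulate whatever compression drops and re-add it to the next gradient."""
        def __init__(self):
            self.residual = None

        def compress(self, grad: torch.Tensor, k: int):
            if self.residual is None:
                self.residual = torch.zeros_like(grad)
            corrected = grad + self.residual
            values, idx, shape = topk_compress(corrected, k)
            self.residual = corrected - topk_decompress(values, idx, shape)
            return values, idx, shape

    # Toy usage: keep 32 of 1024 entries (~3% density) of a synthetic gradient.
    ef = ErrorFeedback()
    grad = torch.randn(1024)
    values, idx, shape = ef.compress(grad, k=32)
    dense_again = topk_decompress(values, idx, shape)

Threshold-based schemes such as DGC and SIDCo differ mainly in how cheaply and accurately they estimate the threshold (equivalently, k) at each step; the layer-wise adaptation that L-GreCo proposes sits on top of exactly this kind of compressor.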
This list is automatically generated from the titles and abstracts of the papers on this site.