NUQSGD: Provably Communication-efficient Data-parallel SGD via
Nonuniform Quantization
- URL: http://arxiv.org/abs/2104.13818v1
- Date: Wed, 28 Apr 2021 15:07:03 GMT
- Title: NUQSGD: Provably Communication-efficient Data-parallel SGD via
Nonuniform Quantization
- Authors: Ali Ramezani-Kebrya, Fartash Faghri, Ilya Markov, Vitalii Aksenov, Dan
Alistarh, Daniel M. Roy
- Abstract summary: One popular communication-compression method for data-parallel SGD is QSGD, which quantizes and encodes gradients to reduce communication costs.
The baseline variant of QSGD provides strong theoretical guarantees, but for practical purposes, the authors proposed a variant which we call QSGDinf.
In this paper, we build on this work to propose a new quantization scheme, and show that it has both stronger theoretical guarantees than QSGD, and matches and exceeds the empirical performance of QSGDinf.
- Score: 28.849864002527273
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the size and complexity of models and datasets grow, so does the need for
communication-efficient variants of stochastic gradient descent that can be
deployed to perform parallel model training. One popular
communication-compression method for data-parallel SGD is QSGD (Alistarh et
al., 2017), which quantizes and encodes gradients to reduce communication
costs. The baseline variant of QSGD provides strong theoretical guarantees,
however, for practical purposes, the authors proposed a heuristic variant which
we call QSGDinf, which demonstrated impressive empirical gains for distributed
training of large neural networks. In this paper, we build on this work to
propose a new gradient quantization scheme, and show that it has both stronger
theoretical guarantees than QSGD, and matches and exceeds the empirical
performance of the QSGDinf heuristic and of other compression methods.
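To make the idea concrete, below is a minimal Python/NumPy sketch of the kind of unbiased stochastic gradient quantizer used by QSGD-style methods, contrasting uniformly spaced levels with exponentially spaced (nonuniform) levels. The specific level set, function names, and toy error comparison are illustrative assumptions, not the paper's exact scheme, encoding, or analysis.

```python
import numpy as np

def stochastic_quantize(v, levels, rng=None):
    """Quantize vector v onto the given normalized levels in [0, 1].

    Each coordinate |v_i| / ||v||_2 is rounded stochastically to one of its two
    neighbouring levels, so the quantizer is unbiased in expectation. Returns
    (norm, signs, level_indices) -- the data a worker would encode and send
    instead of the dense float32 gradient.
    """
    rng = np.random.default_rng() if rng is None else rng
    levels = np.asarray(levels, dtype=np.float64)   # sorted; levels[0] == 0.0, levels[-1] == 1.0
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return norm, np.zeros(len(v), dtype=np.int8), np.zeros(len(v), dtype=np.int64)
    r = np.abs(v) / norm                            # normalized magnitudes in [0, 1]
    hi = np.clip(np.searchsorted(levels, r, side="left"), 1, len(levels) - 1)
    lo = hi - 1
    p_up = (r - levels[lo]) / (levels[hi] - levels[lo])   # probability of rounding up
    idx = np.where(rng.random(len(v)) < p_up, hi, lo)
    return norm, np.sign(v).astype(np.int8), idx

def dequantize(norm, signs, idx, levels):
    """Reconstruct the unbiased gradient estimate from the encoded triple."""
    return norm * signs * np.asarray(levels, dtype=np.float64)[idx]

# Uniform levels (QSGD-like) vs. exponentially spaced levels (nonuniform, NUQSGD-like).
s = 4
uniform_levels = np.linspace(0.0, 1.0, s + 1)
nonuniform_levels = np.concatenate(([0.0], 2.0 ** np.arange(-s, 1, dtype=np.float64)))

g = np.random.default_rng(0).normal(size=1000)
for name, lv in [("uniform", uniform_levels), ("nonuniform", nonuniform_levels)]:
    norm, signs, idx = stochastic_quantize(g, lv)
    err = np.linalg.norm(dequantize(norm, signs, idx, lv) - g) / np.linalg.norm(g)
    print(f"{name:10s} levels: relative quantization error = {err:.3f}")
```

In an actual data-parallel run, each worker would transmit the norm, the signs, and the level indices under a variable-length integer code (QSGD, for instance, uses Elias coding) rather than dense float32 gradients, which is where the communication savings come from.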
Related papers
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST)
IST is a recently proposed and highly effective technique for solving the aforementioned problems.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z) - Quantized Distributed Training of Large Models with Convergence
Guarantees [34.054462975511996]
We present QSDP, a variant of FSDP which supports both gradient and weight quantization with theoretical guarantees.
We show that QSDP preserves model accuracy, while completely removing the communication bottlenecks of FSDP, providing end-to-end speedups of up to 2.2x.
arXiv Detail & Related papers (2023-02-05T14:20:55Z) - Validation Diagnostics for SBI algorithms based on Normalizing Flows [55.41644538483948]
This work proposes easy-to-interpret validation diagnostics for multi-dimensional conditional (posterior) density estimators based on normalizing flows (NF).
It also offers theoretical guarantees based on results of local consistency.
This work should help the design of better specified models or drive the development of novel SBI-algorithms.
arXiv Detail & Related papers (2022-11-17T15:48:06Z) - Quantized Adaptive Subgradient Algorithms and Their Applications [39.103587572626026]
We propose quantized composite mirror descent adaptive subgradient (QCMD adagrad) and quantized regularized dual average adaptive subgradient (QRDA adagrad) for distributed training.
A quantized gradient-based adaptive learning rate matrix is constructed to achieve a balance between communication costs, accuracy, and model sparsity.
arXiv Detail & Related papers (2022-08-11T04:04:03Z) - Adaptive Step-Size Methods for Compressed SGD [15.32764898836189]
Compressed decentralized Stochastic Gradient Descent (SGD) algorithms have been recently proposed to address the communication bottleneck in distributed and decentralized networks.
We introduce a scaling step which we use to establish order-optimal convergence rates for compressed SGD.
We present experimental results on real-world datasets.
arXiv Detail & Related papers (2022-07-20T17:20:58Z) - ClusterQ: Semantic Feature Distribution Alignment for Data-Free
Quantization [111.12063632743013]
We propose a new and effective data-free quantization method termed ClusterQ.
To obtain high inter-class separability of semantic features, we cluster and align the feature distribution statistics.
We also incorporate the intra-class variance to solve class-wise mode collapse.
arXiv Detail & Related papers (2022-04-30T06:58:56Z) - Communication-Compressed Adaptive Gradient Method for Distributed
Nonconvex Optimization [21.81192774458227]
One of the major bottlenecks is the large communication cost between the central server and the local workers.
Our proposed distributed learning framework features an effective gradient compression strategy.
arXiv Detail & Related papers (2021-11-01T04:54:55Z) - An Efficient Statistical-based Gradient Compression Technique for
Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC)
Our evaluation shows SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Top-k, and DGC compressors, respectively.
arXiv Detail & Related papers (2021-01-26T13:06:00Z) - Feature Quantization Improves GAN Training [126.02828112121874]
Feature Quantization (FQ) for the discriminator embeds both true and fake data samples into a shared discrete space.
Our method can be easily plugged into existing GAN models, with little computational overhead in training.
arXiv Detail & Related papers (2020-04-05T04:06:50Z) - Stochastic-Sign SGD for Federated Learning with Theoretical Guarantees [49.91477656517431]
Quantization-based solvers have been widely adopted in Federated Learning (FL)
No existing methods enjoy all the aforementioned properties.
We propose an intuitively-simple yet theoretically-sound method based on SIGNSGD to bridge the gap.
arXiv Detail & Related papers (2020-02-25T15:12:15Z) - Elastic Consistency: A General Consistency Model for Distributed
Stochastic Gradient Descent [28.006781039853575]
A key element behind the progress of machine learning in recent years has been the ability to train machine learning models in largescale distributed-memory environments.
In this paper, we introduce a general consistency model for the distributed SGD methods used in practice to train large-scale machine learning models.
Our framework, called elastic consistency, enables us to derive convergence bounds for a variety of distributed SGD methods.
arXiv Detail & Related papers (2020-01-16T16:10:58Z)
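Several entries above rely on other standard gradient-compression operators: sign-based quantization (the Stochastic-Sign SGD / SIGNSGD entry) and Top-k, DGC-style, or threshold-based sparsification (the SIDCo entry). The sketch below is a minimal illustration of those operators under simple assumptions; the function names, the mean-magnitude scaling, and the fixed threshold are placeholders rather than the papers' exact algorithms.

```python
import numpy as np

def sign_compress(g):
    """Sign compression (SIGNSGD-style): one bit per coordinate, optionally
    paired with a scalar such as the mean magnitude for rescaling."""
    return np.sign(g).astype(np.int8), float(np.mean(np.abs(g)))

def topk_sparsify(g, k):
    """Top-k sparsification (the idea behind Top-k / DGC-style compressors):
    keep only the k largest-magnitude coordinates and drop the rest."""
    idx = np.argpartition(np.abs(g), -k)[-k:]
    return idx, g[idx]                      # send (indices, values) instead of the dense vector

def threshold_sparsify(g, threshold):
    """Threshold-based sparsification; SIDCo-style schemes estimate this
    threshold from a fitted distribution of gradient magnitudes."""
    idx = np.nonzero(np.abs(g) >= threshold)[0]
    return idx, g[idx]

g = np.random.default_rng(1).normal(size=10_000)
signs, scale = sign_compress(g)
idx, vals = topk_sparsify(g, k=100)
print("sign compression payload: 1 bit per coordinate (+ one scalar)")
print(f"top-k payload: {len(idx)} (index, value) pairs out of {g.size} coordinates")
```

Each operator trades fidelity of the transmitted gradient against the number of bits sent per step, which is the same trade-off the nonuniform quantization scheme above targets.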
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.