NUQSGD: Provably Communication-efficient Data-parallel SGD via
Nonuniform Quantization
- URL: http://arxiv.org/abs/2104.13818v1
- Date: Wed, 28 Apr 2021 15:07:03 GMT
- Title: NUQSGD: Provably Communication-efficient Data-parallel SGD via
Nonuniform Quantization
- Authors: Ali Ramezani-Kebrya, Fartash Faghri, Ilya Markov, Vitalii Aksenov, Dan
Alistarh, Daniel M. Roy
- Abstract summary: One popular communication-compression method for data-parallel SGD is QSGD, which quantizes and encodes gradients to reduce communication costs.
The baseline variant of QSGD provides strong theoretical guarantees, but for practical purposes, the authors proposed a variant which we call QSGDinf.
In this paper, we build on this work to propose a new quantization scheme, and show that it has both stronger theoretical guarantees than QSGD, and matches and exceeds the empirical performance of QSGDinf.
- Score: 28.849864002527273
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the size and complexity of models and datasets grow, so does the need for
communication-efficient variants of stochastic gradient descent that can be
deployed to perform parallel model training. One popular
communication-compression method for data-parallel SGD is QSGD (Alistarh et
al., 2017), which quantizes and encodes gradients to reduce communication
costs. The baseline variant of QSGD provides strong theoretical guarantees,
however, for practical purposes, the authors proposed a heuristic variant which
we call QSGDinf, which demonstrated impressive empirical gains for distributed
training of large neural networks. In this paper, we build on this work to
propose a new gradient quantization scheme, and show that it has both stronger
theoretical guarantees than QSGD, and matches and exceeds the empirical
performance of the QSGDinf heuristic and of other compression methods.
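To make the idea concrete, below is a minimal Python/NumPy sketch of the kind of unbiased stochastic gradient quantizer used by QSGD-style methods, contrasting uniformly spaced levels with exponentially spaced (nonuniform) levels. The specific level set, function names, and toy error comparison are illustrative assumptions, not the paper's exact scheme, encoding, or analysis.

```python
import numpy as np

def stochastic_quantize(v, levels, rng=None):
    """Quantize vector v onto the given normalized levels in [0, 1].

    Each coordinate |v_i| / ||v||_2 is rounded stochastically to one of its two
    neighbouring levels, so the quantizer is unbiased in expectation. Returns
    (norm, signs, level_indices) -- the data a worker would encode and send
    instead of the dense float32 gradient.
    """
    rng = np.random.default_rng() if rng is None else rng
    levels = np.asarray(levels, dtype=np.float64)   # sorted; levels[0] == 0.0, levels[-1] == 1.0
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return norm, np.zeros(len(v), dtype=np.int8), np.zeros(len(v), dtype=np.int64)
    r = np.abs(v) / norm                            # normalized magnitudes in [0, 1]
    hi = np.clip(np.searchsorted(levels, r, side="left"), 1, len(levels) - 1)
    lo = hi - 1
    p_up = (r - levels[lo]) / (levels[hi] - levels[lo])   # probability of rounding up
    idx = np.where(rng.random(len(v)) < p_up, hi, lo)
    return norm, np.sign(v).astype(np.int8), idx

def dequantize(norm, signs, idx, levels):
    """Reconstruct the unbiased gradient estimate from the encoded triple."""
    return norm * signs * np.asarray(levels, dtype=np.float64)[idx]

# Uniform levels (QSGD-like) vs. exponentially spaced levels (nonuniform, NUQSGD-like).
s = 4
uniform_levels = np.linspace(0.0, 1.0, s + 1)
nonuniform_levels = np.concatenate(([0.0], 2.0 ** np.arange(-s, 1, dtype=np.float64)))

g = np.random.default_rng(0).normal(size=1000)
for name, lv in [("uniform", uniform_levels), ("nonuniform", nonuniform_levels)]:
    norm, signs, idx = stochastic_quantize(g, lv)
    err = np.linalg.norm(dequantize(norm, signs, idx, lv) - g) / np.linalg.norm(g)
    print(f"{name:10s} levels: relative quantization error = {err:.3f}")
```

In an actual data-parallel run, each worker would transmit the norm, the signs, and the level indices under a variable-length integer code (QSGD, for instance, uses Elias coding) rather than dense float32 gradients, which is where the communication savings come from.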
Related papers
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST)
IST is a recently proposed and highly effective technique for solving the aforementioned problems.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z) - Quantized Distributed Training of Large Models with Convergence
Guarantees [34.054462975511996]
We present QSDP, a variant of FSDP which supports both gradient and weight quantization with theoretical guarantees.
We show that QSDP preserves model accuracy, while completely removing the communication bottlenecks of FSDP, providing end-to-end speedups of up to 2.2x.
arXiv Detail & Related papers (2023-02-05T14:20:55Z) - Validation Diagnostics for SBI algorithms based on Normalizing Flows [55.41644538483948]
This work proposes easy-to-interpret validation diagnostics for multi-dimensional conditional (posterior) density estimators based on normalizing flows (NF).
It also offers theoretical guarantees based on results of local consistency.
This work should help the design of better specified models or drive the development of novel SBI-algorithms.
arXiv Detail & Related papers (2022-11-17T15:48:06Z) - Quantized Adaptive Subgradient Algorithms and Their Applications [39.103587572626026]
We propose quantized composite mirror descent adaptive subgradient (QCMD adagrad) and quantized regularized dual average adaptive subgradient (QRDA adagrad) for distributed training.
A quantized gradient-based adaptive learning rate matrix is constructed to achieve a balance between communication costs, accuracy, and model sparsity.
arXiv Detail & Related papers (2022-08-11T04:04:03Z) - Adaptive Step-Size Methods for Compressed SGD [15.32764898836189]
Compressed decentralized Stochastic Gradient Descent (SGD) algorithms have been recently proposed to address the communication bottleneck in distributed and decentralized networks.
We introduce a scaling step which we use to establish order-optimal convergence rates for compressed SGD.
We present experimental results on real-world datasets.
arXiv Detail & Related papers (2022-07-20T17:20:58Z) - ClusterQ: Semantic Feature Distribution Alignment for Data-Free
Quantization [111.12063632743013]
We propose a new and effective data-free quantization method termed ClusterQ.
To obtain high inter-class separability of semantic features, we cluster and align the feature distribution statistics.
We also incorporate the intra-class variance to solve class-wise mode collapse.
arXiv Detail & Related papers (2022-04-30T06:58:56Z) - Communication-Compressed Adaptive Gradient Method for Distributed
Nonconvex Optimization [21.81192774458227]
One of the major bottlenecks is the large communication cost between the central server and the local workers.
Our proposed distributed learning framework features an effective gradient compression strategy.
arXiv Detail & Related papers (2021-11-01T04:54:55Z) - An Efficient Statistical-based Gradient Compression Technique for
Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC)
Our evaluation shows SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Top-k, and DGC compressors, respectively.
arXiv Detail & Related papers (2021-01-26T13:06:00Z) - Feature Quantization Improves GAN Training [126.02828112121874]
Feature Quantization (FQ) for the discriminator embeds both true and fake data samples into a shared discrete space.
Our method can be easily plugged into existing GAN models, with little computational overhead in training.
arXiv Detail & Related papers (2020-04-05T04:06:50Z) - Stochastic-Sign SGD for Federated Learning with Theoretical Guarantees [49.91477656517431]
Quantization-based solvers have been widely adopted in Federated Learning (FL)
No existing methods enjoy all the aforementioned properties.
We propose an intuitively-simple yet theoretically-sound method based on SIGNSGD to bridge the gap.
arXiv Detail & Related papers (2020-02-25T15:12:15Z) - Elastic Consistency: A General Consistency Model for Distributed
Stochastic Gradient Descent [28.006781039853575]
A key element behind the progress of machine learning in recent years has been the ability to train machine learning models in largescale distributed-memory environments.
In this paper, we introduce a general consistency model for the distributed SGD methods used in practice to train large-scale machine learning models.
Our framework, called elastic consistency, enables us to derive convergence bounds for a variety of distributed SGD methods.
arXiv Detail & Related papers (2020-01-16T16:10:58Z)
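Several entries above rely on other standard gradient-compression operators: sign-based quantization (the Stochastic-Sign SGD / SIGNSGD entry) and Top-k, DGC-style, or threshold-based sparsification (the SIDCo entry). The sketch below is a minimal illustration of those operators under simple assumptions; the function names, the mean-magnitude scaling, and the fixed threshold are placeholders rather than the papers' exact algorithms.

```python
import numpy as np

def sign_compress(g):
    """Sign compression (SIGNSGD-style): one bit per coordinate, optionally
    paired with a scalar such as the mean magnitude for rescaling."""
    return np.sign(g).astype(np.int8), float(np.mean(np.abs(g)))

def topk_sparsify(g, k):
    """Top-k sparsification (the idea behind Top-k / DGC-style compressors):
    keep only the k largest-magnitude coordinates and drop the rest."""
    idx = np.argpartition(np.abs(g), -k)[-k:]
    return idx, g[idx]                      # send (indices, values) instead of the dense vector

def threshold_sparsify(g, threshold):
    """Threshold-based sparsification; SIDCo-style schemes estimate this
    threshold from a fitted distribution of gradient magnitudes."""
    idx = np.nonzero(np.abs(g) >= threshold)[0]
    return idx, g[idx]

g = np.random.default_rng(1).normal(size=10_000)
signs, scale = sign_compress(g)
idx, vals = topk_sparsify(g, k=100)
print("sign compression payload: 1 bit per coordinate (+ one scalar)")
print(f"top-k payload: {len(idx)} (index, value) pairs out of {g.size} coordinates")
```

Each operator trades fidelity of the transmitted gradient against the number of bits sent per step, which is the same trade-off the nonuniform quantization scheme above targets.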
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.