Quantized Distributed Training of Large Models with Convergence
Guarantees
- URL: http://arxiv.org/abs/2302.02390v1
- Date: Sun, 5 Feb 2023 14:20:55 GMT
- Title: Quantized Distributed Training of Large Models with Convergence
Guarantees
- Authors: Ilia Markov, Adrian Vladu, Qi Guo, Dan Alistarh
- Abstract summary: We present QSDP, a variant of FSDP which supports both gradient and weight quantization with theoretical guarantees.
We show that QSDP preserves accuracy while completely removing the communication bottlenecks of FSDP, providing end-to-end speedups of up to 2.2x.
- Score: 34.054462975511996
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Communication-reduction techniques are a popular way to improve scalability
in data-parallel training of deep neural networks (DNNs). The recent emergence
of large language models such as GPT has created the need for new approaches to
exploit data-parallelism. Among these, fully-sharded data parallel (FSDP)
training is highly popular, yet it still encounters scalability bottlenecks.
One reason is that applying compression techniques to FSDP is challenging: as
the vast majority of the communication involves the model's weights, direct
compression alters convergence and leads to accuracy loss. We present QSDP, a
variant of FSDP which supports both gradient and weight quantization with
theoretical guarantees, is simple to implement and has essentially no
overheads. To derive QSDP we prove that a natural modification of SGD achieves
convergence even when we only maintain quantized weights, and thus the domain
over which we train consists of quantized points and is, therefore, highly
non-convex. We validate this approach by training GPT-family models with up to
1.3 billion parameters on a multi-node cluster. Experiments show that QSDP
preserves model accuracy, while completely removing the communication
bottlenecks of FSDP, providing end-to-end speedups of up to 2.2x.
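The core algorithmic idea, running SGD while only ever storing weights that lie on a quantization grid, can be illustrated with a short sketch. The NumPy snippet below is a minimal illustration under the assumption of a uniform grid and unbiased stochastic rounding; the helper names, the grid spacing, and the toy objective are hypothetical and are not the authors' exact quantizer or projection.

```python
import numpy as np

def stochastic_round_to_grid(x, step):
    """Unbiased stochastic rounding of x onto a uniform grid with spacing `step`.

    Each entry is rounded down or up to a neighbouring grid point, with the
    probability of rounding up equal to the fractional remainder, so that the
    rounded value equals x in expectation. (Illustrative assumption, not the
    paper's exact scheme.)
    """
    lower = np.floor(x / step) * step
    prob_up = (x - lower) / step                  # fractional part, in [0, 1)
    round_up = np.random.rand(*x.shape) < prob_up
    return lower + round_up * step

def quantized_sgd_step(w_q, grad, lr, step):
    """One SGD step whose iterate is kept on the quantization grid.

    w_q  : current weights, already lying on the grid
    grad : stochastic gradient evaluated at w_q
    """
    w_new = w_q - lr * grad                       # ordinary SGD move (leaves the grid)
    return stochastic_round_to_grid(w_new, step)  # unbiased projection back onto the grid

# Toy usage: minimise f(w) = 0.5 * ||w||^2 while only ever storing grid points.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    step = 2.0 ** -4                              # illustrative grid spacing (assumption)
    w = stochastic_round_to_grid(rng.normal(size=10), step)
    for _ in range(200):
        g = w + 0.01 * rng.normal(size=10)        # noisy gradient of 0.5 * ||w||^2
        w = quantized_sgd_step(w, g, lr=0.1, step=step)
    print(np.linalg.norm(w))                      # small, and w remains on the grid
```

Because the rounding is unbiased, the projected iterate equals the ordinary SGD iterate in expectation, which is the property that convergence arguments over quantized (and hence highly non-convex) domains typically rely on.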
Related papers
- Fed-CVLC: Compressing Federated Learning Communications with
Variable-Length Codes [54.18186259484828]
In the Federated Learning (FL) paradigm, a parameter server (PS) concurrently communicates with distributed participating clients for model collection, update aggregation, and model distribution over multiple rounds.
We show strong evidence that variable-length coding is beneficial for compression in FL.
We present Fed-CVLC (Federated Learning Compression with Variable-Length Codes), which fine-tunes the code length in response to the dynamics of model updates.
arXiv Detail & Related papers (2024-02-06T07:25:21Z) - FedDIP: Federated Learning with Extreme Dynamic Pruning and Incremental
Regularization [5.182014186927254]
Federated Learning (FL) has been successfully adopted for distributed training and inference of large-scale Deep Neural Networks (DNNs).
We contribute a novel FL framework (coined FedDIP) which combines (i) dynamic model pruning with error feedback to eliminate redundant information exchange with (ii) incremental regularization.
We provide convergence analysis of FedDIP and report on a comprehensive performance and comparative assessment against state-of-the-art methods.
arXiv Detail & Related papers (2023-09-13T08:51:19Z) - Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST).
IST is a recently proposed and highly effective technique for solving the aforementioned problems.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z) - A Low-Complexity Approach to Rate-Distortion Optimized Variable Bit-Rate
Compression for Split DNN Computing [5.3221129103999125]
Split computing has emerged as a recent paradigm for implementation of DNN-based AI workloads.
We present an approach that addresses the challenge of optimizing the rate-accuracy-complexity trade-off.
Our approach is remarkably lightweight during both training and inference, highly effective, and achieves excellent rate-distortion performance.
arXiv Detail & Related papers (2022-08-24T15:02:11Z) - ClusterQ: Semantic Feature Distribution Alignment for Data-Free
Quantization [111.12063632743013]
We propose a new and effective data-free quantization method termed ClusterQ.
To obtain high inter-class separability of semantic features, we cluster and align the feature distribution statistics.
We also incorporate the intra-class variance to solve class-wise mode collapse.
arXiv Detail & Related papers (2022-04-30T06:58:56Z) - Scaling Structured Inference with Randomization [64.18063627155128]
We propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states.
Our method is widely applicable to classical DP-based inference.
It is also compatible with automatic differentiation, so it can be integrated seamlessly with neural networks.
arXiv Detail & Related papers (2021-12-07T11:26:41Z) - NUQSGD: Provably Communication-efficient Data-parallel SGD via
Nonuniform Quantization [28.849864002527273]
One popular communication-compression method for data-parallel SGD is QSGD, which quantizes and encodes gradients to reduce communication costs.
The baseline variant of QSGD provides strong theoretical guarantees, but for practical purposes, the authors proposed a variant which we call QSGDinf.
In this paper, we build on this work to propose a new quantization scheme, and show that it has both stronger theoretical guarantees than QSGD and matches or exceeds the empirical performance of QSGDinf (a minimal sketch of this style of stochastic gradient quantization appears after this list).
arXiv Detail & Related papers (2021-04-28T15:07:03Z) - Training Recommender Systems at Scale: Communication-Efficient Model and
Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least $100\times$ and $20\times$ during DP and MP, respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
arXiv Detail & Related papers (2020-10-18T01:44:42Z) - Adaptive Subcarrier, Parameter, and Power Allocation for Partitioned
Edge Learning Over Broadband Channels [69.18343801164741]
Partitioned edge learning (PARTEL) implements parameter-server training, a well-known distributed learning method, in a wireless network.
We consider the case of deep neural network (DNN) models which can be trained using PARTEL by introducing some auxiliary variables.
arXiv Detail & Related papers (2020-10-08T15:27:50Z) - Deep Generative Models that Solve PDEs: Distributed Computing for
Training Large Data-Free Models [25.33147292369218]
Recent progress in scientific machine learning (SciML) has opened up the possibility of training novel neural network architectures that solve complex partial differential equations (PDEs).
Here we report on a software framework for data parallel distributed deep learning that resolves the twin challenges of training these large SciML models.
Our framework provides several out-of-the-box functionalities, including (a) loss integrity independent of the number of processes, (b) synchronized batch normalization, and (c) distributed higher-order optimization methods.
arXiv Detail & Related papers (2020-07-24T22:42:35Z)
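As additional context for the gradient-compression entries above (QSGD / NUQSGD), the following sketch shows a uniform-level stochastic gradient quantizer in the spirit of the original QSGD. The function names and the choice of s = 8 levels are illustrative assumptions; NUQSGD's nonuniform level placement is not reproduced here.

```python
import numpy as np

def qsgd_quantize(v, s):
    """Stochastically quantize v onto s uniform levels per coordinate (QSGD-style sketch).

    Returns (norm, signs, levels); dequantize() recovers an unbiased estimate of v.
    In practice the integer levels would be entropy-coded before transmission.
    """
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return norm, np.sign(v), np.zeros(v.shape, dtype=np.int64)
    scaled = np.abs(v) / norm * s                 # each entry now lies in [0, s]
    lower = np.floor(scaled)
    round_up = np.random.rand(*v.shape) < (scaled - lower)
    levels = (lower + round_up).astype(np.int64)  # integer level indices in [0, s]
    return norm, np.sign(v), levels

def dequantize(norm, signs, levels, s):
    """Reconstruct the (unbiased) estimate of the original vector."""
    return norm * signs * levels / s

# Sanity check: averaging many independent quantizations recovers the original gradient.
if __name__ == "__main__":
    g = np.random.default_rng(1).normal(size=1000)
    s = 8                                         # illustrative number of levels (assumption)
    est = np.mean([dequantize(*qsgd_quantize(g, s), s) for _ in range(500)], axis=0)
    print(np.max(np.abs(est - g)))                # close to zero: the quantizer is unbiased
```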
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.