FastSGD: A Fast Compressed SGD Framework for Distributed Machine
Learning
- URL: http://arxiv.org/abs/2112.04291v1
- Date: Wed, 8 Dec 2021 13:56:24 GMT
- Title: FastSGD: A Fast Compressed SGD Framework for Distributed Machine
Learning
- Authors: Keyu Yang, Lu Chen, Zhihao Zeng, Yunjun Gao
- Abstract summary: Stochastic Gradient Descent (SGD) is arguably the workhorse algorithm of distributed Machine Learning (ML).
FastSGD represents the gradients as key-value pairs, and compresses both the gradient keys and values in linear time complexity.
FastSGD achieves a compression ratio of up to 4 orders of magnitude and accelerates convergence by up to 8x compared with state-of-the-art methods.
- Score: 16.542846343774357
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid increase of big data, distributed Machine Learning (ML) has
been widely applied in training large-scale models. Stochastic Gradient Descent
(SGD) is arguably the workhorse algorithm of ML. Distributed ML models trained
by SGD involve large amounts of gradient communication, which limits the
scalability of distributed ML. Thus, it is important to compress the gradients
for reducing communication. In this paper, we propose FastSGD, a Fast
compressed SGD framework for distributed ML. To achieve a high compression
ratio at a low cost, FastSGD represents the gradients as key-value pairs, and
compresses both the gradient keys and values in linear time complexity. For the
gradient value compression, FastSGD first uses a reciprocal mapper to transform the
original values into reciprocal values, and then applies logarithmic quantization to
further reduce the reciprocal values to small integers. Finally,
FastSGD filters the reduced gradient integers by a given threshold. For the
gradient key compression, FastSGD provides an adaptive fine-grained delta
encoding method to store gradient keys with fewer bits. Extensive experiments
on practical ML models and datasets demonstrate that FastSGD achieves a
compression ratio of up to 4 orders of magnitude and accelerates convergence
by up to 8x compared with state-of-the-art methods.
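The value-compression steps described above (reciprocal mapper, logarithmic quantization, threshold filtering) and the delta encoding of gradient keys can be illustrated with a minimal sketch. The snippet below is an illustrative reconstruction from the abstract only: the quantization base, threshold, sign handling, and plain gap encoding are assumptions, and the paper's adaptive fine-grained delta encoding and exact formulas are not reproduced.

```python
import numpy as np

def compress_gradient(grad, base=2.0, threshold=8, eps=1e-12):
    """Sketch of FastSGD-style value/key compression (illustrative parameters)."""
    signs = np.sign(grad).astype(np.int8)
    # Reciprocal mapper: small-magnitude gradients map to large reciprocals.
    reciprocal = 1.0 / (np.abs(grad) + eps)
    # Logarithmic quantization: reduce reciprocal values to small integers.
    quantized = np.floor(np.log(reciprocal) / np.log(base)).astype(np.int64)
    # Threshold filter: a small quantized integer corresponds to a large
    # original magnitude, so keep only entries below the threshold.
    keep = (quantized < threshold) & (signs != 0)
    keys = np.nonzero(keep)[0]
    # Gap (delta) encoding of the sorted keys: gaps need fewer bits than
    # absolute indices (the paper uses an adaptive fine-grained variant).
    delta_keys = np.diff(keys, prepend=0)
    return delta_keys, quantized[keep], signs[keep]

def decompress_gradient(delta_keys, quantized, signs, dim, base=2.0):
    """Rebuild a dense approximate gradient from the compressed key-value pairs."""
    keys = np.cumsum(delta_keys)
    grad = np.zeros(dim)
    # Invert the logarithmic quantization and the reciprocal mapping.
    grad[keys] = signs * base ** (-quantized.astype(np.float64))
    return grad

if __name__ == "__main__":
    g = np.random.randn(100_000) * 0.01
    dk, q, s = compress_gradient(g)
    g_hat = decompress_gradient(dk, q, s, g.size)
    print(f"kept {len(q)} of {g.size} entries")
```

In this sketch a smaller threshold keeps only the largest-magnitude gradients, trading accuracy for a higher compression ratio; packing the resulting integers into a bit stream for transmission is omitted.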
Related papers
- Faster Convergence of Riemannian Stochastic Gradient Descent with Increasing Batch Size [0.6906005491572401]
Using an increasing batch size leads to faster convergence of RSGD than using a constant batch size.
Experiments on principal component analysis and low-rank matrix problems confirmed that using a growing or exponentially growing batch size results in better performance than using a constant batch size.
arXiv Detail & Related papers (2025-01-30T06:23:28Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate the KV cache's memory overhead include efficient attention variants integrated in upcycling stages and KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - Gradient-free Decoder Inversion in Latent Diffusion Models [18.493960162113712]
In latent diffusion models (LDMs), the denoising diffusion process takes place efficiently in a latent space whose dimension is lower than that of the pixel space.
We propose an efficient gradient-free decoder inversion for LDMs, which can be applied to diverse latent models.
arXiv Detail & Related papers (2024-09-27T04:38:14Z) - Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models [56.00251589760559]
Large language models (LLMs) can act as gradient priors in a zero-shot setting.
We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding.
Experiments indicate that LM-GC surpasses existing state-of-the-art lossless compression methods.
arXiv Detail & Related papers (2024-09-26T13:38:33Z) - Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance [55.01966743652196]
We propose a novel algorithm for distributed stochastic gradient descent (SGD) with compressed gradient communication in the parameter-server framework.
Our gradient compression technique, named flattened one-bit stochastic gradient descent (FO-SGD), relies on two simple algorithmic ideas.
arXiv Detail & Related papers (2024-05-17T21:17:27Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Communication-Efficient Federated Learning via Quantized Compressed
Sensing [82.10695943017907]
The presented framework consists of gradient compression for wireless devices and gradient reconstruction for a parameter server.
Thanks to gradient sparsification and quantization, our strategy can achieve a higher compression ratio than one-bit gradient compression.
We demonstrate that the framework achieves almost identical performance with the case that performs no compression.
arXiv Detail & Related papers (2021-11-30T02:13:54Z) - Quantization for Distributed Optimization [0.0]
We present a set of all-reduce compatible gradient compression schemes that significantly reduce the communication overhead while maintaining the performance of vanilla SGD.
Our compression methods perform better than the in-built methods currently offered by the deep learning frameworks.
arXiv Detail & Related papers (2021-09-26T05:16:12Z) - Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z)