FastSGD: A Fast Compressed SGD Framework for Distributed Machine
Learning
- URL: http://arxiv.org/abs/2112.04291v1
- Date: Wed, 8 Dec 2021 13:56:24 GMT
- Title: FastSGD: A Fast Compressed SGD Framework for Distributed Machine
Learning
- Authors: Keyu Yang, Lu Chen, Zhihao Zeng, Yunjun Gao
- Abstract summary: Stochastic Gradient Descent (SGD) is arguably the workhorse algorithm of distributed Machine Learning (ML).
FastSGD represents the gradients as key-value pairs, and compresses both the gradient keys and values in linear time complexity.
FastSGD achieves a compression ratio of up to 4 orders of magnitude and accelerates convergence by up to 8x compared with state-of-the-art methods.
- Score: 16.542846343774357
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid increase of big data, distributed Machine Learning (ML) has
been widely applied in training large-scale models. Stochastic Gradient Descent
(SGD) is arguably the workhorse algorithm of ML. Distributed ML models trained
by SGD involve large amounts of gradient communication, which limits the
scalability of distributed ML. Thus, it is important to compress the gradients
for reducing communication. In this paper, we propose FastSGD, a Fast
compressed SGD framework for distributed ML. To achieve a high compression
ratio at a low cost, FastSGD represents the gradients as key-value pairs, and
compresses both the gradient keys and values in linear time complexity. For the
gradient value compression, FastSGD first uses a reciprocal mapper to transform the
original values into reciprocal values, and then applies logarithmic quantization to
further reduce the reciprocal values to small integers. Finally,
FastSGD filters the reduced gradient integers by a given threshold. For the
gradient key compression, FastSGD provides an adaptive fine-grained delta
encoding method to store gradient keys with fewer bits. Extensive experiments
on practical ML models and datasets demonstrate that FastSGD achieves a
compression ratio of up to 4 orders of magnitude and accelerates convergence
by up to 8x compared with state-of-the-art methods.
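The value-compression steps described above (reciprocal mapper, logarithmic quantization, threshold filtering) and the delta encoding of gradient keys can be illustrated with a minimal sketch. The snippet below is an illustrative reconstruction from the abstract only: the quantization base, threshold, sign handling, and plain gap encoding are assumptions, and the paper's adaptive fine-grained delta encoding and exact formulas are not reproduced.

```python
import numpy as np

def compress_gradient(grad, base=2.0, threshold=8, eps=1e-12):
    """Sketch of FastSGD-style value/key compression (illustrative parameters)."""
    signs = np.sign(grad).astype(np.int8)
    # Reciprocal mapper: small-magnitude gradients map to large reciprocals.
    reciprocal = 1.0 / (np.abs(grad) + eps)
    # Logarithmic quantization: reduce reciprocal values to small integers.
    quantized = np.floor(np.log(reciprocal) / np.log(base)).astype(np.int64)
    # Threshold filter: a small quantized integer corresponds to a large
    # original magnitude, so keep only entries below the threshold.
    keep = (quantized < threshold) & (signs != 0)
    keys = np.nonzero(keep)[0]
    # Gap (delta) encoding of the sorted keys: gaps need fewer bits than
    # absolute indices (the paper uses an adaptive fine-grained variant).
    delta_keys = np.diff(keys, prepend=0)
    return delta_keys, quantized[keep], signs[keep]

def decompress_gradient(delta_keys, quantized, signs, dim, base=2.0):
    """Rebuild a dense approximate gradient from the compressed key-value pairs."""
    keys = np.cumsum(delta_keys)
    grad = np.zeros(dim)
    # Invert the logarithmic quantization and the reciprocal mapping.
    grad[keys] = signs * base ** (-quantized.astype(np.float64))
    return grad

if __name__ == "__main__":
    g = np.random.randn(100_000) * 0.01
    dk, q, s = compress_gradient(g)
    g_hat = decompress_gradient(dk, q, s, g.size)
    print(f"kept {len(q)} of {g.size} entries")
```

In this sketch a smaller threshold keeps only the largest-magnitude gradients, trading accuracy for a higher compression ratio; packing the resulting integers into a bit stream for transmission is omitted.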
Related papers
- Faster Convergence of Riemannian Stochastic Gradient Descent with Increasing Batch Size [0.6906005491572401]
Using an increasing batch size leads to faster convergence of RSGD than using a constant batch size.
Experiments on principal component analysis and low-rank matrix problems confirmed that using a growing or exponentially growing batch size results in better performance than using a constant batch size.
arXiv Detail & Related papers (2025-01-30T06:23:28Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate the KV cache's memory overhead include efficient attention variants integrated in upcycling stages and KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - Gradient-free Decoder Inversion in Latent Diffusion Models [18.493960162113712]
In latent diffusion models (LDMs), the denoising diffusion process takes place efficiently in a latent space whose dimension is lower than that of the pixel space.
We propose an efficient gradient-free decoder inversion for LDMs, which can be applied to diverse latent models.
arXiv Detail & Related papers (2024-09-27T04:38:14Z) - Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models [56.00251589760559]
Large language models (LLMs) can act as gradient priors in a zero-shot setting.
We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding.
Experiments indicate that LM-GC surpasses existing state-of-the-art lossless compression methods.
arXiv Detail & Related papers (2024-09-26T13:38:33Z) - Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance [55.01966743652196]
We propose a novel algorithm for distributed stochastic gradient descent (SGD) with compressed gradient communication in the parameter-server framework.
Our gradient compression technique, named flattened one-bit stochastic gradient descent (FO-SGD), relies on two simple algorithmic ideas.
arXiv Detail & Related papers (2024-05-17T21:17:27Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Communication-Efficient Federated Learning via Quantized Compressed
Sensing [82.10695943017907]
The presented framework consists of gradient compression for wireless devices and gradient reconstruction for a parameter server.
Thanks to gradient sparsification and quantization, our strategy can achieve a higher compression ratio than one-bit gradient compression.
We demonstrate that the framework achieves almost identical performance with the case that performs no compression.
arXiv Detail & Related papers (2021-11-30T02:13:54Z) - Quantization for Distributed Optimization [0.0]
We present a set of all-reduce compatible gradient compression schemes that significantly reduce the communication overhead while maintaining the performance of vanilla SGD.
Our compression methods perform better than the in-built methods currently offered by the deep learning frameworks.
arXiv Detail & Related papers (2021-09-26T05:16:12Z) - Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z)