Unbiased and Sign Compression in Distributed Learning: Comparing Noise Resilience via SDEs
- URL: http://arxiv.org/abs/2502.17009v2
- Date: Fri, 28 Feb 2025 00:12:11 GMT
- Title: Unbiased and Sign Compression in Distributed Learning: Comparing Noise Resilience via SDEs
- Authors: Enea Monzio Compagnoni, Rustem Islamov, Frank Norbert Proske, Aurelien Lucchi
- Abstract summary: Distributed methods are essential for handling machine learning pipelines comprising large-scale models and datasets. The robustness of compression methods to large and heavy-tailed gradient noise, a phenomenon sometimes observed in language modeling, remains poorly understood. This work addresses this gap by analyzing Distributed Compressed SGD (DCSGD) and Distributed SignSGD (DSignSGD) using stochastic differential equations (SDEs).
- Score: 2.218667838700643
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Distributed methods are essential for handling machine learning pipelines comprising large-scale models and datasets. However, their benefits often come at the cost of increased communication overhead between the central server and agents, which can become the main bottleneck, making training costly or even unfeasible in such systems. Compression methods such as quantization and sparsification can alleviate this issue. Still, their robustness to large and heavy-tailed gradient noise, a phenomenon sometimes observed in language modeling, remains poorly understood. This work addresses this gap by analyzing Distributed Compressed SGD (DCSGD) and Distributed SignSGD (DSignSGD) using stochastic differential equations (SDEs). Our results show that DCSGD with unbiased compression is more vulnerable to noise in stochastic gradients, while DSignSGD remains robust, even under large and heavy-tailed noise. Additionally, we propose new scaling rules for hyperparameter tuning to mitigate performance degradation due to compression. These findings are empirically validated across multiple deep learning architectures and datasets, providing practical recommendations for distributed optimization.
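To make the comparison concrete, the snippet below sketches one round of each method: DCSGD with an unbiased random-sparsification compressor and DSignSGD with a majority vote over the workers' sign messages. The compressor choice, the aggregation rule, and all names and constants are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_k(g, k):
    """Unbiased random-k sparsification: keep k random coordinates, rescale by d/k."""
    d = g.size
    mask = np.zeros(d)
    mask[rng.choice(d, size=k, replace=False)] = 1.0
    return (d / k) * mask * g                      # E[compressed g] = g (unbiased)

def dcsgd_step(x, grads, lr, k):
    """DCSGD round: the server averages unbiased-compressed worker gradients."""
    return x - lr * np.mean([rand_k(g, k) for g in grads], axis=0)

def dsignsgd_step(x, grads, lr):
    """DSignSGD round: each worker sends sign(g); the server takes a majority vote."""
    votes = np.sign(np.stack(grads))               # (n_workers, d), entries in {-1, 0, +1}
    return x - lr * np.sign(votes.sum(axis=0))

# Toy round: noisy gradients of f(x) = 0.5 * ||x||^2 with heavy-tailed noise.
x = rng.normal(size=10)
grads = [x + rng.standard_t(df=2, size=10) for _ in range(8)]
x_dcsgd = dcsgd_step(x, grads, lr=0.1, k=3)
x_dsign = dsignsgd_step(x, grads, lr=0.1)
```

The paper also proposes scaling rules for hyperparameter tuning under compression; their exact form is given in the paper and is not reproduced in this sketch.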
Related papers
- Optimizing Singular Spectrum for Large Language Model Compression [95.7621116637755]
We introduce SoCo, a novel compression framework that learns to rescale the decomposed components of SVD in a data-driven manner. Thanks to the learnable singular spectrum, SoCo adaptively prunes components according to the sparsified importance scores. Experimental evaluations across multiple LLMs and benchmarks demonstrate that SoCo surpasses the state-of-the-art methods in model compression.
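As a rough illustration of the singular-spectrum idea, the sketch below rescales the singular values of a weight matrix with a (here hard-coded) scaling vector and keeps only the highest-importance components; the scaling, scoring, and pruning rule are placeholders, not SoCo's actual procedure.

```python
import numpy as np

def svd_compress(W, scale, keep_ratio=0.25):
    """Rescale the singular spectrum with a (learned) vector `scale`, then keep only
    the components with the largest rescaled importance (a stand-in for SoCo's
    sparsified importance scores)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    importance = scale * s                          # data-driven rescaling, learned elsewhere
    k = max(1, int(keep_ratio * s.size))
    keep = np.argsort(importance)[::-1][:k]         # prune the low-importance components
    return U[:, keep] * importance[keep], Vt[keep]  # two factors of the compressed matrix

W = np.random.default_rng(1).normal(size=(64, 32))
A, B = svd_compress(W, scale=np.ones(32))           # identity rescaling for illustration
W_hat = A @ B                                       # low-rank approximation of W
```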
arXiv Detail & Related papers (2025-02-20T23:18:39Z) - Accelerated Methods with Compressed Communications for Distributed Optimization Problems under Data Similarity [55.03958223190181]
We propose the first theoretically grounded accelerated algorithms utilizing unbiased and biased compression under data similarity. Our theoretical results are the best known for this setting and are confirmed by experiments on different average losses and datasets.
arXiv Detail & Related papers (2024-12-21T00:40:58Z) - Data-Aware Gradient Compression for FL in Communication-Constrained Mobile Computing [20.70238092277094]
Federated Learning (FL) in mobile environments faces significant communication bottlenecks.
A one-size-fits-all compression approach does not account for the varying data volumes across workers.
We propose assigning varying compression ratios to workers with distinct data distributions and volumes.
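A minimal sketch of the data-aware idea, under the assumption that workers holding more data receive a milder compression ratio; the proportional rule and the top-k compressor are illustrative choices, not the paper's allocation scheme.

```python
import numpy as np

def top_k(g, k):
    """Keep only the k largest-magnitude coordinates of a gradient."""
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -k)[-k:]
    out[idx] = g[idx]
    return out

def data_aware_budgets(data_volumes, d, avg_ratio=0.05):
    """Give each worker a top-k budget proportional to its share of the total data,
    so workers with more data are compressed less aggressively."""
    shares = np.asarray(data_volumes, dtype=float)
    shares /= shares.sum()
    return np.maximum(1, (avg_ratio * d * shares * len(data_volumes)).astype(int))

d = 1000
budgets = data_aware_budgets([100, 400, 1500], d)   # larger datasets -> larger k
rng = np.random.default_rng(2)
messages = [top_k(rng.normal(size=d), k) for k in budgets]
```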
arXiv Detail & Related papers (2023-11-13T13:24:09Z) - FedDIP: Federated Learning with Extreme Dynamic Pruning and Incremental Regularization [5.182014186927254]
Federated Learning (FL) has been successfully adopted for distributed training and inference of large-scale Deep Neural Networks (DNNs).
We contribute a novel FL framework (coined FedDIP) that combines (i) dynamic model pruning with error feedback, eliminating redundant information exchange, and (ii) incremental regularization.
We provide convergence analysis of FedDIP and report on a comprehensive performance and comparative assessment against state-of-the-art methods.
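For intuition, a minimal sketch of dynamic pruning with error feedback in the spirit of FedDIP; the top-k mask, the error-feedback form, and the omission of the incremental regularization term are simplifying assumptions.

```python
import numpy as np

def pruned_update_with_feedback(delta, err, sparsity=0.9):
    """Transmit only the largest-magnitude entries of (update + carried error);
    the untransmitted residual is accumulated and re-injected in the next round."""
    corrected = delta + err
    k = max(1, int((1.0 - sparsity) * corrected.size))
    idx = np.argpartition(np.abs(corrected), -k)[-k:]
    sent = np.zeros_like(corrected)
    sent[idx] = corrected[idx]                     # sparse message to the server
    return sent, corrected - sent                  # error carried to the next round

rng = np.random.default_rng(3)
err = np.zeros(100)
for _ in range(3):                                 # a few simulated communication rounds
    delta = rng.normal(size=100)                   # local model update of one client
    sent, err = pruned_update_with_feedback(delta, err)
```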
arXiv Detail & Related papers (2023-09-13T08:51:19Z) - Compressed Regression over Adaptive Networks [58.79251288443156]
We derive the performance achievable by a network of distributed agents that solve, adaptively and in the presence of communication constraints, a regression problem.
We devise an optimized allocation strategy where the parameters necessary for the optimization can be learned online by the agents.
arXiv Detail & Related papers (2023-04-07T13:41:08Z) - $z$-SignFedAvg: A Unified Stochastic Sign-based Compression for Federated Learning [14.363110221372274]
Federated Learning (FL) is a promising privacy-preserving distributed learning paradigm.
FL suffers from high communication cost when training large-scale machine learning models.
We propose a novel noisy perturbation scheme with a general symmetric noise distribution for sign-based compression.
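The basic mechanism can be sketched as follows: perturb each coordinate with zero-mean symmetric noise before taking the sign, so that the scaled 1-bit message is an unbiased estimate of the gradient whenever the noise support covers it. The uniform noise, the scaling, and the averaging step below are illustrative assumptions, not the exact $z$-SignFedAvg estimator.

```python
import numpy as np

rng = np.random.default_rng(4)

def noisy_sign(g, b):
    """Stochastic sign: add symmetric uniform noise U(-b, b), then take the sign.
    For |g_i| <= b, b * E[sign(g_i + u_i)] = g_i, so the scaled 1-bit message is unbiased."""
    return np.sign(g + rng.uniform(-b, b, size=g.shape))

def aggregate_signs(sign_msgs, b, lr):
    """Server step: average the workers' 1-bit messages and rescale by b."""
    return -lr * b * np.mean(np.stack(sign_msgs), axis=0)

g_workers = [rng.normal(size=50) for _ in range(8)]            # one gradient per worker
step = aggregate_signs([noisy_sign(g, b=5.0) for g in g_workers], b=5.0, lr=0.01)
```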
arXiv Detail & Related papers (2023-02-06T06:54:49Z) - Quantization for Distributed Optimization [0.0]
We present a set of all-reduce-compatible gradient compression schemes which significantly reduce the communication overhead while maintaining the performance of vanilla SGD.
Our compression methods perform better than the built-in methods currently offered by deep learning frameworks.
arXiv Detail & Related papers (2021-09-26T05:16:12Z) - Fast Distributionally Robust Learning with Variance Reduced Min-Max Optimization [85.84019017587477]
Distributionally robust supervised learning is emerging as a key paradigm for building reliable machine learning systems for real-world applications.
Existing algorithms for solving Wasserstein DRSL involve solving complex subproblems or fail to make use of gradients.
We revisit Wasserstein DRSL through the lens of min-max optimization and derive scalable and efficiently implementable extra-gradient algorithms.
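For reference, the classical extra-gradient template underlying such algorithms takes a probe step for both players and then updates using gradients evaluated at the probe point; the bilinear toy objective below only stands in for the Wasserstein DRSL problem.

```python
import numpy as np

def extragradient(grad_x, grad_y, x, y, lr=0.1, steps=200):
    """Extra-gradient for min_x max_y f(x, y): take a probe step for both players,
    then update using the gradients evaluated at the probe point."""
    for _ in range(steps):
        x_half = x - lr * grad_x(x, y)
        y_half = y + lr * grad_y(x, y)
        x = x - lr * grad_x(x_half, y_half)
        y = y + lr * grad_y(x_half, y_half)
    return x, y

# Bilinear toy saddle problem f(x, y) = x @ y with saddle point (0, 0),
# where plain simultaneous gradient descent-ascent would diverge.
gx = lambda x, y: y            # d f / d x
gy = lambda x, y: x            # d f / d y
x, y = extragradient(gx, gy, np.ones(5), -np.ones(5))
```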
arXiv Detail & Related papers (2021-04-27T16:56:09Z) - ScaleCom: Scalable Sparsified Gradient Compression for Communication-Efficient Distributed Training [74.43625662170284]
Large-scale distributed training of Deep Neural Networks (DNNs) on state-of-the-art platforms is expected to be severely communication constrained.
We propose a new compression technique that leverages similarity in the gradient distribution amongst learners to provide significantly improved scalability.
We experimentally demonstrate that ScaleCom has small overheads, directly reduces gradient traffic and provides high compression rates (65-400X) and excellent scalability (up to 64 learners and 8-12X larger batch sizes over standard training) without significant accuracy loss.
arXiv Detail & Related papers (2021-04-21T02:22:10Z) - On the Utility of Gradient Compression in Distributed Training Systems [9.017890174185872]
We evaluate the efficacy of gradient compression methods and compare their scalability with optimized implementations of synchronous data-parallel SGD.
Surprisingly, we observe that due to computation overheads introduced by gradient compression, the net speedup over vanilla data-parallel training is marginal, if not negative.
arXiv Detail & Related papers (2021-02-28T15:58:45Z) - An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC).
Our evaluation shows SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Top-k, and DGC compressors, respectively.
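The threshold-estimation idea can be sketched by fitting a simple sparsity-inducing distribution to the gradient magnitudes and reading the threshold off its tail quantile; the single-stage exponential fit below is an illustrative simplification of SIDCo's procedure.

```python
import numpy as np

def estimate_threshold(g, target_ratio):
    """Fit an exponential distribution to |g| and pick the threshold whose tail mass
    equals the target ratio of entries to keep: P(|g| > t) = exp(-lam * t) = target_ratio."""
    lam = 1.0 / np.mean(np.abs(g))                 # maximum-likelihood rate of the fit
    return -np.log(target_ratio) / lam

rng = np.random.default_rng(5)
g = rng.laplace(size=100_000)                      # stand-in for a flattened gradient tensor
t = estimate_threshold(g, target_ratio=0.001)      # aim to keep roughly 0.1% of entries
sparse_g = np.where(np.abs(g) > t, g, 0.0)
kept = np.count_nonzero(sparse_g) / g.size         # close to 0.001 when the fit is adequate
```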
arXiv Detail & Related papers (2021-01-26T13:06:00Z) - On Communication Compression for Distributed Optimization on Heterogeneous Data [28.197694894254305]
Lossy gradient compression has become a key tool to avoid the communication bottleneck in distributed training of machine learning models.
We analyze the performance of two standard and general types of methods: (i) distributed quantized SGD with arbitrary unbiased quantizers and (ii) distributed SGD with error-feedback and biased compressors.
Our results indicate that D-EF-SGD is much less affected than D-QSGD by non-iid data, but both methods can suffer a slowdown if data skewness is high.
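To fix notation, the sketch below shows the two compressor families analyzed there: a QSGD-style quantizer as one instance of the arbitrary unbiased quantizers allowed in D-QSGD, and a biased top-k compressor wrapped in error feedback as in D-EF-SGD. The number of quantization levels and the top-k budget are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)

def stochastic_quantize(g, levels=4):
    """QSGD-style unbiased quantizer: randomly round |g|/||g|| onto a uniform grid."""
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return g
    scaled = levels * np.abs(g) / norm
    rounded = np.floor(scaled) + (rng.random(g.shape) < scaled - np.floor(scaled))
    return norm * np.sign(g) * rounded / levels    # E[output] = g

def ef_top_k(g, memory, k):
    """Biased top-k compressor wrapped in error feedback: compress g + memory
    and carry the uncompressed residual to the next iteration."""
    corrected = g + memory
    out = np.zeros_like(corrected)
    idx = np.argpartition(np.abs(corrected), -k)[-k:]
    out[idx] = corrected[idx]
    return out, corrected - out

g = rng.normal(size=1000)
q = stochastic_quantize(g)                         # message sent by a D-QSGD worker
msg, memory = ef_top_k(g, np.zeros_like(g), k=50)  # message sent by a D-EF-SGD worker
```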
arXiv Detail & Related papers (2020-09-04T20:48:08Z)