Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations
- URL: http://arxiv.org/abs/1908.04207v5
- Date: Thu, 21 Aug 2025 07:39:24 GMT
- Title: Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations
- Authors: Shigang Li, Tal Ben-Nun, Salvatore Di Girolamo, Dan Alistarh, Torsten Hoefler
- Abstract summary: We propose eager-SGD, which relaxes the global synchronization for decentralized accumulation. We show that eager-SGD achieves 1.27x speedup over the state-of-the-art synchronous SGD, without losing accuracy.
- Score: 49.26578529891149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy for a wide variety of tasks, but relies on global synchronization to accumulate the gradients at every training step. In this paper, we propose eager-SGD, which relaxes the global synchronization for decentralized accumulation. To implement eager-SGD, we propose to use two partial collectives: solo and majority. With solo allreduce, the faster processes contribute their gradients eagerly without waiting for the slower processes, whereas with majority allreduce, at least half of the participants must contribute gradients before continuing, all without using a central parameter server. We theoretically prove the convergence of the algorithms and describe the partial collectives in detail. Experimental results on load-imbalanced environments (CIFAR-10, ImageNet, and UCF101 datasets) show that eager-SGD achieves 1.27x speedup over the state-of-the-art synchronous SGD, without losing accuracy.
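The two partial collectives described in the abstract can be illustrated with a minimal, single-process sketch. The function names, quorum rules, and list-based reduction below are illustrative simplifications, not the paper's MPI-level implementation; in particular, eager-SGD does not simply drop stragglers' gradients but folds them into later steps:

```python
def partial_allreduce(arrived_grads, n_workers, quorum):
    """Sketch of a partial collective: the reduction proceeds as soon as
    `quorum` of the `n_workers` processes have contributed a gradient.
    Workers that have not arrived effectively contribute zero in this
    step (a simplification of eager-SGD's decentralized accumulation)."""
    if len(arrived_grads) < quorum:
        raise RuntimeError("quorum not reached; must wait for more workers")
    dim = len(arrived_grads[0])
    # Average over all workers, so absent workers count as zero contribution.
    return [sum(g[i] for g in arrived_grads) / n_workers for i in range(dim)]

# solo allreduce: a single fast process may trigger the reduction eagerly.
def solo(grads, n_workers):
    return partial_allreduce(grads, n_workers, quorum=1)

# majority allreduce: at least half the participants must contribute.
def majority(grads, n_workers):
    return partial_allreduce(grads, n_workers, quorum=n_workers // 2 + n_workers % 2)
```

For example, with four workers and only two gradients of all-ones arrived, `majority` proceeds and returns an average of 0.5 per element, while `solo` would have proceeded after the first arrival. The real implementation performs this over MPI-style collectives without a central parameter server.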
Related papers
- Do We Need Asynchronous SGD? On the Near-Optimality of Synchronous Methods [59.72933231179977]
We revisit Synchronous SGD and its robust variant, called $m$-Synchronous SGD, and theoretically show that they are nearly optimal in many heterogeneous computation scenarios. While synchronous methods are not universal solutions and there exist tasks where asynchronous methods may be necessary, we show that they are sufficient for many modern heterogeneous computation scenarios.
arXiv Detail & Related papers (2026-02-03T18:02:14Z) - Class-wise Balancing Data Replay for Federated Class-Incremental Learning [49.179631011790065]
We propose a class-wise balancing data replay method for Federated Class-Incremental Learning (FCIL). FedCBDR has two key components: 1) the global-perspective data replay module reconstructs global representations of prior tasks in a privacy-preserving manner, which then guides a class-aware and importance-sensitive sampling strategy to achieve balanced replay; 2) to handle class imbalance across tasks, the task-aware temperature scaling module adaptively adjusts the temperature of logits at both the class and instance levels based on task dynamics, reducing the model's overconfidence in majority classes while enhancing its sensitivity to minority classes.
arXiv Detail & Related papers (2025-07-10T12:46:31Z) - Ringmaster ASGD: The First Asynchronous SGD with Optimal Time Complexity [92.1840862558718]
Ringmaster ASGD achieves optimal time complexity under arbitrarily heterogeneous computation times. This makes it the first Asynchronous SGD method to meet the theoretical lower bounds for time complexity in such scenarios.
arXiv Detail & Related papers (2025-01-27T16:07:26Z) - DropCompute: simple and more robust distributed synchronous training via compute variance reduction [30.46681332866494]
We study a typical scenario in which workers are straggling due to variability in compute time.
We propose a simple yet effective decentralized method to reduce the variation among workers and thus improve the robustness of synchronous training.
arXiv Detail & Related papers (2023-06-18T16:55:31Z) - Quantize Once, Train Fast: Allreduce-Compatible Compression with Provable Guarantees [53.950234267704]
We introduce Global-QSGD, an allreduce-compatible gradient quantization method. We show that it accelerates distributed training by up to 3.51% over baseline quantization methods.
arXiv Detail & Related papers (2023-05-29T21:32:15Z) - Accelerating Parallel Stochastic Gradient Descent via Non-blocking Mini-batches [3.736244431175932]
Non-blocking SGD can address the straggler problem in a heterogeneous environment.
Non-blocking SGD takes up to 2x less time to reach the same training loss in a heterogeneous environment.
arXiv Detail & Related papers (2022-11-02T05:25:01Z) - DecentLaM: Decentralized Momentum SGD for Large-batch Deep Training [30.574484395380043]
Decentralized momentum SGD (DmSGD) is more communication-efficient than parallel momentum SGD, which incurs a global average across all computing nodes.
We propose DecentLaM, a decentralized momentum SGD method for large-batch deep training.
arXiv Detail & Related papers (2021-04-24T16:21:01Z) - DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging [4.652668321425679]
The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/backward propagation while gradients are synchronized.
DaSGD overlaps SGD updates with forward/backward propagation to hide 100% of the communication overhead.
arXiv Detail & Related papers (2020-05-31T05:43:50Z) - Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging [48.99717153937717]
We present WAGMA-SGD, a wait-avoiding group-averaging SGD variant that reduces global communication via subgroup weight exchange. We train ResNet-50 on ImageNet, a Transformer for machine translation, and deep reinforcement learning for navigation at scale. Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput.
arXiv Detail & Related papers (2020-04-30T22:11:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.