Accelerating AllReduce with a Persistent Straggler
- URL: http://arxiv.org/abs/2505.23523v1
- Date: Thu, 29 May 2025 15:03:56 GMT
- Title: Accelerating AllReduce with a Persistent Straggler
- Authors: Arjun Devraj, Eric Ding, Abhishek Vijaya Kumar, Robert Kleinberg, Rachee Singh
- Abstract summary: StragglAR is an AllReduce algorithm that accelerates distributed training and inference in the presence of persistent stragglers. On an 8-GPU server, our implementation of StragglAR yields a 22% speedup over state-of-the-art AllReduce algorithms.
- Score: 5.160343192414151
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Distributed machine learning workloads use data and tensor parallelism for training and inference, both of which rely on the AllReduce collective to synchronize gradients or activations. However, bulk-synchronous AllReduce algorithms can be delayed by a persistent straggler that is slower to reach the synchronization barrier required to begin the collective. To address this challenge, we propose StragglAR: an AllReduce algorithm that accelerates distributed training and inference in the presence of persistent stragglers. StragglAR implements a ReduceScatter among the remaining GPUs during the straggler-induced delay, and then executes a novel collective algorithm to complete the AllReduce once the straggler reaches the synchronization barrier. StragglAR achieves a 2x theoretical speedup over popular bandwidth-efficient AllReduce algorithms (e.g., Ring) for large GPU clusters with persistent stragglers. On an 8-GPU server, our implementation of StragglAR yields a 22% speedup over state-of-the-art AllReduce algorithms.
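To make the schedule concrete, below is a minimal functional simulation of the high-level StragglAR pattern. It is a sketch under stated assumptions, not the paper's GPU implementation: the completion phase is a naive stand-in for the paper's novel bandwidth-efficient collective, and all function names are illustrative.

```python
import numpy as np

def straggl_ar_schedule(inputs, straggler=0):
    """Simulate the high-level StragglAR schedule on len(inputs) 'GPUs'."""
    n = len(inputs)
    others = [r for r in range(n) if r != straggler]
    shards = len(others)
    split = lambda x: np.array_split(x, shards)

    # Phase 1 (overlapped with the straggler's delay): ReduceScatter among
    # the n-1 on-time GPUs; GPU others[j] owns shard j of their partial sum.
    partial = [sum(split(inputs[r])[j] for r in others) for j in range(shards)]

    # Phase 2 (after the straggler reaches the barrier): fold in the
    # straggler's data. StragglAR uses a novel bandwidth-efficient collective
    # here; this naive version just adds the straggler's shards.
    strag = split(inputs[straggler])
    full = [partial[j] + strag[j] for j in range(shards)]

    # Final AllGather: every GPU, straggler included, holds the full sum.
    result = np.concatenate(full)
    return [result.copy() for _ in range(n)]

# Sanity check against a direct elementwise sum over all inputs.
xs = [np.random.rand(16) for _ in range(8)]
assert all(np.allclose(y, sum(xs)) for y in straggl_ar_schedule(xs, 3))
```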
Related papers
- Ringmaster ASGD: The First Asynchronous SGD with Optimal Time Complexity [92.1840862558718]
Ringmaster ASGD achieves optimal time complexity under arbitrarily heterogeneous computation times. This makes it the first Asynchronous SGD method to meet the theoretical lower bounds for time complexity in such scenarios.
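As a rough illustration of the Asynchronous SGD family this result concerns, here is a generic stale-gradient loop; it is not Ringmaster ASGD itself, and the quadratic objective and staleness model are our own toy assumptions.

```python
import numpy as np

# Generic Asynchronous SGD on a quadratic f(w) = 0.5 * ||w||^2, where
# workers return gradients computed at stale copies of the parameters.
rng = np.random.default_rng(0)
w, lr, staleness = np.ones(4), 0.1, 3
history = [w.copy()]  # past iterates, so we can index stale versions

for step in range(100):
    # A heterogeneous worker finishes with a gradient that is up to
    # `staleness` steps old; the server applies it immediately.
    delay = int(rng.integers(0, staleness + 1))
    stale_w = history[max(0, len(history) - 1 - delay)]
    grad = stale_w  # gradient of 0.5*||w||^2 at the stale point
    w = w - lr * grad
    history.append(w.copy())

print(np.linalg.norm(w))  # shrinks toward 0 despite stale gradients
```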
arXiv Detail & Related papers (2025-01-27T16:07:26Z)
- ACCO: Accumulate While You Communicate for Communication-Overlapped Sharded LLM Training [16.560270624096706]
We propose ACcumulate while COmmunicate (ACCO), a memory-efficient optimization algorithm for distributed LLM training. By synchronizing delayed gradients while computing new ones, ACCO reduces idle time and supports heterogeneous hardware. Compared to ZeRO-1, our approach is significantly faster and scales effectively across heterogeneous hardware.
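A toy sketch of the delayed-synchronization pattern the summary describes, assuming the core idea is to overlap the reduction of the previous step's gradients with the next step's compute; this is an illustration, not ACCO's actual sharded implementation.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def local_grad(w, rank):
    # Stand-in for each worker's backward pass on its own data shard.
    rng = np.random.default_rng(rank)
    return w + 0.01 * rng.standard_normal(w.shape)

def all_reduce(grads):
    # Stand-in for the (slow) gradient synchronization across workers.
    return sum(grads) / len(grads)

n_workers, lr = 4, 0.1
w = np.ones(8)
pending = None  # handle for the previous step's in-flight reduction

with ThreadPoolExecutor(max_workers=1) as comm:
    for step in range(10):
        grads = [local_grad(w, r) for r in range(n_workers)]  # compute
        if pending is not None:
            w = w - lr * pending.result()  # apply last step's delayed avg
        pending = comm.submit(all_reduce, grads)  # overlap comm with compute
    w = w - lr * pending.result()  # drain the final reduction
```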
arXiv Detail & Related papers (2024-06-03T08:23:45Z)
- Asynchronous Distributed Optimization with Delay-free Parameters [9.062164411594175]
This paper develops asynchronous versions of two distributed algorithms, Prox-DGD and DGD-ATC, for solving consensus optimization problems over undirected networks.
In contrast to alternatives, our algorithms can converge to the fixed point set of their synchronous counterparts using step-sizes that are independent of the delays.
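For intuition, here is a tiny simulation of asynchronous consensus optimization with stale neighbor information; it is a generic decentralized-gradient-style update of our own construction, not the paper's Prox-DGD or DGD-ATC.

```python
import numpy as np

# Asynchronous decentralized optimization on a ring of 4 agents, each
# minimizing f_i(x) = 0.5 * (x - t_i)^2. Agents wake up one at a time
# and update using possibly stale copies of their neighbors' values.
rng = np.random.default_rng(1)
targets = np.array([1.0, 2.0, 3.0, 4.0])
x = np.zeros(4)                       # current local variables
stale = [x.copy() for _ in range(5)]  # buffer of past network states
step = 0.2                            # chosen independently of the delays

for k in range(2000):
    i = int(rng.integers(0, 4))              # one agent wakes up
    snap = stale[int(rng.integers(0, len(stale)))]  # stale neighbor view
    left, right = snap[(i - 1) % 4], snap[(i + 1) % 4]
    mix = (x[i] + left + right) / 3.0        # local averaging step
    x[i] = mix - step * (x[i] - targets[i])  # plus a local gradient step
    stale.append(x.copy())                   # age the state buffer
    stale.pop(0)

print(x)  # entries settle near mean(targets) = 2.5, up to O(step) error
```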
arXiv Detail & Related papers (2023-12-11T16:33:38Z)
- Robust Fully-Asynchronous Methods for Distributed Training over General Architecture [11.480605289411807]
Perfect synchronization in distributed machine learning is inefficient and even impossible due to latency, packet losses, and stragglers.
We propose the Robust Fully-Asynchronous Gradient Tracking method (R-FAST), where each device performs local computation and communication at its own pace, without any form of synchronization.
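For context, the sketch below shows the standard synchronous gradient-tracking loop that R-FAST builds on and makes fully asynchronous; the problem instance and mixing matrix are illustrative.

```python
import numpy as np

# Synchronous gradient tracking on 3 agents minimizing
# sum_i 0.5*(x - t_i)^2; the optimum is mean(t). Each agent maintains an
# auxiliary variable y_i that tracks the network-average gradient.
W = np.array([[0.6, 0.2, 0.2],
              [0.2, 0.6, 0.2],
              [0.2, 0.2, 0.6]])        # doubly stochastic mixing matrix
t = np.array([0.0, 3.0, 6.0])
grad = lambda x: x - t                 # per-agent gradients
x = np.zeros(3)
y = grad(x).copy()                     # y tracks the average gradient
lr = 0.3

for _ in range(200):
    x_new = W @ x - lr * y             # mix with neighbors, step along y
    y = W @ y + grad(x_new) - grad(x)  # update the gradient tracker
    x = x_new

print(x)  # every agent converges to mean(t) = 3.0
```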
arXiv Detail & Related papers (2023-07-21T14:36:40Z)
- DeAR: Accelerating Distributed Deep Learning with Fine-Grained All-Reduce Pipelining [22.168137965177284]
Communication scheduling has been shown to be effective in accelerating distributed training.
We propose a novel scheduling algorithm, DeAR, that decouples the all-reduce primitive into two continuous operations.
We show that DeAR achieves up to 83% and 15% training speedup over the state-of-the-art solutions.
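The decoupling DeAR exploits is the classic identity that a bandwidth-efficient all-reduce equals a reduce-scatter followed by an all-gather; the numpy simulation below illustrates that identity, not DeAR's pipelined implementation.

```python
import numpy as np

def reduce_scatter(bufs):
    # After this step, rank j owns shard j of the elementwise sum.
    n = len(bufs)
    shards = [np.array_split(b, n) for b in bufs]
    return [sum(shards[r][j] for r in range(n)) for j in range(n)]

def all_gather(owned):
    # Every rank receives every shard and reassembles the full vector.
    full = np.concatenate(owned)
    return [full.copy() for _ in range(len(owned))]

# all_reduce(bufs) == all_gather(reduce_scatter(bufs)). DeAR schedules
# the two halves separately so each can overlap with backpropagation.
bufs = [np.random.rand(12) for _ in range(4)]
out = all_gather(reduce_scatter(bufs))
assert all(np.allclose(o, sum(bufs)) for o in out)
```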
arXiv Detail & Related papers (2023-02-24T04:11:18Z)
- Teal: Learning-Accelerated Optimization of WAN Traffic Engineering [68.7863363109948]
We present Teal, a learning-based TE algorithm that leverages the parallel processing power of GPUs to accelerate TE control.
To reduce the problem scale and make learning tractable, Teal employs a multi-agent reinforcement learning (RL) algorithm to independently allocate each traffic demand.
Compared with other TE acceleration schemes, Teal satisfies 6--32% more traffic demand and yields 197--625x speedups.
arXiv Detail & Related papers (2022-10-25T04:46:30Z)
- SYNTHESIS: A Semi-Asynchronous Path-Integrated Stochastic Gradient Method for Distributed Learning in Computing Clusters [7.968142741470549]
A semi-asynchronous path-integrated stochastic gradient search algorithm (SYNTHESIS) is developed to overcome the limitations of both synchronous and asynchronous distributed learning algorithms.
SYNTHESIS has $O(\sqrt{N}\epsilon^{-2}(\Delta+1)d+N)$ and $O(\sqrt{N}\epsilon^{-2}(\Delta+1)d+N)$ computation and communication complexities, respectively, for achieving an
$\epsilon$-stationary point in non-convex learning under distributed and shared memory architectures.
arXiv Detail & Related papers (2022-08-17T17:42:33Z)
- Gradient Coding with Dynamic Clustering for Straggler-Tolerant Distributed Learning [55.052517095437]
Gradient descent (GD) is widely employed to parallelize the learning task by distributing the dataset across multiple workers.
A significant performance bottleneck for the per-iteration completion time in distributed synchronous GD is straggling workers.
Coded distributed techniques have been introduced recently to mitigate stragglers and to speed up GD iterations by assigning redundant computations to workers.
We propose a novel dynamic GC scheme that assigns redundant data to workers, providing the flexibility to choose among a set of possible codes based on past straggling behavior.
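For intuition about the codes such schemes choose among, here is a toy static fractional-repetition gradient code tolerating one straggler; this is our illustrative example, not the paper's dynamic scheme.

```python
import numpy as np

# Fractional-repetition gradient coding: 4 workers, 2 data partitions,
# each partition replicated on 2 workers, so any single straggler can
# be ignored. grad[p] stands in for the gradient over partition p.
grad = {p: np.full(3, float(p + 1)) for p in range(2)}  # toy gradients
assignment = {0: [0], 1: [0], 2: [1], 3: [1]}  # worker -> partitions

def worker_output(w):
    # Each worker returns the sum of gradients over its partitions.
    return sum(grad[p] for p in assignment[w])

def decode(finished):
    # The master needs one finished worker per replication group.
    total, covered = np.zeros(3), set()
    for w in finished:
        group = tuple(assignment[w])
        if group not in covered:
            covered.add(group)
            total += worker_output(w)
    assert len(covered) == 2, "an entire replication group straggled"
    return total

# Worker 1 straggles; the master still recovers the full gradient.
full = decode(finished=[0, 2, 3])
assert np.allclose(full, grad[0] + grad[1])
```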
arXiv Detail & Related papers (2021-03-01T18:51:29Z)
- Gradient Coding with Dynamic Clustering for Straggler Mitigation [57.9123881133818]
The proposed scheme, GC-DC, regulates the number of straggling workers in each cluster based on the straggler behavior in the previous iteration.
We numerically show that GC-DC provides significant improvements in the average completion time (of each iteration) with no increase in the communication load compared to the original GC scheme.
arXiv Detail & Related papers (2020-11-03T18:52:15Z)
- Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in the form of simultaneous primal- and dual-side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z)
- Lagrangian Decomposition for Neural Network Verification [148.0448557991349]
A fundamental component of neural network verification is the computation of bounds on the values their outputs can take.
We propose a novel approach based on Lagrangian Decomposition.
We show that we obtain bounds comparable with off-the-shelf solvers in a fraction of their running time.
arXiv Detail & Related papers (2020-02-24T17:55:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.