Mitigating Staleness in Asynchronous Pipeline Parallelism via Basis Rotation
- URL: http://arxiv.org/abs/2602.03515v1
- Date: Tue, 03 Feb 2026 13:31:51 GMT
- Title: Mitigating Staleness in Asynchronous Pipeline Parallelism via Basis Rotation
- Authors: Hyunji Jung, Sungbin Shin, Namhoon Lee
- Abstract summary: Asynchronous pipeline parallelism maximizes hardware utilization by eliminating the pipeline bubbles inherent in synchronous execution. This efficiency gain can be compromised by gradient staleness, where immediately applying model updates computed from delayed gradients introduces noise into the optimization process. We propose the use of basis rotation, demonstrating that it effectively mitigates the alignment issue and significantly accelerates convergence in asynchronous settings.
- Score: 6.710608163117798
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Asynchronous pipeline parallelism maximizes hardware utilization by eliminating the pipeline bubbles inherent in synchronous execution, offering a path toward efficient large-scale distributed training. However, this efficiency gain can be compromised by gradient staleness: immediately applying model updates computed from delayed gradients introduces noise into the optimization process. Crucially, we identify an often overlooked pathology: this delay scales linearly with pipeline depth, fundamentally undermining the very scalability the method is intended to provide. In this work, we investigate this inconsistency and bridge the gap by rectifying delayed gradients through basis rotation, restoring scalable asynchronous training while maintaining performance. Specifically, we observe that the deleterious effects of delayed gradients are exacerbated when the Hessian eigenbasis is misaligned with the standard coordinate basis. We demonstrate that this misalignment prevents coordinate-wise adaptive schemes, such as Adam, from effectively leveraging curvature-aware adaptivity. This failure leads to significant oscillations in the optimization trajectory and, consequently, slower convergence. We substantiate these findings through both rigorous theoretical analysis and empirical evaluation. To address this challenge, we propose the use of basis rotation, demonstrating that it effectively mitigates the alignment issue and significantly accelerates convergence in asynchronous settings. For example, our training of a 1B-parameter LLM with basis rotation achieves the same training loss in 76.8% fewer iterations compared to the best-performing asynchronous pipeline-parallel training baseline.
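The core claim above is easy to illustrate on a synthetic problem. Below is a minimal, hedged sketch (not the paper's implementation): Adam updates with a fixed gradient delay on an ill-conditioned quadratic whose Hessian eigenbasis is rotated away from the coordinate axes. The `rotate` flag applies the gradient and the update in the exact eigenbasis `Q`, which is known here only by construction; a practical method would have to estimate the rotation.

```python
# Toy sketch, not the paper's implementation: Adam with delayed gradients on
# a quadratic whose Hessian eigenbasis is rotated off the coordinate axes.
import numpy as np

rng = np.random.default_rng(0)
d, delay, steps, lr = 8, 4, 2000, 1e-2

# Hessian H = Q diag(lams) Q^T with a random rotation Q, so curvature is
# deliberately misaligned with the standard coordinate basis.
lams = np.logspace(0, 3, d)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
H = Q @ np.diag(lams) @ Q.T

def adam_with_delay(rotate: bool) -> float:
    x = np.ones(d)
    m, v = np.zeros(d), np.zeros(d)
    history = [x.copy() for _ in range(delay + 1)]  # stale snapshots
    for t in range(1, steps + 1):
        g = H @ history[0]          # gradient computed on stale parameters
        if rotate:
            g = Q.T @ g             # rotate into the Hessian eigenbasis
        m = 0.9 * m + 0.1 * g
        v = 0.999 * v + 0.001 * g**2
        step = lr * (m / (1 - 0.9**t)) / (np.sqrt(v / (1 - 0.999**t)) + 1e-8)
        x = x - (Q @ step if rotate else step)      # rotate the update back
        history = history[1:] + [x.copy()]
    return 0.5 * x @ H @ x

print("loss, standard basis:", adam_with_delay(rotate=False))
print("loss, rotated basis :", adam_with_delay(rotate=True))
```

In the rotated basis the quadratic decouples coordinate-wise, which is exactly the regime where Adam's per-coordinate adaptivity is effective, matching the alignment argument in the abstract.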
Related papers
- Temporal Pair Consistency for Variance-Reduced Flow Matching [13.328987133593154]
Temporal Pair Consistency (TPC) is a lightweight variance-reduction principle that couples velocity predictions at paired timesteps along the same probability path. Instantiated within flow matching, TPC improves sample quality and efficiency across CIFAR-10 and ImageNet at multiple resolutions. A hedged toy sketch of this coupling follows this entry.
arXiv Detail & Related papers (2026-02-04T00:05:21Z)
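As a rough illustration of the TPC summary, the sketch below couples velocity predictions at two timesteps drawn on the same linear probability path. The pairing rule, the toy linear model `velocity`, and the 0.1 weighting are illustrative assumptions, not the paper's construction.

```python
# Hedged sketch of a TPC-style coupling on a linear path
# x_t = (1 - t) * x0 + t * x1, where the target velocity is x1 - x0 at all t.
import numpy as np

rng = np.random.default_rng(0)
d = 16

def velocity(x, t, W):              # toy linear velocity model v(x, t)
    return x @ W + t

x0 = rng.normal(size=(32, d))       # noise samples
x1 = rng.normal(size=(32, d))       # data samples
W = 0.01 * rng.normal(size=(d, d))

t1 = rng.uniform(size=(32, 1))
t2 = rng.uniform(size=(32, 1))      # paired timestep on the SAME path
xt1 = (1 - t1) * x0 + t1 * x1
xt2 = (1 - t2) * x0 + t2 * x1
target = x1 - x0                    # constant target velocity on a linear path

fm_loss = np.mean((velocity(xt1, t1, W) - target) ** 2)
# On a linear path both predictions estimate the same velocity, so their
# difference is noise that a TPC-style consistency term damps.
tpc_loss = np.mean((velocity(xt1, t1, W) - velocity(xt2, t2, W)) ** 2)
loss = fm_loss + 0.1 * tpc_loss     # 0.1 is an assumed weighting
print("combined loss:", loss)
```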
- Nesterov Method for Asynchronous Pipeline Parallel Optimization [59.79227116582264]
We introduce a variant of Nesterov Accelerated Gradient (NAG) for asynchronous optimization in Pipeline Parallelism. Specifically, we modify the look-ahead step in NAG to effectively address the staleness in gradients. We theoretically prove that our approach converges at a sublinear rate in the presence of a fixed delay in gradients. A toy sketch of this idea follows this entry.
arXiv Detail & Related papers (2025-05-02T08:23:29Z)
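A hedged sketch of the look-ahead modification described above: with a fixed delay, the gradient is evaluated at a point extrapolated from the stale parameters along the momentum direction, stretched here by a factor of (delay + 1) as an assumed compensation. The paper's exact modification may differ.

```python
# Toy delayed-NAG loop on a quadratic; the (delay + 1) stretch is an
# illustrative assumption for compensating the gradient delay.
import numpy as np

d, delay, lr, mu, steps = 4, 2, 0.02, 0.5, 1000
A = np.diag(np.linspace(1.0, 4.0, d))       # toy quadratic loss 0.5 x^T A x

x = np.ones(d)
v = np.zeros(d)
history = [x.copy() for _ in range(delay + 1)]

for _ in range(steps):
    stale = history[0]                        # params from `delay` steps ago
    lookahead = stale + (delay + 1) * mu * v  # stretched look-ahead point
    g = A @ lookahead                         # delayed gradient at that point
    v = mu * v - lr * g
    x = x + v
    history = history[1:] + [x.copy()]

print("final loss:", 0.5 * x @ A @ x)
```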
- Optimizing Asynchronous Federated Learning: A Delicate Trade-Off Between Model-Parameter Staleness and Update Frequency [2.1990852305468533]
Synchronous federated learning (FL) scales poorly with the number of clients due to the straggler effect. In this work, we rely on modeling and analysis to better understand the impact of design choices in asynchronous FL algorithms. In particular, we characterize a fundamental trade-off for optimizing asynchronous FL: minimizing gradient-estimation error by avoiding model-parameter staleness versus maximizing the frequency of model updates.
arXiv Detail & Related papers (2025-02-12T08:38:13Z)
- Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}(\ln(T) / T^{1 - \frac{1}{\alpha}})$. A toy server-side aggregation loop is sketched below.
arXiv Detail & Related papers (2024-03-11T09:10:37Z)
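For flavor, here is a minimal server-side AdaGrad-style aggregation loop on toy quadratic clients. The over-the-air analog aggregation and channel-noise model of the paper are deliberately omitted; the noiseless mean below is an assumption.

```python
# Hedged sketch: federated AdaGrad at the server over toy quadratic clients.
import numpy as np

rng = np.random.default_rng(0)
d, rounds, lr = 8, 200, 0.1

targets = rng.normal(size=(5, d))       # each client fits its own target
w = np.zeros(d)
accum = np.zeros(d)                     # server-side AdaGrad accumulator

for _ in range(rounds):
    grads = w - targets                 # client gradients of 0.5||w - t_i||^2
    g = grads.mean(axis=0)              # idealized noiseless aggregation
    accum += g ** 2
    w -= lr * g / (np.sqrt(accum) + 1e-8)

print("distance to mean target:", np.linalg.norm(w - targets.mean(axis=0)))
```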
- Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training. We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators. A lookahead-style toy sketch follows this entry.
arXiv Detail & Related papers (2023-10-20T12:45:12Z)
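The stabilization effect is easy to reproduce on the classic rotational vector field, where plain gradient steps spiral outward. The k-inner-step, lookahead-style interpolation below is a stand-in consistent with the summary, not necessarily the paper's exact scheme.

```python
# Toy sketch: plain steps on a skew-symmetric (nonmonotone) field diverge,
# but linearly interpolating a slow iterate toward k fast steps contracts.
import numpy as np

A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])             # pure-rotation game dynamics
k, alpha, lr, outer = 4, 0.3, 1.0, 50

slow = np.array([1.0, 1.0])
for _ in range(outer):
    z = slow.copy()
    for _ in range(k):                  # k plain (expanding) fast steps
        z = z - lr * (A @ z)
    slow = slow + alpha * (z - slow)    # linear interpolation step

print("norm after interpolation scheme:", np.linalg.norm(slow))
```

Here each fast step expands the iterate by a factor of sqrt(2) while rotating it 45 degrees, yet the interpolated map contracts by 0.5 per outer round, which is the nonexpansive-operator intuition in miniature.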
- Delay-adaptive step-sizes for asynchronous learning [8.272788656521415]
We show that it is possible to use learning rates that depend on the actual time-varying delays in the system.
For each of these methods, we demonstrate how delays can be measured online, present delay-adaptive step-size policies, and illustrate their theoretical and practical advantages over the state of the art. A toy delay-adaptive loop follows this entry.
arXiv Detail & Related papers (2022-02-17T09:51:22Z)
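A hedged sketch of the policy described above: each gradient's actual delay is measured online and the step size is scaled down with it. The 1/(1 + delay) rule is an illustrative choice; the paper derives method-specific policies.

```python
# Toy delay-adaptive step sizes with random, time-varying gradient delays.
import numpy as np

rng = np.random.default_rng(0)
d, lr0, steps = 4, 0.4, 300
A = np.diag(np.linspace(1.0, 4.0, d))   # toy quadratic loss 0.5 x^T A x

x = np.ones(d)
snapshots = [x.copy()]

for _ in range(steps):
    tau = int(rng.integers(0, 4))       # measured, time-varying delay
    tau = min(tau, len(snapshots) - 1)
    g = A @ snapshots[-1 - tau]         # gradient on tau-step-old params
    x = x - (lr0 / (1 + tau)) * g       # delay-adaptive step size
    snapshots.append(x.copy())

print("final loss:", 0.5 * x @ A @ x)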
- From Deterioration to Acceleration: A Calibration Approach to Rehabilitating Step Asynchronism in Federated Optimization [13.755421424240048]
We propose a new algorithm, FedaGrac, which calibrates the local direction to a predictive global orientation.
We theoretically prove that FedaGrac achieves an improved order of convergence rate compared to the state-of-the-art approaches. A toy sketch of such calibration follows this entry.
arXiv Detail & Related papers (2021-12-17T07:26:31Z)
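In the spirit of the FedaGrac summary, the sketch below calibrates each client's local steps toward a predicted global direction with a control-variate-style correction. That correction rule is an illustrative stand-in, not the paper's exact calibration.

```python
# Hedged sketch: local steps steered toward a predictive global orientation.
import numpy as np

rng = np.random.default_rng(0)
d, rounds, local_steps, lr = 6, 100, 5, 0.05
targets = rng.normal(size=(4, d))           # one quadratic optimum per client
w = np.zeros(d)

for _ in range(rounds):
    grads_at_w = w - targets                # each client's gradient at w
    global_dir = grads_at_w.mean(axis=0)    # predictive global orientation
    locals_ = []
    for i, tgt in enumerate(targets):
        wi = w.copy()
        correction = global_dir - grads_at_w[i]   # steer local drift globally
        for _ in range(local_steps):
            wi -= lr * ((wi - tgt) + correction)
        locals_.append(wi)
    w = np.mean(locals_, axis=0)            # server averages calibrated models

print("distance to global optimum:", np.linalg.norm(w - targets.mean(axis=0)))
```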
- Critical Parameters for Scalable Distributed Learning with Large Batches and Asynchronous Updates [67.19481956584465]
It has been experimentally observed that the efficiency of distributed training with stochastic gradient descent (SGD) depends decisively on the batch size and -- in asynchronous implementations -- on the gradient staleness.
We show that our results are tight and illustrate key findings in numerical experiments.
arXiv Detail & Related papers (2021-03-03T12:08:23Z)
- Stochastic Optimization with Laggard Data Pipelines [65.20044914532221]
We show that "dataechoed" extensions of common optimization methods exhibit provable improvements over their synchronous counterparts.
Specifically, we show that in convex optimization with minibatches, data echoing affords speedups on the curvature-dominated part of the convergence rate, while maintaining the optimal statistical rate. A toy data-echoing loop follows this entry.
arXiv Detail & Related papers (2020-10-26T14:55:31Z)
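A minimal sketch of data echoing as summarized above: while the input pipeline is the bottleneck, each fetched minibatch is reused ("echoed") for several optimizer steps instead of letting the accelerator idle. The echo factor of 3 is an assumption.

```python
# Toy data-echoing loop: one slow fetch per outer iteration, reused 3 times.
import numpy as np

rng = np.random.default_rng(0)
d, batches, echo, lr = 8, 100, 3, 0.05
w_true = rng.normal(size=d)
w = np.zeros(d)

for _ in range(batches):                # each iteration = one slow data fetch
    X = rng.normal(size=(16, d))        # minibatch from the laggard pipeline
    y = X @ w_true
    for _ in range(echo):               # echo: reuse the batch several times
        g = X.T @ (X @ w - y) / len(X)  # least-squares gradient on the batch
        w -= lr * g

print("parameter error:", np.linalg.norm(w - w_true))
```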
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of extrapolation variants can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer. A generic extrapolated-update sketch follows this entry.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
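As a generic instance of the kind of scheme the entry above unifies, the sketch below takes the gradient at a point extrapolated along the last update direction (extragradient-flavored) and then steps from the current iterate. The specific variant and constants are illustrative assumptions, not the paper's definitive method.

```python
# Hedged sketch of one extrapolated-gradient update on a toy quadratic.
import numpy as np

d, steps, lr, beta = 8, 300, 0.1, 0.5
A = np.diag(np.linspace(1.0, 8.0, d))   # toy quadratic, large-batch regime

x = np.ones(d)
x_prev = x.copy()

for _ in range(steps):
    x_ext = x + beta * (x - x_prev)     # extrapolated evaluation point
    g = A @ x_ext                       # gradient at the extrapolated point
    x_prev, x = x, x - lr * g           # step taken from the current iterate

print("final loss:", 0.5 * x @ A @ x)
```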