AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism
- URL: http://arxiv.org/abs/2601.22442v1
- Date: Fri, 30 Jan 2026 01:24:47 GMT
- Title: AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism
- Authors: Thalaiyasingam Ajanthan, Sameera Ramasinghe, Gil Avraham, Hadi Mohaghegh Dolatabadi, Chamin P Hewa Koneputugodage, Violetta Shevchenko, Yan Zuo, Alexander Long
- Abstract summary: We introduce asynchronous updates across both parallelism axes, relaxing the co-location requirement. We provide convergence guarantees for both sparse averaging and asynchronous updates. Experiments on large-scale language models demonstrate that our approach matches the performance of the fully synchronous baseline.
- Score: 54.8494905524997
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data and pipeline parallelism are key strategies for scaling neural network training across distributed devices, but their high communication cost necessitates co-located computing clusters with fast interconnects, limiting their scalability. We address this communication bottleneck by introducing asynchronous updates across both parallelism axes, relaxing the co-location requirement at the expense of introducing staleness between pipeline stages and data parallel replicas. To mitigate staleness, for pipeline parallelism, we adopt a weight look-ahead approach, and for data parallelism, we introduce an asynchronous sparse averaging method equipped with an exponential moving average based correction mechanism. We provide convergence guarantees for both sparse averaging and asynchronous updates. Experiments on large-scale language models (up to 1B parameters) demonstrate that our approach matches the performance of the fully synchronous baseline, while significantly reducing communication overhead.
Related papers
- AsyncHZP: Hierarchical ZeRO Parallelism with Asynchronous Scheduling for Scalable LLM Training [4.643969942380424]
We propose a novel asynchronous variant of ZeRO to achieve superior performance while maintaining simplicity and memory efficiency. Unlike traditional ZeRO, which employs over-fine-grained sharding that can lead to inefficient communication, AsyncHZP adaptively reshards parameters, gradients, and states across different replica groups. AsyncHZP consistently outperforms classic ND parallelism, achieving state-of-the-art performance without complex strategic tuning.
arXiv Detail & Related papers (2025-10-23T01:29:35Z) - CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks [57.95170323315603]
We introduce CollaPipe, a distributed learning framework that integrates collaborative pipeline parallelism with federated aggregation to support self-evolving networks. In CollaPipe, the encoder part is adaptively partitioned into variable-sized segments and deployed across mobile devices for pipeline-parallel training, while the decoder is deployed on edge servers to handle generative tasks. To enhance training efficiency, we formulate a joint optimization problem that adaptively allocates model segments, micro-batches, bandwidth, and transmission power.
arXiv Detail & Related papers (2025-09-24T07:54:01Z) - Research on Model Parallelism and Data Parallelism Optimization Methods in Large Language Model-Based Recommendation Systems [6.453224262551299]
Large language models (LLMs) in recommendation systems have become increasingly prominent. This paper systematically investigates two classes of optimization methods, model parallelism and data parallelism. Experiments conducted on a real-world recommendation dataset in a simulated service environment demonstrate that our proposed hybrid parallelism scheme increases training throughput by over 30%.
arXiv Detail & Related papers (2025-06-21T02:37:25Z) - Nesterov Method for Asynchronous Pipeline Parallel Optimization [59.79227116582264]
We introduce a variant of Nesterov Accelerated Gradient (NAG) for asynchronous optimization in Pipeline Parallelism. Specifically, we modify the look-ahead step in NAG to effectively address the staleness in gradients. We theoretically prove that our approach converges at a sublinear rate in the presence of fixed delay in gradients.
arXiv Detail & Related papers (2025-05-02T08:23:29Z) - Hybrid Approach to Parallel Stochastic Gradient Descent [0.0]
We propose a third approach to data parallelism which is a hybrid between synchronous and asynchronous approaches.
In a given time period our hybrid approach outperforms both asynchronous and synchronous approaches.
arXiv Detail & Related papers (2024-06-27T06:28:30Z) - Accelerating Distributed ML Training via Selective Synchronization [0.0]
SelSync is a practical, low-overhead method for DNN training that dynamically chooses to incur or avoid communication at each step.
Our system converges to the same or better accuracy than BSP while reducing training time by up to 14×.
arXiv Detail & Related papers (2023-07-16T05:28:59Z) - Straggler-Resilient Decentralized Learning via Adaptive Asynchronous Updates [28.813671194939225]
Fully decentralized optimization methods have been advocated as alternatives to the popular parameter server framework.
We propose a fully decentralized algorithm with adaptive asynchronous updates via adaptively determining the number of neighbor workers for each worker to communicate with.
We show that DSGD-AAU achieves a linear speedup for convergence and demonstrate its effectiveness via extensive experiments.
arXiv Detail & Related papers (2023-06-11T02:08:59Z) - Parallel Training of Deep Networks with Local Updates [84.30918922367442]
Local parallelism is a framework which parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation.
We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.
arXiv Detail & Related papers (2020-12-07T16:38:45Z) - Stochastic Optimization with Laggard Data Pipelines [65.20044914532221]
We show that "data-echoed" extensions of common optimization methods exhibit provable improvements over their synchronous counterparts.
Specifically, we show that in convex optimization with minibatches, data echoing affords speedups on the curvature-dominated part of the convergence rate, while maintaining the optimal statistical rate.
arXiv Detail & Related papers (2020-10-26T14:55:31Z) - Accelerating Feedforward Computation via Parallel Nonlinear Equation Solving [106.63673243937492]
Feedforward computation, such as evaluating a neural network or sampling from an autoregressive model, is ubiquitous in machine learning.
We frame the task of feedforward computation as solving a system of nonlinear equations. We then propose to find the solution using a Jacobi or Gauss-Seidel fixed-point method, as well as hybrid methods of both.
Our method is guaranteed to give exactly the same values as the original feedforward computation with a reduced (or equal) number of parallelizable iterations, and hence reduced time given sufficient parallel computing power.
arXiv Detail & Related papers (2020-02-10T10:11:31Z)
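The last entry above frames a feedforward chain as a system of nonlinear equations solved by a Jacobi fixed-point method. A minimal sketch of that idea follows, under stated assumptions: the chain is x_t = f_t(x_{t-1}), every state starts from an arbitrary guess, and each sweep updates all states from the previous sweep's values, so the updates inside one sweep are independent of each other. The function names, the scalar toy layers, and the zero initial guess are hypothetical choices for illustration, not taken from the paper.

```python
from typing import Callable, List, Tuple

Layer = Callable[[float], float]


def sequential_forward(layers: List[Layer], x0: float) -> List[float]:
    """Reference computation: evaluate the chain one layer after another."""
    states = []
    x = x0
    for f in layers:
        x = f(x)
        states.append(x)
    return states


def jacobi_forward(layers: List[Layer], x0: float,
                   init: float = 0.0) -> Tuple[List[float], int]:
    """Jacobi sweeps over the system x_t = f_t(x_{t-1}).

    Every sweep recomputes all states from the previous sweep's states,
    so the per-state updates are independent and could run in parallel.
    After k sweeps the first k states are exact, hence at most
    len(layers) sweeps are needed.
    """
    n = len(layers)
    states = [init] * n  # arbitrary initial guess (an assumption)
    for sweep in range(1, n + 1):
        inputs = [x0] + states[:-1]
        new_states = [f(x) for f, x in zip(layers, inputs)]
        if new_states == states:
            # Fixed point reached early; this sweep only confirmed it.
            return new_states, sweep
        states = new_states
    return states, n


# Hypothetical toy chain of three scalar "layers".
toy_layers: List[Layer] = [
    lambda x: 0.5 * x + 1.0,
    lambda x: x * x,
    lambda x: x - 3.0,
]
jacobi_states, sweeps_used = jacobi_forward(toy_layers, 2.0)
```

In this toy chain the Jacobi result coincides with the sequential result after as many sweeps as there are layers, which is the worst case. A better initial guess can reach the fixed point in fewer sweeps.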
This list is automatically generated from the titles and abstracts of the papers in this site.