Ordered Local Momentum for Asynchronous Distributed Learning under Arbitrary Delays
- URL: http://arxiv.org/abs/2601.12322v1
- Date: Sun, 18 Jan 2026 09:05:39 GMT
- Title: Ordered Local Momentum for Asynchronous Distributed Learning under Arbitrary Delays
- Authors: Chang-Wei Shi, Shi-Shang Wang, Wu-Jun Li,
- Abstract summary: We propose a novel method, called OrLoMo, for asynchronous distributed learning.<n>In OrLoMo, each worker runs MSGD locally. Then the local momentum from each worker will be aggregated by global in order.
- Score: 22.743876197077785
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Momentum SGD (MSGD) serves as a foundational optimizer in training deep models due to momentum's key role in accelerating convergence and enhancing generalization. Meanwhile, asynchronous distributed learning is crucial for training large-scale deep models, especially when the computing capabilities of the workers in the cluster are heterogeneous. To reduce communication frequency, local updates are widely adopted in distributed learning. However, how to implement asynchronous distributed MSGD with local updates remains unexplored. To solve this problem, we propose a novel method, called \underline{or}dered \underline{lo}cal \underline{mo}mentum (OrLoMo), for asynchronous distributed learning. In OrLoMo, each worker runs MSGD locally. Then the local momentum from each worker will be aggregated by the server in order based on its global iteration index. To the best of our knowledge, OrLoMo is the first method to implement asynchronous distributed MSGD with local updates. We prove the convergence of OrLoMo for non-convex problems under arbitrary delays. Experiments validate that OrLoMo can outperform its synchronous counterpart and other asynchronous methods.
Related papers
- HALO: Semantic-Aware Distributed LLM Inference in Lossy Edge Network [50.33808558714122]
Large language models' (LLMs) inference at the edge can facilitate prompt service responsiveness while protecting user privacy.<n>We propose HALO, a novel framework that can boost the distributed LLM inference in lossy edge network.<n> Experimental results from a Raspberry Pi cluster demonstrate that HALO achieves a 3.41x end-to-end speedup for LLaMA-series LLMs under unreliable network conditions.
arXiv Detail & Related papers (2026-01-16T07:37:23Z) - CycleSL: Server-Client Cyclical Update Driven Scalable Split Learning [60.59553507555341]
We introduce CycleSL, a novel aggregation-free split learning framework.<n>Inspired by alternating block coordinate descent, CycleSL treats server-side training as an independent higher-level machine learning task.<n>Our empirical findings highlight the effectiveness of CycleSL in enhancing model performance.
arXiv Detail & Related papers (2025-11-23T21:00:21Z) - HALoS: Hierarchical Asynchronous Local SGD over Slow Networks for Geo-Distributed Large Language Model Training [41.629510958918225]
Training large language models (LLMs) increasingly relies on geographically distributed accelerators, causing prohibitive communication costs across regions and uneven utilization of heterogeneous hardware.<n>We propose HALoS, a global asynchronous optimization framework that introduces local servers (LPSs) within each region and merges hierarchical stragglers across intraregion links.<n> Empirically, HALoS attains up to 7.5x faster convergence than baselines in geodistributed training and improves upon existing asynchronous methods by up to 2.1x.
arXiv Detail & Related papers (2025-06-05T00:48:55Z) - Adaptive Deadline and Batch Layered Synchronized Federated Learning [66.93447103966439]
Federated learning (FL) enables collaborative model training across distributed edge devices while preserving data privacy, and typically operates in a round-based synchronous manner.<n>We propose ADEL-FL, a novel framework that jointly optimize per-round deadlines and user-specific batch sizes for layer-wise aggregation.
arXiv Detail & Related papers (2025-05-29T19:59:18Z) - Ordered Momentum for Asynchronous SGD [12.810976838406193]
We propose a novel method called momentum (OrMo) for ASGD.<n>In OrMo, momentum is incorporated into ASGD by organizing the gradients in order based on their indexes.<n> Empirical results demonstrate that OrMo can achieve better convergence performance compared with ASGD.
arXiv Detail & Related papers (2024-07-27T11:35:19Z) - Stragglers-Aware Low-Latency Synchronous Federated Learning via Layer-Wise Model Updates [71.81037644563217]
Synchronous federated learning (FL) is a popular paradigm for collaborative edge learning.
As some of the devices may have limited computational resources and varying availability, FL latency is highly sensitive to stragglers.
We propose straggler-aware layer-wise federated learning (SALF) that leverages the optimization procedure of NNs via backpropagation to update the global model in a layer-wise fashion.
arXiv Detail & Related papers (2024-03-27T09:14:36Z) - Asynchronous Local-SGD Training for Language Modeling [37.02427878640653]
Local gradient descent (Local-SGD) is an approach to distributed optimization where each device performs more than one SGD update per communication.
This work presents an empirical study of it asynchronous Local-SGD for training language models; that is, each worker updates the global parameters as soon as it has finished its SGD steps.
arXiv Detail & Related papers (2024-01-17T11:17:04Z) - Efficient and Light-Weight Federated Learning via Asynchronous
Distributed Dropout [22.584080337157168]
Asynchronous learning protocols have regained attention lately, especially in the Federated Learning (FL) setup.
We propose textttAsyncDrop, a novel asynchronous FL framework that utilizes dropout regularization to handle device heterogeneity in distributed settings.
Overall, textttAsyncDrop achieves better performance compared to state of the art asynchronous methodologies.
arXiv Detail & Related papers (2022-10-28T13:00:29Z) - Locally Asynchronous Stochastic Gradient Descent for Decentralised Deep
Learning [0.0]
Local Asynchronous SGD (LASGD) is an asynchronous decentralized algorithm that relies on All Reduce for model synchronization.
We empirically validate LASGD's performance on image classification tasks on the ImageNet dataset.
arXiv Detail & Related papers (2022-03-24T14:25:15Z) - Asynchronous Parallel Incremental Block-Coordinate Descent for
Decentralized Machine Learning [55.198301429316125]
Machine learning (ML) is a key technique for big-data-driven modelling and analysis of massive Internet of Things (IoT) based intelligent and ubiquitous computing.
For fast-increasing applications and data amounts, distributed learning is a promising emerging paradigm since it is often impractical or inefficient to share/aggregate data.
This paper studies the problem of training an ML model over decentralized systems, where data are distributed over many user devices.
arXiv Detail & Related papers (2022-02-07T15:04:15Z) - Double Momentum SGD for Federated Learning [94.58442574293021]
We propose a new SGD variant named as DOMO to improve the model performance in federated learning.
One momentum buffer tracks the server update direction, while the other tracks the local update direction.
We introduce a novel server momentum fusion technique to coordinate the server and local momentum SGD.
arXiv Detail & Related papers (2021-02-08T02:47:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.