MT-DAO: Multi-Timescale Distributed Adaptive Optimizers with Local Updates
- URL: http://arxiv.org/abs/2510.05361v1
- Date: Mon, 06 Oct 2025 20:37:57 GMT
- Title: MT-DAO: Multi-Timescale Distributed Adaptive Optimizers with Local Updates
- Authors: Alex Iacob, Andrej Jovanovic, Mher Safaryan, Meghdad Kurmanji, Lorenzo Sani, Samuel Horváth, William F. Shen, Xinchi Qiu, Nicholas D. Lane,
- Abstract summary: Training large models with distributed data parallelism requires frequent communication of gradients across workers. Infrequent communication strategies (e.g., Local SGD) reduce this overhead but often suffer a performance gap relative to fully synchronous DDP. We propose MT-DAO, a family of optimizers that employs multiple slow- and fast-moving first momenta or the gradient to track update dynamics across different time scales.
- Score: 24.81282608003312
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training large models with distributed data parallelism (DDP) requires frequent communication of gradients across workers, which can saturate bandwidth. Infrequent communication strategies (e.g., Local SGD) reduce this overhead but, when applied to adaptive optimizers, often suffer a performance gap relative to fully synchronous DDP. We trace this gap to a time-scale mismatch: the optimizer's fast-moving momentum, tuned for frequent updates, decays too quickly to smooth gradients over long intervals, leading to noise-dominated optimization. To address this, we propose MT-DAO, a family of optimizers that employs multiple slow- and fast-moving first momenta or the gradient to track update dynamics across different time scales, for which we provide the first convergence guarantees. Empirically, for language-model pre-training, this eliminates the performance gap with DDP, outperforming infrequent-communication baselines in perplexity and reducing iso-token wall-clock time by 6-27% on Ethernet interconnects. At the 720M scale, MT-DAO reaches a target perplexity in 24% fewer steps and 35% less time than the single-momentum DDP baseline. MT-DAO enables effective cross-datacenter training and training over wide geographic areas.
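To make the time-scale mismatch concrete: a first momentum with coefficient beta = 0.9 effectively averages over roughly 1/(1 - beta) = 10 gradients, so if workers take, say, H = 64 local steps between synchronizations, anything contributed in the previous communication round has decayed by about 0.9^64 ≈ 0.001 and the momentum is dominated by recent, noisy local gradients. The sketch below is a minimal NumPy illustration of the general idea of tracking a fast and a slow first momentum inside a local-update (Local SGD-style) loop with infrequent parameter averaging. The function name, hyperparameters (beta_fast, beta_slow, mix, local_steps), and the way the two momenta are combined here are assumptions for illustration, not the actual MT-DAO update rule or its convergence-guaranteed form.

```python
# Illustrative sketch only (NumPy): the exact MT-DAO update, hyperparameters,
# and the combination of momenta are assumptions, not the paper's algorithm.
import numpy as np

def multi_timescale_momentum_local_sgd(
    grads_fn, x0, workers=4, outer_rounds=10, local_steps=64,
    lr=0.1, beta_fast=0.9, beta_slow=0.99, mix=0.5,
):
    """Local-update training with two first momenta on different time scales.

    grads_fn(worker_id, x) returns a stochastic gradient for that worker.
    The fast momentum (beta_fast) reacts within a few steps; the slow momentum
    (beta_slow) averages over roughly 1/(1 - beta_slow) steps and can therefore
    span the local_steps interval between synchronizations.
    """
    # Per-worker copies of parameters and momenta.
    xs = [x0.copy() for _ in range(workers)]
    m_fast = [np.zeros_like(x0) for _ in range(workers)]
    m_slow = [np.zeros_like(x0) for _ in range(workers)]

    for _ in range(outer_rounds):
        for w in range(workers):
            for _ in range(local_steps):
                g = grads_fn(w, xs[w])
                m_fast[w] = beta_fast * m_fast[w] + (1 - beta_fast) * g
                m_slow[w] = beta_slow * m_slow[w] + (1 - beta_slow) * g
                # Combine the two time scales into a single update direction.
                update = mix * m_fast[w] + (1 - mix) * m_slow[w]
                xs[w] -= lr * update
        # Infrequent communication: average parameters across workers.
        x_avg = np.mean(xs, axis=0)
        xs = [x_avg.copy() for _ in range(workers)]
    return x_avg


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = np.ones(8)

    def noisy_quadratic_grad(worker_id, x):
        # Toy objective 0.5 * ||x - target||^2 with worker-dependent noise.
        return (x - target) + 0.05 * rng.standard_normal(x.shape)

    x_final = multi_timescale_momentum_local_sgd(noisy_quadratic_grad, np.zeros(8))
    print("distance to optimum:", np.linalg.norm(x_final - target))
```

In this toy setting the slow momentum keeps a memory that outlives the synchronization interval, which is the intuition the abstract gives for closing the gap with DDP; whether the momenta should be kept per worker or synchronized is a design choice not specified here.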
Related papers
- TS-DP: Reinforcement Speculative Decoding For Temporal Adaptive Diffusion Policy Acceleration [64.32072516882947]
Diffusion Policy excels in embodied control but suffers from high inference latency and computational cost. We propose Temporal-aware Reinforcement-based Speculative Diffusion Policy (TS-DP). TS-DP achieves up to 4.17 times faster inference with over 94% accepted drafts, reaching an inference frequency of 25 Hz.
arXiv Detail & Related papers (2025-12-13T07:53:14Z) - CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks [57.95170323315603]
We introduce CollaPipe, a distributed learning framework that integrates collaborative pipeline parallelism with federated aggregation to support self-evolving networks. In CollaPipe, the encoder part is adaptively partitioned into variable-sized segments and deployed across mobile devices for pipeline-parallel training, while the decoder is deployed on edge servers to handle generative tasks. To enhance training efficiency, we formulate a joint optimization problem that adaptively allocates model segments, micro-batches, bandwidth, and transmission power.
arXiv Detail & Related papers (2025-09-24T07:54:01Z) - Pseudo-Asynchronous Local SGD: Robust and Efficient Data-Parallel Training [18.103954515791155]
We propose a method called Pseudo-Asynchronous Local SGD (PALSGD) to improve the efficiency of data-parallel training. PALSGD extends Local SGD and DiLoCo and introduces a pseudo-synchronization mechanism. Our results show that PALSGD achieves better performance in less time compared to existing methods.
arXiv Detail & Related papers (2025-04-25T16:06:08Z) - Dion: Distributed Orthonormalized Updates [27.66769374729482]
We introduce Dion (Distributed Orthonormalization), a scalable and efficient update rule. It replaces Newton-Schulz iteration with amortized power iteration on a momentum buffer. The rank-fraction parameter with error feedback enables low-rank updates that balance quality with significant cost savings.
arXiv Detail & Related papers (2025-04-07T17:49:37Z) - Rack Position Optimization in Large-Scale Heterogeneous Data Centers [38.59029729507364]
This paper presents a novel two-tier optimization framework that uses a high-level deep reinforcement learning (DRL) model to guide a low-level gradient-based local search. The high-level DRL agent employs Leader Reward for optimal rack type ordering, and the low-level search efficiently maps rack types to positions, minimizing movement counts and ensuring fault-tolerant resource distribution. The algorithm consistently delivered stable, efficient results, an essential feature for large-scale data center management.
arXiv Detail & Related papers (2025-03-31T22:55:37Z) - Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates [1.9241821314180372]
Asynchronous stochastic gradient descent (ASGD) methods can improve training speed, but are sensitive to delays due to both communication and throughput differences. PD-ASGD uses separate threads for the forward and backward passes, decoupling the updates and allowing for a higher ratio of forward to backward threads. Our approach yields close to state-of-the-art results while running up to $5.95\times$ faster than synchronous data parallelism in the presence of delays.
arXiv Detail & Related papers (2024-10-08T12:32:36Z) - Straggler-Resilient Decentralized Learning via Adaptive Asynchronous Updates [28.813671194939225]
Fully decentralized optimization methods have been advocated as alternatives to the popular parameter server framework.
We propose a fully decentralized algorithm with adaptive asynchronous updates via adaptively determining the number of neighbor workers for each worker to communicate with.
We show that DSGD-AAU achieves a linear speedup for convergence and demonstrate its effectiveness via extensive experiments.
arXiv Detail & Related papers (2023-06-11T02:08:59Z) - Efficient Parallel Split Learning over Resource-constrained Wireless Edge Networks [44.37047471448793]
In this paper, we advocate the integration of the edge computing paradigm and parallel split learning (PSL).
We propose an innovative PSL framework, namely, efficient parallel split learning (EPSL) to accelerate model training.
We show that the proposed EPSL framework significantly decreases the training latency needed to achieve a target accuracy.
arXiv Detail & Related papers (2023-03-26T16:09:48Z) - Dynamic Network-Assisted D2D-Aided Coded Distributed Learning [59.29409589861241]
We propose a novel device-to-device (D2D)-aided coded federated learning method (D2D-CFL) for load balancing across devices.
We derive an optimal compression rate for achieving minimum processing time and establish its connection with the convergence time.
Our proposed method is beneficial for real-time collaborative applications, where the users continuously generate training data.
arXiv Detail & Related papers (2021-11-26T18:44:59Z) - Training Recommender Systems at Scale: Communication-Efficient Model and
Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least $100\times$ and $20\times$ during data parallelism (DP) and model parallelism (MP), respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
arXiv Detail & Related papers (2020-10-18T01:44:42Z) - DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging [4.652668321425679]
The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagation while gradients are aggregated and the model is updated.
DaSGD parallelizes SGD and forward/back propagations to hide 100% of the communication overhead.
arXiv Detail & Related papers (2020-05-31T05:43:50Z) - Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study distributed algorithms for large-scale AUC maximization with a deep neural network as the predictive model.
Our method requires far fewer communication rounds than naive parallelization while retaining theoretical convergence guarantees.
Our experiments on several datasets demonstrate the effectiveness of the method and confirm the theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z) - Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging [48.99717153937717]
We present WAGMA-SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange. We train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale. Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput.
arXiv Detail & Related papers (2020-04-30T22:11:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.