Ordered Momentum for Asynchronous SGD
- URL: http://arxiv.org/abs/2407.19234v3
- Date: Thu, 23 Jan 2025 17:04:01 GMT
- Title: Ordered Momentum for Asynchronous SGD
- Authors: Chang-Wei Shi, Yi-Rui Yang, Wu-Jun Li
- Abstract summary: We propose a novel method called ordered momentum (OrMo) for ASGD.
In OrMo, momentum is incorporated into ASGD by organizing the gradients in order based on their iteration indexes.
Empirical results demonstrate that OrMo can achieve better convergence performance compared with ASGD.
- Score: 12.810976838406193
- Abstract: Distributed learning is essential for training large-scale deep models. Asynchronous SGD (ASGD) and its variants are commonly used distributed learning methods, particularly in scenarios where the computing capabilities of workers in the cluster are heterogeneous. Momentum has been acknowledged for its benefits in both optimization and generalization in deep model training. However, existing works have found that naively incorporating momentum into ASGD can impede the convergence. In this paper, we propose a novel method called ordered momentum (OrMo) for ASGD. In OrMo, momentum is incorporated into ASGD by organizing the gradients in order based on their iteration indexes. We theoretically prove the convergence of OrMo with both constant and delay-adaptive learning rates for non-convex problems. To the best of our knowledge, this is the first work to establish the convergence analysis of ASGD with momentum without dependence on the maximum delay. Empirical results demonstrate that OrMo can achieve better convergence performance compared with ASGD and other asynchronous methods with momentum.
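To make the ordering idea concrete, the toy simulation below runs momentum SGD where gradients arrive out of order (as in ASGD with heterogeneous workers) but are folded into the momentum buffer strictly by iteration index. It is a minimal sketch under simplifying assumptions (a blocking reorder buffer, a synthetic quadratic objective, random integer delays, arbitrary hyperparameters); it is not the OrMo update rule from the paper, which incorporates stale gradients without waiting and carries the convergence guarantees stated above.

```python
# Hypothetical illustration: momentum accumulated in iteration-index order.
import numpy as np

rng = np.random.default_rng(0)
d, lr, beta = 5, 0.01, 0.9
A = rng.normal(size=(d, d))
A = A.T @ A / d + np.eye(d)      # simple quadratic objective f(w) = 0.5 * w^T A w
w = rng.normal(size=d)           # "server" parameters
m = np.zeros(d)                  # momentum buffer kept on the server
pending = []                     # (arrival_time, index, gradient) of in-flight gradients
buffer = {}                      # arrived gradients waiting for their turn, keyed by index
next_idx = 0                     # smallest index not yet folded into the momentum

for t in range(400):
    # A worker reads the current parameters at iteration index t and starts a gradient;
    # heterogeneous compute speeds are modeled by a random delay before it arrives.
    pending.append((t + rng.integers(1, 6), t, A @ w))

    # Deliver every gradient whose delay has elapsed (arrival order != index order).
    delivered = [p for p in pending if p[0] <= t]
    pending = [p for p in pending if p[0] > t]
    for _, idx, g in delivered:
        buffer[idx] = g

    # Ordering step: fold gradients into the momentum strictly by iteration index,
    # so the momentum sequence matches the order in which gradients were started,
    # not the order in which they happened to arrive.
    while next_idx in buffer:
        m = beta * m + buffer.pop(next_idx)
        w = w - lr * m
        next_idx += 1

print("final loss:", 0.5 * w @ A @ w)
```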
Related papers
- Ringmaster ASGD: The First Asynchronous SGD with Optimal Time Complexity [92.1840862558718]
Ringmaster ASGD achieves optimal time complexity under arbitrarily heterogeneous computation times.
This makes it the first Asynchronous SGD method to meet the theoretical lower bounds for time complexity in such scenarios.
arXiv Detail & Related papers (2025-01-27T16:07:26Z) - Torque-Aware Momentum [28.72697062644016]
We propose Torque-Aware Momentum (TAM), which introduces a damping factor based on the angle between the new gradients and previous momentum, stabilizing the update direction during training.
Empirical results show that TAM, which can be combined with both SGD and Adam, enhances exploration, handles distribution shifts more effectively, and improves generalization performance across various tasks.
arXiv Detail & Related papers (2024-12-25T05:58:07Z) - Stragglers-Aware Low-Latency Synchronous Federated Learning via Layer-Wise Model Updates [71.81037644563217]
Synchronous federated learning (FL) is a popular paradigm for collaborative edge learning.
As some of the devices may have limited computational resources and varying availability, FL latency is highly sensitive to stragglers.
We propose straggler-aware layer-wise federated learning (SALF) that leverages the optimization procedure of NNs via backpropagation to update the global model in a layer-wise fashion.
arXiv Detail & Related papers (2024-03-27T09:14:36Z) - ZO-AdaMU Optimizer: Adapting Perturbation by the Momentum and Uncertainty in Zeroth-order Optimization [18.02643194439027]
This study proposes ZO-AdaMU to adapt the simulated perturbation with momentum in its approximation.
Our convergence analysis and experiments show that this is a better way to improve convergence stability and rate in ZO-SGD.
arXiv Detail & Related papers (2023-12-23T07:46:31Z) - Fast Diffusion Model [122.36693015093041]
Diffusion models (DMs) have been adopted across diverse fields with their abilities in capturing intricate data distributions.
In this paper, we propose a Fast Diffusion Model (FDM) to significantly speed up DMs from a DM optimization perspective.
arXiv Detail & Related papers (2023-06-12T09:38:04Z) - Guaranteed Conservation of Momentum for Learning Particle-based Fluid Dynamics [96.9177297872723]
We present a novel method for guaranteeing conservation of linear momentum in learned physics simulations.
We enforce conservation of momentum with a hard constraint, which we realize via antisymmetrical continuous convolutional layers.
In combination, the proposed method allows us to increase the physical accuracy of the learned simulator substantially.
arXiv Detail & Related papers (2022-10-12T09:12:59Z) - Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization [89.7882166459412]
Stochastic gradient noise (SGN) acts as implicit regularization for deep learning.
Some works attempted to artificially simulate SGN by injecting random noise to improve deep learning.
For simulating SGN at low computational costs and without changing the learning rate or batch size, we propose the Positive-Negative Momentum (PNM) approach.
arXiv Detail & Related papers (2021-03-31T16:08:06Z) - Training Generative Adversarial Networks by Solving Ordinary Differential Equations [54.23691425062034]
We study the continuous-time dynamics induced by GAN training.
From this perspective, we hypothesise that instabilities in training GANs arise from the integration error.
We experimentally verify that well-known ODE solvers (such as Runge-Kutta) can stabilise training.
arXiv Detail & Related papers (2020-10-28T15:23:49Z) - Momentum via Primal Averaging: Theoretical Insights and Learning Rate Schedules for Non-Convex Optimization [10.660480034605241]
Momentum methods are now used pervasively within the machine learning community for training non-convex models such as deep neural networks.
In this work we develop a Lyapunov analysis of SGD with momentum (SGD+M) by utilizing an equivalent rewriting of the method known as the stochastic primal averaging (SPA) form; a constant-parameter version of this rewriting is sketched after this list.
arXiv Detail & Related papers (2020-10-01T13:46:32Z) - Joint Stochastic Approximation and Its Application to Learning Discrete Latent Variable Models [19.07718284287928]
We show that the difficulty of obtaining reliable gradients for the inference model and the drawback of indirectly optimizing the target log-likelihood can be gracefully addressed.
We propose to directly maximize the target log-likelihood and simultaneously minimize the inclusive divergence between the posterior and the inference model.
The resulting learning algorithm is called joint SA (JSA).
arXiv Detail & Related papers (2020-05-28T13:50:08Z) - Elastic Consistency: A General Consistency Model for Distributed Stochastic Gradient Descent [28.006781039853575]
A key element behind the progress of machine learning in recent years has been the ability to train machine learning models in large-scale distributed-memory environments.
In this paper, we introduce a general consistency model covering the distributed SGD variants used in practice to train large-scale machine learning models.
Our framework, called elastic consistency, enables us to derive convergence bounds for a variety of distributed SGD methods.
arXiv Detail & Related papers (2020-01-16T16:10:58Z)
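For the primal-averaging entry above, the rewriting in question is compact enough to spell out. The block below is a constant-parameter sketch of the stochastic primal averaging (SPA) form of SGD with momentum; the parameter correspondence is a routine equivalence included here only for orientation (the learning-rate schedules studied in that paper correspond to time-varying versions of these quantities).

```latex
% SGD with momentum (heavy-ball form), with g_t the stochastic gradient at x_t:
%   m_t = \beta m_{t-1} + g_t, \qquad x_{t+1} = x_t - \alpha m_t.
% Stochastic primal averaging (SPA) form of the same iteration:
\begin{aligned}
  z_{t+1} &= z_t - \eta\, g_t, \\
  x_{t+1} &= (1 - c)\, x_t + c\, z_{t+1},
\end{aligned}
\qquad \text{with } c = 1 - \beta, \quad \eta = \frac{\alpha}{1 - \beta}.
% Eliminating z recovers x_{t+1} - x_t = \beta\,(x_t - x_{t-1}) - \alpha\, g_t.
```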