Related papers: Accelerate Distributed Stochastic Descent for Nonconvex Optimization with Momentum

URL: http://arxiv.org/abs/2110.00625v1
Date: Fri, 1 Oct 2021 19:23:18 GMT
Title: Accelerate Distributed Stochastic Descent for Nonconvex Optimization with Momentum
Authors: Guojing Cong and Tianyi Liu
Abstract summary: We propose a momentum method for such model averaging approaches. We analyze the convergence and scaling properties of such momentum methods. Our experimental results show that block momentum not only accelerates training, but also achieves better results.
Score: 12.324457683544132
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The momentum method has been used extensively in optimizers for deep learning. Recent studies show that distributed training through K-step averaging has many nice properties. We propose a momentum method for such model averaging approaches. At the individual learner level, traditional stochastic gradient descent is applied. At the meta-level (global learner level), one momentum term is applied, which we call block momentum. We analyze the convergence and scaling properties of such momentum methods. Our experimental results show that block momentum not only accelerates training, but also achieves better results.
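
As a rough illustration of the idea in the abstract, the following is a minimal sketch of K-step averaging with a momentum term applied at the global level. The exact update rule from the paper is not reproduced in this listing, so the heavy-ball form of the global step is an assumption, and the function name `block_momentum_kavg`, its parameters, and the toy quadratic objective are all hypothetical choices made for the example.

```python
import numpy as np

def block_momentum_kavg(grad_fn, w0, n_learners=4, k_steps=8, n_rounds=50,
                        lr=0.05, block_mu=0.5, noise=0.1, seed=0):
    """Toy simulation of K-step averaging with a global momentum term.

    Assumption: the global update is a heavy-ball step on the averaged
    block update. This is an illustrative guess, not the paper's exact rule.
    """
    rng = np.random.default_rng(seed)
    w_global = np.asarray(w0, dtype=float).copy()
    velocity = np.zeros_like(w_global)  # global (block) momentum buffer

    for _ in range(n_rounds):
        local_models = []
        for _ in range(n_learners):
            # Each learner starts from the current global model.
            w = w_global.copy()
            for _ in range(k_steps):
                # Plain SGD locally; Gaussian noise stands in for
                # minibatch gradient noise.
                g = grad_fn(w) + noise * rng.standard_normal(w.shape)
                w -= lr * g
            local_models.append(w)

        # Model averaging after K local steps.
        averaged = np.mean(local_models, axis=0)

        # Treat the averaged displacement as one "block" update and
        # apply momentum to it at the global level.
        block_update = averaged - w_global
        velocity = block_mu * velocity + block_update
        w_global = w_global + velocity

    return w_global

# Toy objective f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w_final = block_momentum_kavg(lambda w: w, [5.0, -3.0])
print("distance from optimum:", np.linalg.norm(w_final))
```

Setting `block_mu=0.0` in this sketch reduces it to plain K-step averaging, which gives a simple baseline for comparing the effect of the global momentum term on a given problem.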

Related papers

Overshoot: Taking advantage of future gradients in momentum-based stochastic optimization [1.4303041760959478]
Overshoot is a momentum-based descent optimization method designed to enhance performance beyond standard and Nesterov's momentum. Overshoot consistently outperforms both standard and Nesterov's momentum across a wide range of tasks.
arXiv Detail & Related papers (2025-01-16T14:18:10Z)
On the Performance Analysis of Momentum Method: A Frequency Domain Perspective [9.566967680633615]
We present a frequency domain analysis framework that interprets the momentum method as a time-variant filter. Our experiments support this perspective and provide a deeper understanding of the mechanism involved.
arXiv Detail & Related papers (2024-11-29T12:56:43Z)
Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems. In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version. We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
The Marginal Value of Momentum for Small Learning Rate SGD [20.606430391298815]
Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without gradient noise. Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training where the optimal learning rate is not very large.
arXiv Detail & Related papers (2023-07-27T21:01:26Z)
Guaranteed Conservation of Momentum for Learning Particle-based Fluid Dynamics [96.9177297872723]
We present a novel method for guaranteeing linear momentum in learned physics simulations. We enforce conservation of momentum with a hard constraint, which we realize via antisymmetrical continuous convolutional layers. In combination, the proposed method allows us to increase the physical accuracy of the learned simulator substantially.
arXiv Detail & Related papers (2022-10-12T09:12:59Z)
Last-iterate convergence analysis of stochastic momentum methods for neural networks [3.57214198937538]
The momentum method is used to solve large-scale optimization problems in neural networks. Current convergence results of momentum methods under artificial settings. The momentum factors can be fixed to be constant, rather than in existing time.
arXiv Detail & Related papers (2022-05-30T02:17:44Z)
A Discrete Variational Derivation of Accelerated Methods in Optimization [68.8204255655161]
We introduce variational integrators which allow us to derive different methods for optimization. We derive two families of optimization methods in one-to-one correspondence. The preservation of symplecticity of autonomous systems occurs here solely on the fibers.
arXiv Detail & Related papers (2021-06-04T20:21:53Z)
SMG: A Shuffling Gradient-Based Method with Momentum [25.389545522794172]
We combine two advanced ideas widely used in optimization for machine learning. We develop a novel shuffling-based momentum technique. Our tests have shown encouraging performance of the new algorithms.
arXiv Detail & Related papers (2020-11-24T04:12:35Z)
Training Generative Adversarial Networks by Solving Ordinary Differential Equations [54.23691425062034]
We study the continuous-time dynamics induced by GAN training. From this perspective, we hypothesise that instabilities in training GANs arise from the integration error. We experimentally verify that well-known ODE solvers (such as Runge-Kutta) can stabilise training.
arXiv Detail & Related papers (2020-10-28T15:23:49Z)
Momentum-Based Policy Gradient Methods [133.53164856723782]
We propose a class of efficient momentum-based policy gradient methods for model-free reinforcement learning. In particular, we present a non-adaptive version of the IS-MBPG method, which also reaches the best known sample complexity of $O(\epsilon^{-3})$ without any large batches.
arXiv Detail & Related papers (2020-07-13T20:44:15Z)
A New Accelerated Stochastic Gradient Method with Momentum [4.967897656554012]
Stochastic gradient descent with momentum (SGDM) uses weights that decay exponentially with the iteration count to generate a momentum term. We provide a theoretical convergence analysis of our method, which shows that both the exponentially decaying weights and our inverse-proportionally decaying weights can limit the variance of the moving direction of the parameters being optimized to a region.
arXiv Detail & Related papers (2020-05-31T03:04:32Z)
Interpolation Technique to Speed Up Gradients Propagation in Neural ODEs [71.26657499537366]
We propose a simple interpolation-based method for the efficient approximation of gradients in neural ODE models. We compare it with the reverse dynamic method to train neural ODEs on classification, density estimation, and inference approximation tasks.
arXiv Detail & Related papers (2020-03-11T13:15:57Z)
Average-case Acceleration Through Spectral Density Estimation [35.01931431231649]
We develop a framework for the average-case analysis of random quadratic problems. We derive algorithms that are optimal under this analysis. We develop explicit algorithms for the uniform, Marchenko-Pastur, and exponential distributions.
arXiv Detail & Related papers (2020-02-12T01:44:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.