Periodic Stochastic Gradient Descent with Momentum for Decentralized
Training
- URL: http://arxiv.org/abs/2008.10435v1
- Date: Mon, 24 Aug 2020 13:38:22 GMT
- Title: Periodic Stochastic Gradient Descent with Momentum for Decentralized
Training
- Authors: Hongchang Gao, Heng Huang
- Abstract summary: We propose a novel periodic decentralized momentum SGD method, which employs the momentum scheme and periodic communication for decentralized training.
We conduct extensive experiments to verify the performance of our two proposed methods, and both show superior performance over existing methods.
- Score: 114.36410688552579
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Decentralized training has been actively studied in recent years. Although a
wide variety of methods have been proposed, the decentralized momentum SGD
method remains underexplored. In this paper, we propose a novel periodic
decentralized momentum SGD method, which employs the momentum scheme and
periodic communication for decentralized training. These two strategies,
together with the topology of the decentralized training system, make the
theoretical convergence analysis of our method challenging. We address this
problem and provide the condition under which our method achieves linear
speedup with respect to the number of workers. Furthermore, we introduce a
communication-efficient variant that reduces the communication cost of each
communication round, and we also provide the condition under which this
variant achieves linear speedup. To the best of our knowledge, both methods
are the first to achieve these theoretical results in their respective
settings. We conduct extensive experiments to verify the performance of our
two proposed methods, and both show superior performance over existing
methods.
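To make the two ingredients of the method concrete, the sketch below shows one plausible form of periodic decentralized momentum SGD: every worker runs local momentum SGD steps on its own data, and every `period` iterations the workers average their models with their neighbors through a doubly stochastic mixing matrix `W` that encodes the topology. The function names, default hyperparameters, ring topology, and the choice to also mix the momentum buffers are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def periodic_decentralized_momentum_sgd(grad_fn, x0, W, lr=0.1, beta=0.9,
                                        period=4, num_iters=100):
    """Illustrative sketch of periodic decentralized momentum SGD.

    grad_fn(x_i, i) should return a stochastic gradient computed on worker i's
    local data at parameters x_i; W is an n x n doubly stochastic mixing matrix
    encoding the communication topology (W[i, j] > 0 iff i and j are neighbors).
    """
    n = W.shape[0]
    x = np.tile(x0, (n, 1))      # each worker holds its own copy of the parameters
    u = np.zeros_like(x)         # per-worker momentum buffers

    for t in range(1, num_iters + 1):
        # Local momentum SGD step on every worker (run in parallel in practice).
        for i in range(n):
            g = grad_fn(x[i], i)             # stochastic gradient from local data
            u[i] = beta * u[i] + g           # heavy-ball momentum buffer
            x[i] = x[i] - lr * u[i]

        # Periodic communication: every `period` iterations, gossip-average
        # with neighbors according to the topology W.
        if t % period == 0:
            x = W @ x                        # mix parameters over the topology
            u = W @ u                        # (assumption) also mix momentum buffers

    return x.mean(axis=0)                    # report the averaged model


# Example: 4 workers on a ring topology, worker-local least-squares objectives.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(size=(4, 8, 5))           # worker-local data matrices
    b = rng.normal(size=(4, 8))

    def grad_fn(x, i):
        rows = rng.choice(8, size=2, replace=False)   # mini-batch of rows
        Ai, bi = A[i, rows], b[i, rows]
        return Ai.T @ (Ai @ x - bi) / len(rows)

    ring = np.array([[0.5, 0.25, 0.0, 0.25],
                     [0.25, 0.5, 0.25, 0.0],
                     [0.0, 0.25, 0.5, 0.25],
                     [0.25, 0.0, 0.25, 0.5]])
    x_hat = periodic_decentralized_momentum_sgd(grad_fn, np.zeros(5), ring,
                                                lr=0.05, period=4, num_iters=200)
    print(x_hat)
```

Setting `period=1` recovers per-iteration gossip averaging, while a larger `period` trades communication for extra local computation; the paper's analysis gives the condition under which this still achieves linear speedup, i.e., the iteration complexity to reach a target accuracy shrinks proportionally to the number of workers.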
Related papers
- From promise to practice: realizing high-performance decentralized training [8.955918346078935]
Decentralized training of deep neural networks has attracted significant attention for its theoretically superior scalability over synchronous data-parallel methods like All-Reduce.
This paper identifies three key factors that can lead to speedups over All-Reduce training and constructs a runtime model to determine when, how, and to what degree decentralization can yield shorter per-iteration runtimes.
arXiv Detail & Related papers (2024-10-15T19:04:56Z)
- Multi-Agent Reinforcement Learning from Human Feedback: Data Coverage and Algorithmic Techniques [65.55451717632317]
We study Multi-Agent Reinforcement Learning from Human Feedback (MARLHF), exploring both theoretical foundations and empirical validations.
We define the task as identifying a Nash equilibrium from a preference-only offline dataset in general-sum games.
Our findings underscore the multifaceted approach required for MARLHF, paving the way for effective preference-based multi-agent systems.
arXiv Detail & Related papers (2024-09-01T13:14:41Z)
- Local Methods with Adaptivity via Scaling [38.99428012275441]
This paper aims to merge the local training technique with the adaptive approach to develop efficient distributed learning methods.
We consider the classical Local SGD method and enhance it with a scaling feature.
In addition to theoretical analysis, we validate the performance of our methods in practice by training a neural network.
arXiv Detail & Related papers (2024-06-02T19:50:05Z)
- Scalable Optimal Margin Distribution Machine [50.281535710689795]
Optimal margin Distribution Machine (ODM) is a newly proposed statistical learning framework rooted in the novel margin theory.
This paper proposes a scalable ODM, which can achieve nearly ten times speedup compared to the original ODM training method.
arXiv Detail & Related papers (2023-05-08T16:34:04Z)
- Guaranteed Conservation of Momentum for Learning Particle-based Fluid Dynamics [96.9177297872723]
We present a novel method for guaranteeing conservation of linear momentum in learned physics simulations.
We enforce conservation of momentum with a hard constraint, which we realize via antisymmetrical continuous convolutional layers.
In combination, the proposed method allows us to increase the physical accuracy of the learned simulator substantially.
arXiv Detail & Related papers (2022-10-12T09:12:59Z)
- Fast and Robust Sparsity Learning over Networks: A Decentralized Surrogate Median Regression Approach [10.850336820582678]
We propose a decentralized surrogate median regression (deSMR) method for efficiently solving the decentralized sparsity learning problem.
Our proposed algorithm enjoys a linear convergence rate with a simple implementation.
We also establish theoretical results for sparse support recovery.
arXiv Detail & Related papers (2022-02-11T08:16:01Z)
- A general framework for decentralized optimization with first-order methods [11.50057411285458]
Decentralized optimization to minimize a finite sum of functions over a network of nodes has been a significant focus in control and signal processing research.
The emergence of sophisticated computing and large-scale data science needs have led to a resurgence of activity in this area.
We discuss decentralized first-order gradient methods, which have found tremendous success in control, signal processing, and machine learning problems.
arXiv Detail & Related papers (2020-09-12T17:52:10Z)
- Adaptive Serverless Learning [114.36410688552579]
We propose a novel adaptive decentralized training approach, which can compute the learning rate from data dynamically.
Our theoretical results reveal that the proposed algorithm can achieve linear speedup with respect to the number of workers.
To reduce the communication overhead, we further propose a communication-efficient adaptive decentralized training approach.
arXiv Detail & Related papers (2020-08-24T13:23:02Z)
- Step-Ahead Error Feedback for Distributed Training with Compressed Gradient [99.42912552638168]
We show that a new "gradient mismatch" problem arises from the local error feedback in centralized distributed training.
We propose two novel techniques, 1) step ahead and 2) error averaging, with rigorous theoretical analysis.
arXiv Detail & Related papers (2020-08-13T11:21:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.