Scaling up Stochastic Gradient Descent for Non-convex Optimisation
- URL: http://arxiv.org/abs/2210.02882v1
- Date: Thu, 6 Oct 2022 13:06:08 GMT
- Title: Scaling up Stochastic Gradient Descent for Non-convex Optimisation
- Authors: Saad Mohamad and Hamad Alamri and Abdelhamid Bouchachia
- Abstract summary: We propose a novel approach to the problem of shared parallel computation.
By combining two strategies into a unified framework, DPSGD is a better trade computation framework.
The potential gains can be achieved by DPSGD on a deep learning (DRL) problem (Latent Diletrichal inference) and on a deep learning (DRL) problem (advantage actor - A2C)
- Score: 5.908471365011942
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Stochastic gradient descent (SGD) is a widely adopted iterative method for
optimizing differentiable objective functions. In this paper, we propose and
discuss a novel approach to scale up SGD in applications involving non-convex
functions and large datasets. We address the bottleneck problem arising when
using both shared and distributed memory. Typically, the former is bounded by
limited computation resources and bandwidth whereas the latter suffers from
communication overheads. We propose a unified distributed and parallel
implementation of SGD (named DPSGD) that relies on both asynchronous
distribution and lock-free parallelism. By combining two strategies into a
unified framework, DPSGD is able to strike a better trade-off between local
computation and communication. The convergence properties of DPSGD are studied
for non-convex problems such as those arising in statistical modelling and
machine learning. Our theoretical analysis shows that DPSGD leads to speed-up
with respect to the number of cores and number of workers while guaranteeing an
asymptotic convergence rate of $O(1/\sqrt{T})$ given that the number of cores
is bounded by $T^{1/4}$ and the number of workers is bounded by $T^{1/2}$ where
$T$ is the number of iterations. The potential gains that can be achieved by
DPSGD are demonstrated empirically on a stochastic variational inference
problem (Latent Dirichlet Allocation) and on a deep reinforcement learning
(DRL) problem (advantage actor critic - A2C) resulting in two algorithms: DPSVI
and HSA2C. Empirical results validate our theoretical findings. Comparative
studies are conducted to show the performance of the proposed DPSGD against the
state-of-the-art DRL algorithms.
Related papers
- $O(d/T)$ Convergence Theory for Diffusion Probabilistic Models under Minimal Assumptions [6.76974373198208]
We establish a fast convergence theory for a popular SDE-based sampler under minimal assumptions.
Our analysis shows that, provided $ell_2$-accurate estimates of the score functions, the total variation distance between the target and generated distributions is upper bounded by $O(d/T)$.
This is achieved through a novel set of analytical tools that provides a fine-grained characterization of how the error propagates at each step of the reverse process.
arXiv Detail & Related papers (2024-09-27T17:59:10Z) - Distributed Stochastic Gradient Descent with Staleness: A Stochastic Delay Differential Equation Based Framework [56.82432591933544]
Distributed gradient descent (SGD) has attracted considerable recent attention due to its potential for scaling computational resources, reducing training time, and helping protect user privacy in machine learning.
This paper presents the run time and staleness of distributed SGD based on delay differential equations (SDDEs) and the approximation of gradient arrivals.
It is interestingly shown that increasing the number of activated workers does not necessarily accelerate distributed SGD due to staleness.
arXiv Detail & Related papers (2024-06-17T02:56:55Z) - AsGrad: A Sharp Unified Analysis of Asynchronous-SGD Algorithms [45.90015262911875]
We analyze asynchronous-type algorithms for distributed SGD in the heterogeneous setting.
As a by-product of our analysis, we also demonstrate guarantees for gradient-type algorithms such as SGD with random tightness.
arXiv Detail & Related papers (2023-10-31T13:44:53Z) - Convergence Analysis of Decentralized ASGD [1.8710230264817358]
We present a novel convergence-rate analysis for decentralized asynchronous SGD (DASGD) which does not require partial synchronization among nodes nor restrictive network topologies.
Our convergence proof holds for a fixed stepsize and any nonsmooth, homogeneous, L-shaped objective function.
arXiv Detail & Related papers (2023-09-07T14:50:31Z) - Towards Faster Non-Asymptotic Convergence for Diffusion-Based Generative
Models [49.81937966106691]
We develop a suite of non-asymptotic theory towards understanding the data generation process of diffusion models.
In contrast to prior works, our theory is developed based on an elementary yet versatile non-asymptotic approach.
arXiv Detail & Related papers (2023-06-15T16:30:08Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (emphi.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(frac1sqrtnT + frac1(k/d)2 T)$ with the communication cost of $O(k log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - DR-DSGD: A Distributionally Robust Decentralized Learning Algorithm over
Graphs [54.08445874064361]
We propose to solve a regularized distributionally robust learning problem in the decentralized setting.
By adding a Kullback-Liebler regularization function to the robust min-max optimization problem, the learning problem can be reduced to a modified robust problem.
We show that our proposed algorithm can improve the worst distribution test accuracy by up to $10%$.
arXiv Detail & Related papers (2022-08-29T18:01:42Z) - Multi-task Learning of Order-Consistent Causal Graphs [59.9575145128345]
We consider the problem of discovering $K related Gaussian acyclic graphs (DAGs)
Under multi-task learning setting, we propose a $l_1/l$-regularized maximum likelihood estimator (MLE) for learning $K$ linear structural equation models.
We theoretically show that the joint estimator, by leveraging data across related tasks, can achieve a better sample complexity for recovering the causal order.
arXiv Detail & Related papers (2021-11-03T22:10:18Z) - Distributed stochastic optimization with large delays [59.95552973784946]
One of the most widely used methods for solving large-scale optimization problems is distributed asynchronous gradient descent (DASGD)
We show that DASGD converges to a global optimal implementation model under same delay assumptions.
arXiv Detail & Related papers (2021-07-06T21:59:49Z) - DaSGD: Squeezing SGD Parallelization Performance in Distributed Training
Using Delayed Averaging [4.652668321425679]
Minibatch gradient descent (SGD) algorithm requires workers to halt forward/back propagations.
DaSGD parallelizes SGD and forward/back propagations to hide 100% of the communication overhead.
arXiv Detail & Related papers (2020-05-31T05:43:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.