Variance Reduced EXTRA and DIGing and Their Optimal Acceleration for
Strongly Convex Decentralized Optimization
- URL: http://arxiv.org/abs/2009.04373v3
- Date: Sat, 27 Aug 2022 12:55:53 GMT
- Title: Variance Reduced EXTRA and DIGing and Their Optimal Acceleration for
Strongly Convex Decentralized Optimization
- Authors: Huan Li, Zhouchen Lin, Yongchun Fang
- Abstract summary: We extend the widely used EXTRA and DIGing methods with variance reduction (VR), and propose two methods: VR-EXTRA and VR-DIGing.
The proposed VR-EXTRA requires $O((\kappa_s+n)\log\frac{1}{\epsilon})$ stochastic gradient evaluations and $O((\kappa_b+\kappa_c)\log\frac{1}{\epsilon})$ communication rounds.
The proposed VR-DIGing has a slightly higher communication cost of $O((\kappa_b+\kappa_c^2)\log\frac{1}{\epsilon})$.
- Score: 69.49313819343992
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study stochastic decentralized optimization for the problem of training
machine learning models with large-scale distributed data. We extend the widely
used EXTRA and DIGing methods with variance reduction (VR), and propose two
methods: VR-EXTRA and VR-DIGing. The proposed VR-EXTRA requires
$O((\kappa_s+n)\log\frac{1}{\epsilon})$ stochastic gradient evaluations and
$O((\kappa_b+\kappa_c)\log\frac{1}{\epsilon})$ communication rounds to reach
precision $\epsilon$, which are the best complexities among the non-accelerated
gradient-type methods, where $\kappa_s$ and $\kappa_b$ are the stochastic
condition number and batch condition number for strongly convex and smooth
problems, respectively, $\kappa_c$ is the condition number of the communication
network, and $n$ is the sample size on each distributed node. The proposed
VR-DIGing has a slightly higher communication cost of
$O((\kappa_b+\kappa_c^2)\log\frac{1}{\epsilon})$. Our stochastic gradient
computation complexities are the same as the ones of single-machine VR methods,
such as SAG, SAGA, and SVRG, and our communication complexities keep the same
as those of EXTRA and DIGing, respectively. To further speed up the
convergence, we also propose the accelerated VR-EXTRA and VR-DIGing with both
the optimal $O((\sqrt{n\kappa_s}+n)\log\frac{1}{\epsilon})$ stochastic gradient
computation complexity and $O(\sqrt{\kappa_b\kappa_c}\log\frac{1}{\epsilon})$
communication complexity. Our stochastic gradient computation complexity is
also the same as the ones of single-machine accelerated VR methods, such as
Katyusha, and our communication complexity keeps the same as those of
accelerated full batch decentralized methods, such as MSDA.
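The core construction described above is to plug a variance-reduced stochastic gradient estimator into a gradient-tracking-style decentralized update. Below is a minimal illustrative sketch of that combination on a toy decentralized least-squares problem; the data, the uniform mixing matrix W, the step size, and the snapshot schedule are all assumptions made for illustration, and the update is a plain DIGing-plus-SVRG sketch rather than the paper's exact VR-EXTRA or VR-DIGing recursions.

```python
# Illustrative sketch only: DIGing-style gradient tracking combined with an
# SVRG-style variance-reduced local gradient estimator, on a toy decentralized
# least-squares problem. Problem sizes, mixing matrix, step size, and snapshot
# schedule are assumptions; this is not the paper's exact VR-EXTRA/VR-DIGing update.
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 4, 50, 5                       # nodes, samples per node, dimension
A = rng.normal(size=(m, n, d))           # local data A_i
b = rng.normal(size=(m, n))              # local targets b_i
W = np.full((m, m), 1.0 / m)             # doubly stochastic mixing matrix (complete graph)

def sgrad(i, j, x):
    """Stochastic gradient of 0.5*(a_ij^T x - b_ij)^2 at node i, sample j."""
    a = A[i, j]
    return (a @ x - b[i, j]) * a

def fgrad(i, x):
    """Full local gradient of f_i(x) = (1/2n)||A_i x - b_i||^2."""
    return A[i].T @ (A[i] @ x - b[i]) / n

def vr_grad(x, snap, mu):
    """SVRG-style variance-reduced local gradient estimates, one row per node."""
    g = np.empty_like(x)
    for i in range(m):
        j = rng.integers(n)
        g[i] = sgrad(i, j, x[i]) - sgrad(i, j, snap[i]) + mu[i]
    return g

eta = 0.02                               # step size (assumption)
x = np.zeros((m, d))                     # one local iterate per node
snap = x.copy()                          # SVRG snapshot points
mu = np.stack([fgrad(i, snap[i]) for i in range(m)])
g_prev = vr_grad(x, snap, mu)
y = g_prev.copy()                        # gradient-tracking variable

for k in range(1, 2001):
    x = W @ x - eta * y                  # communication round + local step
    if k % n == 0:                       # refresh snapshots every n inner steps
        snap = x.copy()
        mu = np.stack([fgrad(i, snap[i]) for i in range(m)])
    g_new = vr_grad(x, snap, mu)
    y = W @ y + g_new - g_prev           # track the average of the VR estimators
    g_prev = g_new

x_star = np.linalg.lstsq(A.reshape(m * n, d), b.reshape(m * n), rcond=None)[0]
print("consensus residual:", np.linalg.norm(x - x.mean(axis=0)))
print("distance to centralized solution:", np.linalg.norm(x.mean(axis=0) - x_star))
```

In this sketch each mixing step `W @ x` (and `W @ y`) stands in for one communication round, while each call to `vr_grad` costs one stochastic gradient per node plus a periodic full local gradient for the snapshot, mirroring the two complexity measures (communication rounds and stochastic gradient evaluations) tracked in the abstract.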
Related papers
- Stochastic Constrained Decentralized Optimization for Machine Learning with Fewer Data Oracles: a Gradient Sliding Approach [32.36073823372713]
In machine-learning models, the algorithm has to communicate with the data center and sample data to compute its gradient.
This gives rise to the need for a decentralized optimization algorithm that is communication-efficient and minimizes the number of gradient computations.
We propose a primal-dual sliding with conditional gradient sliding framework, which is communication-efficient and achieves an $\varepsilon$-approximate solution.
arXiv Detail & Related papers (2024-04-03T06:55:59Z) - A Quadratic Synchronization Rule for Distributed Deep Learning [66.68264684667562]
This work proposes a theory-grounded method for determining $H$, named the Quadratic Synchronization Rule (QSR).
Experiments on ResNet and ViT show that local gradient methods with QSR consistently improve the test accuracy over other synchronization strategies.
arXiv Detail & Related papers (2023-10-22T21:38:57Z) - Variance-reduced accelerated methods for decentralized stochastic
double-regularized nonconvex strongly-concave minimax problems [7.5573375809946395]
We consider a network of $m$ computing agents that collaborate via peer-to-peer communications.
Our algorithmic framework introduces a Lagrangian multiplier to eliminate the consensus constraint on the dual variable.
To the best of our knowledge, this is the first work which provides convergence guarantees for NCSC minimax problems with general nonsmooth regularizers applied to both the primal and dual variables.
arXiv Detail & Related papers (2023-07-14T01:32:16Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with the communication cost of $O(k\log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - Sharper Convergence Guarantees for Asynchronous SGD for Distributed and
Federated Learning [77.22019100456595]
We show a training algorithm for distributed computation workers with varying communication frequency.
In this work, we obtain a tighter convergence rate of $\mathcal{O}(\sigma^2\epsilon^{-2} + \tau_{avg}\epsilon^{-1})$.
We also show that the heterogeneity term in rate is affected by the average delay within each worker.
arXiv Detail & Related papers (2022-06-16T17:10:57Z) - Stochastic Gradient Methods with Compressed Communication for
Decentralized Saddle Point Problems [0.0]
We develop two compression based gradient algorithms to solve a class of non-smooth strongly convex-strongly concave saddle-point problems.
Our first algorithm is a Restart-based Decentralized Proximal Gradient method with Compression (C-RDPSG) for general settings.
Next, we present a Decentralized Proximal Variance Reduced Gradient algorithm with Compression (C-DPSVRG) for finite sum setting.
arXiv Detail & Related papers (2022-05-28T15:17:19Z) - ProxSkip: Yes! Local Gradient Steps Provably Lead to Communication
Acceleration! Finally! [6.170898159041277]
We introduce ProxSkip, a surprisingly simple and provably efficient method for minimizing the sum of a smooth ($f$) and an expensive nonsmooth ($\psi$) function.
Our main motivation comes from federated learning, where evaluation of the gradient operator corresponds to taking a local GD step independently on all devices.
arXiv Detail & Related papers (2022-02-18T18:56:05Z) - Faster Single-loop Algorithms for Minimax Optimization without Strong
Concavity [30.97847679999505]
Gradient descent ascent (GDA) is the simplest single-loop algorithm for nonstationary minimax optimization.
Recent work shows inferior convergence rates of GDA in adversarial networks.
Two alternative single-loop algorithms -- alternating GDA and smoothed GDA -- are presented.
arXiv Detail & Related papers (2021-12-10T15:35:42Z) - AdaGDA: Faster Adaptive Gradient Descent Ascent Methods for Minimax
Optimization [104.96004056928474]
We propose a class of faster adaptive gradient descent ascent methods for nonconvex-strongly-concave minimax problems.
We show that our method reaches a lower sample complexity of $O(\kappa^{2.5}\epsilon^{-3})$ with the mini-batch size $O(\kappa)$.
arXiv Detail & Related papers (2021-06-30T14:47:09Z) - Hybrid Stochastic-Deterministic Minibatch Proximal Gradient:
Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
We prove that HSDMPG can attain an optimization error of order $\mathcal{O}\big(1/\sqrt{n}\big)$, which is at the order of the intrinsic excess error of a learning model.
arXiv Detail & Related papers (2020-09-18T02:18:44Z)