Variance Reduced EXTRA and DIGing and Their Optimal Acceleration for
Strongly Convex Decentralized Optimization
- URL: http://arxiv.org/abs/2009.04373v3
- Date: Sat, 27 Aug 2022 12:55:53 GMT
- Title: Variance Reduced EXTRA and DIGing and Their Optimal Acceleration for
Strongly Convex Decentralized Optimization
- Authors: Huan Li, Zhouchen Lin, Yongchun Fang
- Abstract summary: We extend the widely used EXTRA and DIGing methods with variance reduction (VR), and propose two methods: VR-EXTRA and VR-DIGing.
The proposed VR-EXTRA requires $O((\kappa_s+n)\log\frac{1}{\epsilon})$ stochastic gradient evaluations and $O((\kappa_b+\kappa_c)\log\frac{1}{\epsilon})$ communication rounds.
The proposed VR-DIGing has a slightly higher communication cost of $O((\kappa_b+\kappa_c^2)\log\frac{1}{\epsilon})$.
- Score: 69.49313819343992
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study stochastic decentralized optimization for the problem of training
machine learning models with large-scale distributed data. We extend the widely
used EXTRA and DIGing methods with variance reduction (VR), and propose two
methods: VR-EXTRA and VR-DIGing. The proposed VR-EXTRA requires
$O((\kappa_s+n)\log\frac{1}{\epsilon})$ stochastic gradient evaluations and
$O((\kappa_b+\kappa_c)\log\frac{1}{\epsilon})$ communication rounds to reach
precision $\epsilon$, which are the best complexities among the non-accelerated
gradient-type methods, where $\kappa_s$ and $\kappa_b$ are the stochastic
condition number and batch condition number for strongly convex and smooth
problems, respectively, $\kappa_c$ is the condition number of the communication
network, and $n$ is the sample size on each distributed node. The proposed
VR-DIGing has a slightly higher communication cost of
$O((\kappa_b+\kappa_c^2)\log\frac{1}{\epsilon})$. Our stochastic gradient
computation complexities are the same as the ones of single-machine VR methods,
such as SAG, SAGA, and SVRG, and our communication complexities keep the same
as those of EXTRA and DIGing, respectively. To further speed up the
convergence, we also propose the accelerated VR-EXTRA and VR-DIGing with both
the optimal $O((\sqrt{n\kappa_s}+n)\log\frac{1}{\epsilon})$ stochastic gradient
computation complexity and $O(\sqrt{\kappa_b\kappa_c}\log\frac{1}{\epsilon})$
communication complexity. Our stochastic gradient computation complexity is
also the same as the ones of single-machine accelerated VR methods, such as
Katyusha, and our communication complexity keeps the same as those of
accelerated full batch decentralized methods, such as MSDA.
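The core construction described above is to plug a variance-reduced stochastic gradient estimator into a gradient-tracking-style decentralized update. Below is a minimal illustrative sketch of that combination on a toy decentralized least-squares problem; the data, the uniform mixing matrix W, the step size, and the snapshot schedule are all assumptions made for illustration, and the update is a plain DIGing-plus-SVRG sketch rather than the paper's exact VR-EXTRA or VR-DIGing recursions.

```python
# Illustrative sketch only: DIGing-style gradient tracking combined with an
# SVRG-style variance-reduced local gradient estimator, on a toy decentralized
# least-squares problem. Problem sizes, mixing matrix, step size, and snapshot
# schedule are assumptions; this is not the paper's exact VR-EXTRA/VR-DIGing update.
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 4, 50, 5                       # nodes, samples per node, dimension
A = rng.normal(size=(m, n, d))           # local data A_i
b = rng.normal(size=(m, n))              # local targets b_i
W = np.full((m, m), 1.0 / m)             # doubly stochastic mixing matrix (complete graph)

def sgrad(i, j, x):
    """Stochastic gradient of 0.5*(a_ij^T x - b_ij)^2 at node i, sample j."""
    a = A[i, j]
    return (a @ x - b[i, j]) * a

def fgrad(i, x):
    """Full local gradient of f_i(x) = (1/2n)||A_i x - b_i||^2."""
    return A[i].T @ (A[i] @ x - b[i]) / n

def vr_grad(x, snap, mu):
    """SVRG-style variance-reduced local gradient estimates, one row per node."""
    g = np.empty_like(x)
    for i in range(m):
        j = rng.integers(n)
        g[i] = sgrad(i, j, x[i]) - sgrad(i, j, snap[i]) + mu[i]
    return g

eta = 0.02                               # step size (assumption)
x = np.zeros((m, d))                     # one local iterate per node
snap = x.copy()                          # SVRG snapshot points
mu = np.stack([fgrad(i, snap[i]) for i in range(m)])
g_prev = vr_grad(x, snap, mu)
y = g_prev.copy()                        # gradient-tracking variable

for k in range(1, 2001):
    x = W @ x - eta * y                  # communication round + local step
    if k % n == 0:                       # refresh snapshots every n inner steps
        snap = x.copy()
        mu = np.stack([fgrad(i, snap[i]) for i in range(m)])
    g_new = vr_grad(x, snap, mu)
    y = W @ y + g_new - g_prev           # track the average of the VR estimators
    g_prev = g_new

x_star = np.linalg.lstsq(A.reshape(m * n, d), b.reshape(m * n), rcond=None)[0]
print("consensus residual:", np.linalg.norm(x - x.mean(axis=0)))
print("distance to centralized solution:", np.linalg.norm(x.mean(axis=0) - x_star))
```

In this sketch each mixing step `W @ x` (and `W @ y`) stands in for one communication round, while each call to `vr_grad` costs one stochastic gradient per node plus a periodic full local gradient for the snapshot, mirroring the two complexity measures (communication rounds and stochastic gradient evaluations) tracked in the abstract.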
Related papers
- Stochastic Constrained Decentralized Optimization for Machine Learning with Fewer Data Oracles: a Gradient Sliding Approach [32.36073823372713]
In machine-learning models, the algorithm has to communicate with the data center and sample data to compute its gradient.
This gives rise to the need for a decentralized optimization algorithm that is communication-efficient and minimizes the number of gradient computations.
We propose a primal-dual sliding with conditional gradient sliding framework, which is communication-efficient and achieves an $\varepsilon$-approximate solution.
arXiv Detail & Related papers (2024-04-03T06:55:59Z) - A Quadratic Synchronization Rule for Distributed Deep Learning [66.68264684667562]
This work proposes a theory-grounded method for determining $H$, named the Quadratic Synchronization Rule (QSR).
Experiments on ResNet and ViT show that local gradient methods with QSR consistently improve the test accuracy over other synchronization strategies.
arXiv Detail & Related papers (2023-10-22T21:38:57Z) - Variance-reduced accelerated methods for decentralized stochastic
double-regularized nonconvex strongly-concave minimax problems [7.5573375809946395]
We consider a network of $m$ computing agents that collaborate via peer-to-peer communications.
Our algorithmic framework introduces a Lagrangian multiplier to eliminate the consensus constraint on the dual variable.
To the best of our knowledge, this is the first work which provides convergence guarantees for NCSC minimax problems with general nonsmooth regularizers applied to both the primal and dual variables.
arXiv Detail & Related papers (2023-07-14T01:32:16Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with the communication cost of $O(k\log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - Sharper Convergence Guarantees for Asynchronous SGD for Distributed and
Federated Learning [77.22019100456595]
We show a training algorithm for distributed computation workers with varying communication frequency.
In this work, we obtain a tighter convergence rate of $\mathcal{O}(\sigma^2\epsilon^{-2} + \tau_{avg}\epsilon^{-1})$.
We also show that the heterogeneity term in rate is affected by the average delay within each worker.
arXiv Detail & Related papers (2022-06-16T17:10:57Z) - Stochastic Gradient Methods with Compressed Communication for
Decentralized Saddle Point Problems [0.0]
We develop two compression based gradient algorithms to solve a class of non-smooth strongly convex-strongly concave saddle-point problems.
Our first algorithm is a Restart-based Decentralized Proximal Gradient method with Compression (C-RDPSG) for general settings.
Next, we present a Decentralized Proximal Variance Reduced Gradient algorithm with Compression (C-DPSVRG) for finite sum setting.
arXiv Detail & Related papers (2022-05-28T15:17:19Z) - ProxSkip: Yes! Local Gradient Steps Provably Lead to Communication
Acceleration! Finally! [6.170898159041277]
We introduce ProxSkip, a surprisingly simple and provably efficient method for minimizing the sum of a smooth ($f$) and an expensive nonsmooth ($\psi$) function.
Our main motivation comes from federated learning, where evaluation of the gradient operator corresponds to taking a local GD step independently on all devices.
arXiv Detail & Related papers (2022-02-18T18:56:05Z) - Faster Single-loop Algorithms for Minimax Optimization without Strong
Concavity [30.97847679999505]
Gradient descent ascent (GDA) is the simplest single-loop algorithm for nonstationary minimax optimization.
Recent work shows inferior convergence rates of GDA in adversarial networks.
Two alternative single-loop algorithms -- alternating GDA and smoothed GDA -- are presented.
arXiv Detail & Related papers (2021-12-10T15:35:42Z) - AdaGDA: Faster Adaptive Gradient Descent Ascent Methods for Minimax
Optimization [104.96004056928474]
We propose a class of faster adaptive gradient descent ascent methods for nonconvex-strongly-concave minimax problems.
We show that our method reaches a lower sample complexity of $O(\kappa^{2.5}\epsilon^{-3})$ with the mini-batch size $O(\kappa)$.
arXiv Detail & Related papers (2021-06-30T14:47:09Z) - Hybrid Stochastic-Deterministic Minibatch Proximal Gradient:
Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
We prove that HSDMPG can attain an optimization error of order $\mathcal{O}\big(1/\sqrt{n}\big)$, which is at the order of the intrinsic excess error of a learning model.
arXiv Detail & Related papers (2020-09-18T02:18:44Z)