Fast decentralized non-convex finite-sum optimization with recursive
variance reduction
- URL: http://arxiv.org/abs/2008.07428v6
- Date: Sat, 18 Sep 2021 17:26:49 GMT
- Title: Fast decentralized non-convex finite-sum optimization with recursive
variance reduction
- Authors: Ran Xin and Usman A. Khan and Soummya Kar
- Abstract summary: We describe a first-order gradient method, called GT-SARAH, that employs a SARAH-type variance reduction technique.
In particular, in a big-data regime such that $n = O(N^{\frac{1}{2}}(1-\lambda)^{3})$, this complexity reduces to $O(N^{\frac{1}{2}}L\epsilon^{-2})$, independent of the network topology.
In addition, we show that appropriate choices of the local minibatch size balance the trade-off between gradient complexity and communication complexity.
- Score: 19.540926205375857
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper considers decentralized minimization of $N:=nm$ smooth non-convex
cost functions equally divided over a directed network of $n$ nodes.
Specifically, we describe a stochastic first-order gradient method, called
GT-SARAH, that employs a SARAH-type variance reduction technique and gradient
tracking (GT) to address the stochastic and decentralized nature of the
problem. We show that GT-SARAH, with appropriate algorithmic parameters, finds
an $\epsilon$-accurate first-order stationary point with
$O\big(\max\big\{N^{\frac{1}{2}},n(1-\lambda)^{-2},n^{\frac{2}{3}}m^{\frac{1}{3}}(1-\lambda)^{-1}\big\}L\epsilon^{-2}\big)$
gradient complexity, where ${(1-\lambda)\in(0,1]}$ is the spectral gap of the
network weight matrix and $L$ is the smoothness parameter of the cost
functions. This gradient complexity outperforms that of the existing
decentralized stochastic gradient methods. In particular, in a big-data regime
such that ${n = O(N^{\frac{1}{2}}(1-\lambda)^{3})}$, this gradient complexity
further reduces to ${O(N^{\frac{1}{2}}L\epsilon^{-2})}$, independent of the
network topology, and matches that of the centralized near-optimal
variance-reduced methods. Moreover, in this regime GT-SARAH achieves a
non-asymptotic linear speedup, in that the total number of gradient
computations at each node is reduced by a factor of $1/n$ compared to the
centralized near-optimal algorithms that perform all gradient computations at a
single node. To the best of our knowledge, GT-SARAH is the first algorithm that
achieves this property. In addition, we show that appropriate choices of local
minibatch size balance the trade-offs between the gradient and communication
complexity of GT-SARAH. Over an infinite time horizon, we establish that all nodes
in GT-SARAH asymptotically achieve consensus and converge to a first-order
stationary point in the almost sure and mean-squared sense.
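As a concrete illustration of the two ingredients named above, the following is a minimal Python/NumPy sketch of a GT-SARAH-style node update: a SARAH-type recursive gradient estimator built from local minibatches, combined with gradient tracking over a doubly stochastic mixing matrix $W$. The synthetic least-squares losses, the ring topology, the step size, the minibatch size, and the inner-loop length are illustrative assumptions, not the paper's parameter choices, and the sketch carries none of the stated convergence guarantees.

```python
# Minimal sketch of a GT-SARAH-style update: SARAH-type recursive variance
# reduction at each node plus gradient tracking over a mixing matrix W.
# Assumptions (not from the paper): synthetic least-squares losses, a ring
# topology, step size, minibatch size, and inner-loop length chosen for brevity.
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 8, 50, 10        # n nodes, m local samples per node, dimension d
alpha = 0.05               # illustrative step size

# Node i holds data (A[i], b[i]) defining a smooth local finite-sum loss.
A = rng.normal(size=(n, m, d))
b = rng.normal(size=(n, m))

def grad(i, idx, x):
    """Minibatch gradient of node i's local loss at x over sample indices idx."""
    Ai, bi = A[i, idx], b[i, idx]
    return Ai.T @ (Ai @ x - bi) / len(idx)

# Doubly stochastic mixing matrix for a lazy ring (spectral gap 1 - lambda > 0).
W = 0.5 * np.eye(n)
for i in range(n):
    W[i, (i - 1) % n] += 0.25
    W[i, (i + 1) % n] += 0.25

x = np.zeros((n, d))                                                # local iterates
v = np.stack([grad(i, np.arange(m), x[i]) for i in range(n)])       # SARAH estimators
y = v.copy()                                                        # gradient trackers

for t in range(1, 501):
    x_new = W @ x - alpha * y          # consensus step + descent along tracked gradient
    if t % m == 0:                     # outer loop: refresh with full local gradients
        v_new = np.stack([grad(i, np.arange(m), x_new[i]) for i in range(n)])
    else:                              # inner loop: SARAH recursion on a local minibatch
        v_new = np.empty_like(v)
        for i in range(n):
            idx = rng.integers(m, size=2)
            v_new[i] = grad(i, idx, x_new[i]) - grad(i, idx, x[i]) + v[i]
    y = W @ y + v_new - v              # gradient tracking of the network-average estimator
    x, v = x_new, v_new

x_bar = x.mean(axis=0)
g_bar = np.mean([grad(i, np.arange(m), x_bar) for i in range(n)], axis=0)
print("stationarity ||grad f(x_bar)||:", np.linalg.norm(g_bar))
print("consensus error ||x_i - x_bar||:", np.linalg.norm(x - x_bar))
```

In the big-data regime ${n = O(N^{\frac{1}{2}}(1-\lambda)^{3})}$ discussed above, the $N^{\frac{1}{2}}$ term dominates the complexity bound, so the gradient work per node scales as roughly $N^{\frac{1}{2}}L\epsilon^{-2}/n$, which is the linear speedup over centralized near-optimal methods that the abstract refers to.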
Related papers
- Graphon Particle Systems, Part II: Dynamics of Distributed Stochastic Continuum Optimization [5.685037987395183]
We study the distributed optimization problem over a graphon with a continuum of nodes.
We propose gradient descent and gradient tracking algorithms over the graphon.
We show that by choosing the time-varying algorithm gains properly, all nodes' states achieve $\mathcal{L}^{\infty}$-consensus for a connected graphon.
arXiv Detail & Related papers (2024-07-03T02:47:39Z) - ReSQueing Parallel and Private Stochastic Convex Optimization [59.53297063174519]
We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density (a generic Gaussian-smoothing gradient identity, distinct from ReSQue itself, is sketched after this list).
We develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings.
arXiv Detail & Related papers (2023-01-01T18:51:29Z) - An Optimal Stochastic Algorithm for Decentralized Nonconvex Finite-sum
Optimization [25.21457349137344]
We prove that DEAREST requires at most $\mathcal{O}(mn + \sqrt{mn}\,L\varepsilon^{-2})$ incremental first-order oracle (IFO) calls and $\mathcal{O}(L\varepsilon^{-2}/\sqrt{1-\lambda_W})$ communication rounds.
arXiv Detail & Related papers (2022-10-25T11:37:11Z) - Sharper Convergence Guarantees for Asynchronous SGD for Distributed and
Federated Learning [77.22019100456595]
We study a training algorithm for distributed computation in which workers operate with varying communication frequency.
In this work, we obtain a tighter convergence rate that depends on the noise variance $\sigma^2$ and on the average, rather than worst-case, delay.
We also show that the heterogeneity term in the rate is affected by the average delay within each worker.
arXiv Detail & Related papers (2022-06-16T17:10:57Z) - A Variance Controlled Stochastic Method with Biased Estimation for
Faster Non-convex Optimization [0.0]
We propose a new technique, variance controlled stochastic gradient (VCSG), to improve the performance of stochastic variance reduced gradient (SVRG).
A parameter $\lambda$ is introduced in VCSG to avoid over-reducing the variance by SVRG.
VCSG needs $\mathcal{O}(\min\{1/\epsilon^{3/2},\,n^{1/4}/\epsilon\})$ gradient evaluations, which improves the leading gradient complexity (the generic SVRG, SARAH, and hybrid estimator forms are recalled after this list).
arXiv Detail & Related papers (2021-02-19T12:22:56Z) - A hybrid variance-reduced method for decentralized stochastic non-convex
optimization [15.447966950703947]
The GT-HSGD algorithm implements specialized local hybrid stochastic gradient estimators that are fused over the network to track the global gradient.
GT-HSGD achieves a network topology-independent oracle complexity, with an $O(n^{-1})$ per-node scaling, when the required error tolerance $\epsilon$ is small enough.
arXiv Detail & Related papers (2021-02-12T20:13:05Z) - A New Framework for Variance-Reduced Hamiltonian Monte Carlo [88.84622104944503]
We propose a new framework of variance-reduced Hamiltonian Monte Carlo (HMC) methods for sampling from an $L$-smooth and $m$-strongly log-concave distribution.
We show that HMC methods based on unbiased gradient estimators, including SAGA and SVRG, achieve the highest gradient efficiency with small batch sizes.
Experimental results on both synthetic and real-world benchmark data show that our new framework significantly outperforms the full-gradient and stochastic-gradient HMC approaches.
arXiv Detail & Related papers (2021-02-09T02:44:24Z) - Hybrid Stochastic-Deterministic Minibatch Proximal Gradient:
Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
We prove that HSDMPG can attain error bounds on the order of the excess error of a learning model.
arXiv Detail & Related papers (2020-09-18T02:18:44Z) - On the Almost Sure Convergence of Stochastic Gradient Descent in
Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD).
We show that SGD avoids strict saddle points/manifolds with probability $1$ for the step-size policies considered.
arXiv Detail & Related papers (2020-06-19T14:11:26Z) - S-ADDOPT: Decentralized stochastic first-order optimization over
directed graphs [16.96562173221624]
Decentralized convex optimization is considered to minimize a sum of smooth and strongly convex cost functions when the functions are distributed over the nodes of a directed network.
In particular, we propose the S-ADDOPT algorithm that assumes a first-order oracle at each node.
For decaying step-sizes $\mathcal{O}(1/k)$, we show that S-ADDOPT converges sublinearly to the exact solution at $\mathcal{O}(1/k)$ and that its convergence is network-independent.
arXiv Detail & Related papers (2020-05-15T21:14:22Z) - Complexity of Finding Stationary Points of Nonsmooth Nonconvex Functions [84.49087114959872]
We provide the first non-asymptotic analysis for finding stationary points of nonsmooth, nonconvex functions.
In particular, we study Hadamard semi-differentiable functions, perhaps the largest class of nonsmooth functions.
arXiv Detail & Related papers (2020-02-10T23:23:04Z)
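Several of the entries above (and the main paper) refer to variance-reduced gradient estimators by name. As a generic recap for a finite sum $f = \frac{1}{n}\sum_{i=1}^{n} f_i$ with a uniformly sampled index $i_t$, the standard single-sample forms are shown below; the cited papers add minibatching, the mixing parameter $\lambda$ of VCSG, or decentralized gradient-tracking corrections on top of these.

$$v_t^{\text{SVRG}} = \nabla f_{i_t}(x_t) - \nabla f_{i_t}(\tilde{x}) + \nabla f(\tilde{x}), \qquad \tilde{x} \text{ a periodically refreshed snapshot},$$

$$v_t^{\text{SARAH}} = \nabla f_{i_t}(x_t) - \nabla f_{i_t}(x_{t-1}) + v_{t-1}, \qquad \text{reset to } \nabla f(x_t) \text{ at the start of each outer loop},$$

$$v_t^{\text{hybrid}} = \nabla f_{i_t}(x_t) + (1-\beta)\big(v_{t-1} - \nabla f_{i_t}(x_{t-1})\big), \qquad \beta \in [0,1],$$

where $\beta = 0$ recovers the SARAH recursion and $\beta = 1$ recovers plain stochastic gradient. GT-SARAH uses the SARAH recursion locally at each node, while the GT-HSGD entry above builds on a hybrid estimator of this type.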
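The ReSQueing entry above concerns gradient estimation for a function convolved with a Gaussian density. As a small hedged illustration of that object only (it is not the paper's ReSQue estimator), the snippet below uses the standard smoothing identity $\nabla (f * \rho_\sigma)(x) = \mathbb{E}_{z \sim \mathcal{N}(0, \sigma^2 I)}\big[\, z\, f(x+z)\,\big] / \sigma^2$ on an arbitrary quadratic test function; the dimension, smoothing width, and sample budget are arbitrary choices.

```python
# Monte Carlo estimate of the gradient of a Gaussian-smoothed function.
# Illustrates the standard smoothing identity only; it is NOT the ReSQue
# (Reweighted Query) estimator of the cited paper.
import numpy as np

rng = np.random.default_rng(1)
d, sigma, samples = 5, 0.1, 200_000   # dimension, smoothing width, MC budget

def f(x):
    """Simple quadratic test function (arbitrary choice)."""
    return 0.5 * np.sum(x * x, axis=-1)

x = rng.normal(size=d)
z = rng.normal(scale=sigma, size=(samples, d))

# E[ z * (f(x+z) - f(x)) ] / sigma^2 ; subtracting f(x) is a standard
# variance-reduction trick that leaves the expectation unchanged (E[z] = 0).
est = (z * (f(x + z) - f(x))[:, None]).mean(axis=0) / sigma**2

# For this quadratic, grad(f * rho_sigma)(x) = x, so the error should be small.
print("estimation error vs. true smoothed gradient:", np.linalg.norm(est - x))
```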
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.