Acceleration for Compressed Gradient Descent in Distributed and
Federated Optimization
- URL: http://arxiv.org/abs/2002.11364v2
- Date: Thu, 25 Jun 2020 21:36:17 GMT
- Title: Acceleration for Compressed Gradient Descent in Distributed and
Federated Optimization
- Authors: Zhize Li and Dmitry Kovalev and Xun Qian and Peter Richtárik
- Abstract summary: We propose the first accelerated compressed gradient descent (ACGD) methods.
We show that ACGD enjoys the rate $O\Big((1+\omega)\sqrt{\frac{L}{\mu}}\log \frac{1}{\epsilon}\Big)$ for $\mu$-strongly convex problems.
We also propose a distributed variant of ACGD (called ADIANA) and prove the convergence rate $\widetilde{O}\Big(\omega+\sqrt{\frac{L}{\mu}}+\sqrt{\big(\frac{\omega}{n}+\sqrt{\frac{\omega}{n}}\big)\frac{\omega L}{\mu}}\Big)$.
- Score: 31.066505042748826
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the high communication cost in distributed and federated learning
problems, methods relying on compression of communicated messages are becoming
increasingly popular. While in other contexts the best performing gradient-type
methods invariably rely on some form of acceleration/momentum to reduce the
number of iterations, there are no methods which combine the benefits of both
gradient compression and acceleration. In this paper, we remedy this situation
and propose the first accelerated compressed gradient descent (ACGD) methods.
In the single machine regime, we prove that ACGD enjoys the rate
$O\Big((1+\omega)\sqrt{\frac{L}{\mu}}\log \frac{1}{\epsilon}\Big)$ for
$\mu$-strongly convex problems and
$O\Big((1+\omega)\sqrt{\frac{L}{\epsilon}}\Big)$ for convex problems,
respectively, where $\omega$ is the compression parameter. Our results improve
upon the existing non-accelerated rates $O\Big((1+\omega)\frac{L}{\mu}\log
\frac{1}{\epsilon}\Big)$ and $O\Big((1+\omega)\frac{L}{\epsilon}\Big)$,
respectively, and recover the optimal rates of accelerated gradient descent as
a special case when no compression ($\omega=0$) is applied. We further propose
a distributed variant of ACGD (called ADIANA) and prove the convergence rate
$\widetilde{O}\Big(\omega+\sqrt{\frac{L}{\mu}}+\sqrt{\big(\frac{\omega}{n}+\sqrt{\frac{\omega}{n}}\big)\frac{\omega
L}{\mu}}\Big)$, where $n$ is the number of devices/workers and $\widetilde{O}$
hides the logarithmic factor $\log \frac{1}{\epsilon}$. This improves upon the
previous best result $\widetilde{O}\Big(\omega + \frac{L}{\mu}+\frac{\omega
L}{n\mu} \Big)$ achieved by the DIANA method of Mishchenko et al. (2019).
Finally, we conduct several experiments on real-world datasets which
corroborate our theoretical results and confirm the practical superiority of
our accelerated methods.
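The compression parameter $\omega$ above refers to the standard unbiased compression model: a randomized operator $\mathcal{C}$ with $\mathbb{E}[\mathcal{C}(x)]=x$ and $\mathbb{E}\|\mathcal{C}(x)-x\|^2\le\omega\|x\|^2$. Random-$k$ sparsification with rescaling is the canonical example, giving $\omega=d/k-1$. The sketch below is only an illustration under these assumptions (not the authors' ACGD/ADIANA code): it shows such a compressor inside a plain, non-accelerated compressed gradient descent loop, with a conservative heuristic step size $1/((1+\omega)L)$ and a toy quadratic objective chosen here for demonstration.

```python
import numpy as np

def rand_k_compress(x: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Unbiased random-k sparsification: keep k coordinates, rescale by d/k.

    Satisfies E[C(x)] = x and E||C(x) - x||^2 = (d/k - 1) ||x||^2,
    i.e. omega = d/k - 1 in the notation of the abstract.
    """
    d = x.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(x)
    out[idx] = (d / k) * x[idx]   # rescaling keeps the compressor unbiased
    return out

def compressed_gd(grad, x0, k, L, num_steps=5000, seed=0):
    """Plain (non-accelerated) compressed gradient descent.

    The step size 1/((1+omega)L) is a conservative heuristic that accounts
    for the extra variance injected by compression; ACGD adds a Nesterov-style
    extrapolation sequence on top of steps like this one.
    """
    rng = np.random.default_rng(seed)
    d = x0.size
    omega = d / k - 1.0
    lr = 1.0 / ((1.0 + omega) * L)
    x = x0.copy()
    for _ in range(num_steps):
        g = rand_k_compress(grad(x), k, rng)  # only k coordinates would be communicated
        x = x - lr * g
    return x

if __name__ == "__main__":
    # Toy mu-strongly convex quadratic f(x) = 0.5 * sum_i a_i x_i^2 with
    # a_i = 1,...,d, so mu = 1 and L = d.
    d, k = 100, 10                 # omega = d/k - 1 = 9
    a = np.arange(1, d + 1, dtype=float)
    grad = lambda x: a * x         # gradient of the diagonal quadratic
    x = compressed_gd(grad, np.ones(d), k=k, L=float(d))
    print("final distance to optimum:", np.linalg.norm(x))
```

For instance, with $\omega = 9$ and $L/\mu = 10^4$, the non-accelerated bound $O\big((1+\omega)\frac{L}{\mu}\log\frac{1}{\epsilon}\big)$ scales like $10^5\log\frac{1}{\epsilon}$, while the accelerated bound $O\big((1+\omega)\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\big)$ scales like $10^3\log\frac{1}{\epsilon}$; with $\omega=0$ (i.e., $k=d$) the compressor is the identity, the loop reduces to ordinary gradient descent, and the accelerated rates recover those of accelerated gradient descent.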
Related papers
- On the Complexity of Finite-Sum Smooth Optimization under the
Polyak-Łojasiewicz Condition [14.781921087738967]
This paper considers the optimization problem of the form $\min_{{\bf x}\in\mathbb{R}^d} f({\bf x}) \triangleq \frac{1}{n}\sum_{i=1}^{n} f_i({\bf x})$, where $f(\cdot)$ satisfies the Polyak-Łojasiewicz (PL) condition with parameter $\mu$ and $\{f_i(\cdot)\}_{i=1}^{n}$ is $L$-mean-squared smooth.
arXiv Detail & Related papers (2024-02-04T17:14:53Z) - On the $O(\frac{\sqrt{d}}{T^{1/4}})$ Convergence Rate of RMSProp and Its Momentum Extension Measured by $\ell_1$ Norm [59.65871549878937]
This paper considers RMSProp and its momentum extension and establishes the convergence rate $\frac{1}{T}\sum_{k=1}^{T}\mathbb{E}\big[\|\nabla f(x_k)\|_1\big] \leq O\big(\frac{\sqrt{d}}{T^{1/4}}\big)$.
Our convergence rate matches the lower bound with respect to all the coefficients except the dimension $d$.
Our convergence rate can be considered to be analogous to the $\frac{1}{T}\sum_{k=1}^{T}$.
arXiv Detail & Related papers (2024-02-01T07:21:32Z) - A Unified Framework for Uniform Signal Recovery in Nonlinear Generative
Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - The Approximate Degree of DNF and CNF Formulas [95.94432031144716]
For every $\delta>0$, we construct CNF and DNF formulas of polynomial size with approximate degree $\Omega(n^{1-\delta})$, essentially matching the trivial upper bound of $n$.
We show that for every $\delta>0$, these models require $\Omega(n^{1-\delta})$, $\Omega\big((n/4^{k}k^{2})^{1-\delta}\big)$, and $\Omega\big((n/4^{k}k^{2})^{1-\delta}\big)$, respectively.
arXiv Detail & Related papers (2022-09-04T10:01:39Z) - CANITA: Faster Rates for Distributed Convex Optimization with
Communication Compression [17.259824817932294]
We propose a compressed and accelerated gradient method for distributed optimization, which we call CANITA.
Our results show that as long as the number of devices $n$ is large, CANITA achieves the faster convergence rate $O\Big(\sqrt{\frac{L}{\epsilon}}\Big)$, i.e., the number of communication rounds is $O\Big(\sqrt{\frac{L}{\epsilon}}\Big)$.
arXiv Detail & Related papers (2021-07-20T13:01:56Z) - Accelerated Gradient Tracking over Time-varying Graphs for Decentralized Optimization [59.65871549878937]
We prove that the practical single-loop accelerated gradient tracking needs $O\big(\big(\frac{\gamma}{1-\sigma_\gamma}\big)^{2}\sqrt{\frac{L}{\epsilon}}\big)$.
Our convergence rates improve significantly over the ones of $O\big(\frac{1}{\epsilon^{5/7}}\big)$ and $O\big(\big(\frac{L}{\mu}\big)^{5/7}\frac{1}{(1-\sigma)^{1.5}}\log\frac{1}{\epsilon}\big)$.
arXiv Detail & Related papers (2021-04-06T15:34:14Z) - ANITA: An Optimal Loopless Accelerated Variance-Reduced Gradient Method [17.259824817932294]
We propose a novel accelerated variance-reduced method called ANITA for finite-sum optimization.
In the general convex setting, ANITA achieves the convergence result $O\big(n\min\big\{1+\log\frac{1}{\epsilon\sqrt{n}},\,\log\sqrt{n}\big\}+\sqrt{\frac{nL}{\epsilon}}\big)$.
In the strongly convex setting, we also show that ANITA can achieve the optimal convergence result $O\Big(\big(n+\sqrt{\frac{nL}{\mu}}\big)\log\frac{nL}{\mu}\Big)$, matching the lower bound.
arXiv Detail & Related papers (2021-03-21T08:14:40Z) - Hybrid Stochastic-Deterministic Minibatch Proximal Gradient:
Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
We show that HSDMPG can attain an error of order $\mathcal{O}\big(1/\sqrt{tn}\big)$, which is at the order of the excess error of a learning model, and we prove the same guarantee for the loss functions considered.
arXiv Detail & Related papers (2020-09-18T02:18:44Z) - Revisiting EXTRA for Smooth Distributed Optimization [70.65867695317633]
We give a sharp complexity analysis for EXTRA with the improved complexity $O\left(\left(\frac{L}{\mu}+\frac{1}{1-\sigma_2(W)}\right)\log\frac{1}{\epsilon(1-\sigma_2(W))}\right)$.
The communication complexities of the accelerated EXTRA are only worse by factors of $\left(\log\frac{L}{\mu(1-\sigma_2(W))}\right)$ and $\left(\log\frac{1}{\epsilon(1-\sigma_2(W))}\right)$.
arXiv Detail & Related papers (2020-02-24T08:07:08Z)