Acceleration for Compressed Gradient Descent in Distributed and
Federated Optimization
- URL: http://arxiv.org/abs/2002.11364v2
- Date: Thu, 25 Jun 2020 21:36:17 GMT
- Title: Acceleration for Compressed Gradient Descent in Distributed and
Federated Optimization
- Authors: Zhize Li and Dmitry Kovalev and Xun Qian and Peter Richtárik
- Abstract summary: We propose the first accelerated compressed gradient descent (ACGD) methods.
We show that ACGD enjoys the rate $O\Big((1+\omega)\sqrt{\frac{L}{\mu}}\log \frac{1}{\epsilon}\Big)$ for $\mu$-strongly convex problems.
We also propose a distributed variant of ACGD (called ADIANA) and prove the convergence rate $\widetilde{O}\Big(\omega+\sqrt{\frac{L}{\mu}}+\sqrt{\big(\frac{\omega}{n}+\sqrt{\frac{\omega}{n}}\big)\frac{\omega L}{\mu}}\Big)$.
- Score: 31.066505042748826
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the high communication cost in distributed and federated learning
problems, methods relying on compression of communicated messages are becoming
increasingly popular. While in other contexts the best performing gradient-type
methods invariably rely on some form of acceleration/momentum to reduce the
number of iterations, there are no methods which combine the benefits of both
gradient compression and acceleration. In this paper, we remedy this situation
and propose the first accelerated compressed gradient descent (ACGD) methods.
In the single machine regime, we prove that ACGD enjoys the rate
$O\Big((1+\omega)\sqrt{\frac{L}{\mu}}\log \frac{1}{\epsilon}\Big)$ for
$\mu$-strongly convex problems and
$O\Big((1+\omega)\sqrt{\frac{L}{\epsilon}}\Big)$ for convex problems,
respectively, where $\omega$ is the compression parameter. Our results improve
upon the existing non-accelerated rates $O\Big((1+\omega)\frac{L}{\mu}\log
\frac{1}{\epsilon}\Big)$ and $O\Big((1+\omega)\frac{L}{\epsilon}\Big)$,
respectively, and recover the optimal rates of accelerated gradient descent as
a special case when no compression ($\omega=0$) is applied. We further propose
a distributed variant of ACGD (called ADIANA) and prove the convergence rate
$\widetilde{O}\Big(\omega+\sqrt{\frac{L}{\mu}}+\sqrt{\big(\frac{\omega}{n}+\sqrt{\frac{\omega}{n}}\big)\frac{\omega
L}{\mu}}\Big)$, where $n$ is the number of devices/workers and $\widetilde{O}$
hides the logarithmic factor $\log \frac{1}{\epsilon}$. This improves upon the
previous best result $\widetilde{O}\Big(\omega + \frac{L}{\mu}+\frac{\omega
L}{n\mu} \Big)$ achieved by the DIANA method of Mishchenko et al. (2019).
Finally, we conduct several experiments on real-world datasets which
corroborate our theoretical results and confirm the practical superiority of
our accelerated methods.
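The compression parameter $\omega$ above refers to the standard unbiased compression model: a randomized operator $\mathcal{C}$ with $\mathbb{E}[\mathcal{C}(x)]=x$ and $\mathbb{E}\|\mathcal{C}(x)-x\|^2\le\omega\|x\|^2$. Random-$k$ sparsification with rescaling is the canonical example, giving $\omega=d/k-1$. The sketch below is only an illustration under these assumptions (not the authors' ACGD/ADIANA code): it shows such a compressor inside a plain, non-accelerated compressed gradient descent loop, with a conservative heuristic step size $1/((1+\omega)L)$ and a toy quadratic objective chosen here for demonstration.

```python
import numpy as np

def rand_k_compress(x: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Unbiased random-k sparsification: keep k coordinates, rescale by d/k.

    Satisfies E[C(x)] = x and E||C(x) - x||^2 = (d/k - 1) ||x||^2,
    i.e. omega = d/k - 1 in the notation of the abstract.
    """
    d = x.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(x)
    out[idx] = (d / k) * x[idx]   # rescaling keeps the compressor unbiased
    return out

def compressed_gd(grad, x0, k, L, num_steps=5000, seed=0):
    """Plain (non-accelerated) compressed gradient descent.

    The step size 1/((1+omega)L) is a conservative heuristic that accounts
    for the extra variance injected by compression; ACGD adds a Nesterov-style
    extrapolation sequence on top of steps like this one.
    """
    rng = np.random.default_rng(seed)
    d = x0.size
    omega = d / k - 1.0
    lr = 1.0 / ((1.0 + omega) * L)
    x = x0.copy()
    for _ in range(num_steps):
        g = rand_k_compress(grad(x), k, rng)  # only k coordinates would be communicated
        x = x - lr * g
    return x

if __name__ == "__main__":
    # Toy mu-strongly convex quadratic f(x) = 0.5 * sum_i a_i x_i^2 with
    # a_i = 1,...,d, so mu = 1 and L = d.
    d, k = 100, 10                 # omega = d/k - 1 = 9
    a = np.arange(1, d + 1, dtype=float)
    grad = lambda x: a * x         # gradient of the diagonal quadratic
    x = compressed_gd(grad, np.ones(d), k=k, L=float(d))
    print("final distance to optimum:", np.linalg.norm(x))
```

For instance, with $\omega = 9$ and $L/\mu = 10^4$, the non-accelerated bound $O\big((1+\omega)\frac{L}{\mu}\log\frac{1}{\epsilon}\big)$ scales like $10^5\log\frac{1}{\epsilon}$, while the accelerated bound $O\big((1+\omega)\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\big)$ scales like $10^3\log\frac{1}{\epsilon}$; with $\omega=0$ (i.e., $k=d$) the compressor is the identity, the loop reduces to ordinary gradient descent, and the accelerated rates recover those of accelerated gradient descent.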
Related papers
- On the Complexity of Finite-Sum Smooth Optimization under the
Polyak-Łojasiewicz Condition [14.781921087738967]
This paper considers the optimization problem of the form $\min_{{\bf x}\in\mathbb{R}^d} f({\bf x}) \triangleq \frac{1}{n}\sum_{i=1}^{n} f_i({\bf x})$, where $f(\cdot)$ satisfies the Polyak-Łojasiewicz (PL) condition with parameter $\mu$ and $\{f_i(\cdot)\}_{i=1}^{n}$ is $L$-mean-squared smooth.
arXiv Detail & Related papers (2024-02-04T17:14:53Z) - On the $O(\frac{\sqrt{d}}{T^{1/4}})$ Convergence Rate of RMSProp and Its Momentum Extension Measured by $\ell_1$ Norm [59.65871549878937]
This paper considers RMSProp and its momentum extension and establishes the convergence rate $\frac{1}{T}\sum_{k=1}^{T}\mathbb{E}\big[\|\nabla f(x_k)\|_1\big] \leq O\big(\frac{\sqrt{d}}{T^{1/4}}\big)$.
Our convergence rate matches the lower bound with respect to all the coefficients except the dimension $d$.
Our convergence rate can be considered to be analogous to the $\frac{1}{T}\sum_{k=1}^{T}$.
arXiv Detail & Related papers (2024-02-01T07:21:32Z) - A Unified Framework for Uniform Signal Recovery in Nonlinear Generative
Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - The Approximate Degree of DNF and CNF Formulas [95.94432031144716]
For every $\delta>0$, we construct CNF and DNF formulas of polynomial size with approximate degree $\Omega(n^{1-\delta})$, essentially matching the trivial upper bound of $n$.
We show that for every $\delta>0$, these models require $\Omega(n^{1-\delta})$, $\Omega\big((n/4^{k}k^{2})^{1-\delta}\big)$, and $\Omega\big((n/4^{k}k^{2})^{1-\delta}\big)$, respectively.
arXiv Detail & Related papers (2022-09-04T10:01:39Z) - CANITA: Faster Rates for Distributed Convex Optimization with
Communication Compression [17.259824817932294]
We propose a compressed and accelerated gradient method for distributed optimization, which we call CANITA.
Our results show that as long as the number of devices $n$ is large, CANITA achieves the faster convergence rate $O\Big(\sqrt{\frac{L}{\epsilon}}\Big)$, i.e., the number of communication rounds is $O\Big(\sqrt{\frac{L}{\epsilon}}\Big)$.
arXiv Detail & Related papers (2021-07-20T13:01:56Z) - Accelerated Gradient Tracking over Time-varying Graphs for Decentralized Optimization [59.65871549878937]
We prove that the practical single-loop accelerated gradient tracking needs $O\big(\big(\frac{\gamma}{1-\sigma_\gamma}\big)^{2}\sqrt{\frac{L}{\epsilon}}\big)$.
Our convergence rates improve significantly over the ones of $O\big(\frac{1}{\epsilon^{5/7}}\big)$ and $O\big(\big(\frac{L}{\mu}\big)^{5/7}\frac{1}{(1-\sigma)^{1.5}}\log\frac{1}{\epsilon}\big)$.
arXiv Detail & Related papers (2021-04-06T15:34:14Z) - ANITA: An Optimal Loopless Accelerated Variance-Reduced Gradient Method [17.259824817932294]
We propose a novel accelerated variance-reduced method called ANITA for finite-sum optimization.
In the general convex setting, ANITA achieves the convergence result $O\big(n\min\big\{1+\log\frac{1}{\epsilon\sqrt{n}},\,\log\sqrt{n}\big\}+\sqrt{\frac{nL}{\epsilon}}\big)$.
In the strongly convex setting, we also show that ANITA can achieve the optimal convergence result $O\Big(\big(n+\sqrt{\frac{nL}{\mu}}\big)\log\frac{nL}{\mu}\Big)$, matching the lower bound.
arXiv Detail & Related papers (2021-03-21T08:14:40Z) - Hybrid Stochastic-Deterministic Minibatch Proximal Gradient:
Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
We show that HSDMPG can attain an error of order $\mathcal{O}\big(1/\sqrt{tn}\big)$, which is at the order of the excess error of a learning model, and we prove the same guarantee for the loss functions considered.
arXiv Detail & Related papers (2020-09-18T02:18:44Z) - Revisiting EXTRA for Smooth Distributed Optimization [70.65867695317633]
We give a sharp complexity analysis for EXTRA with the improved complexity $O\left(\left(\frac{L}{\mu}+\frac{1}{1-\sigma_2(W)}\right)\log\frac{1}{\epsilon(1-\sigma_2(W))}\right)$.
The communication complexities of the accelerated EXTRA are only worse by factors of $\left(\log\frac{L}{\mu(1-\sigma_2(W))}\right)$ and $\left(\log\frac{1}{\epsilon(1-\sigma_2(W))}\right)$.
arXiv Detail & Related papers (2020-02-24T08:07:08Z)