CANITA: Faster Rates for Distributed Convex Optimization with
Communication Compression
- URL: http://arxiv.org/abs/2107.09461v1
- Date: Tue, 20 Jul 2021 13:01:56 GMT
- Title: CANITA: Faster Rates for Distributed Convex Optimization with
Communication Compression
- Authors: Zhize Li, Peter Richtárik
- Abstract summary: We propose a \emph{compressed and accelerated} gradient method for distributed optimization, which we call CANITA.
Our results show that as long as the number of devices $n$ is large, CANITA achieves the faster convergence rate $O\Big(\sqrt{\frac{L}{\epsilon}}\Big)$, i.e., the number of communication rounds is $O\Big(\sqrt{\frac{L}{\epsilon}}\Big)$.
- Score: 17.259824817932294
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the high communication cost in distributed and federated learning,
methods relying on compressed communication are becoming increasingly popular.
Besides, the best theoretically and practically performing gradient-type
methods invariably rely on some form of acceleration/momentum to reduce the
number of communications (faster convergence), e.g., Nesterov's accelerated
gradient descent (Nesterov, 2004) and Adam (Kingma and Ba, 2014). In order to
combine the benefits of communication compression and convergence acceleration,
we propose a \emph{compressed and accelerated} gradient method for distributed
optimization, which we call CANITA. Our CANITA achieves the \emph{first
accelerated rate}
$O\bigg(\sqrt{\Big(1+\sqrt{\frac{\omega^3}{n}}\Big)\frac{L}{\epsilon}} +
\omega\big(\frac{1}{\epsilon}\big)^{\frac{1}{3}}\bigg)$, which improves upon
the state-of-the-art non-accelerated rate
$O\left((1+\frac{\omega}{n})\frac{L}{\epsilon} +
\frac{\omega^2+n}{\omega+n}\frac{1}{\epsilon}\right)$ of DIANA (Khaled et al.,
2020b) for distributed general convex problems, where $\epsilon$ is the target
error, $L$ is the smoothness parameter of the objective, $n$ is the number of
machines/devices, and $\omega$ is the compression parameter (larger $\omega$
means more compression can be applied, and no compression implies $\omega=0$).
Our results show that as long as the number of devices $n$ is large (often true
in distributed/federated learning), or the compression $\omega$ is not very
high, CANITA achieves the faster convergence rate
$O\Big(\sqrt{\frac{L}{\epsilon}}\Big)$, i.e., the number of communication
rounds is $O\Big(\sqrt{\frac{L}{\epsilon}}\Big)$ (vs.
$O\big(\frac{L}{\epsilon}\big)$ achieved by previous works). As a result,
CANITA enjoys the advantages of both compression (compressed communication in
each round) and acceleration (much fewer communication rounds).
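To make the compression parameter $\omega$ concrete, here is a minimal sketch (not from the paper) of a standard unbiased rand-$k$ sparsification compressor, for which $\omega = d/k - 1$, together with an illustrative comparison of the $O\big(\sqrt{L/\epsilon}\big)$ and $O\big(L/\epsilon\big)$ round counts. All function names and numerical values below are our own illustration, assuming only the standard $\omega$-compressor definition $\mathbb{E}\|\mathcal{C}(x)-x\|^2 \le \omega\|x\|^2$.

```python
import numpy as np

def rand_k_compress(x: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Unbiased rand-k sparsification: keep k random coordinates, scale by d/k.

    Satisfies E[C(x)] = x and E||C(x) - x||^2 = (d/k - 1) ||x||^2,
    i.e. it is an omega-compressor with omega = d/k - 1.
    """
    d = x.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(x)
    out[idx] = (d / k) * x[idx]  # rescaling makes the compressor unbiased
    return out

rng = np.random.default_rng(0)
d, k = 1000, 10
omega = d / k - 1  # omega = 99: only 1% of the coordinates are communicated
x = rng.standard_normal(d)

# Empirical check of unbiasedness and of the variance parameter omega.
samples = np.stack([rand_k_compress(x, k, rng) for _ in range(5000)])
bias = np.linalg.norm(samples.mean(axis=0) - x) / np.linalg.norm(x)
var_ratio = ((samples - x) ** 2).sum(axis=1).mean() / (x @ x)
print(f"relative bias ~ {bias:.3f} (shrinks as the number of samples grows)")
print(f"empirical variance ratio ~ {var_ratio:.1f}, omega = {omega:.0f}")

# Illustrative round counts (constants ignored):
# accelerated sqrt(L/eps) vs non-accelerated L/eps.
L, eps = 10.0, 1e-4
print(f"sqrt(L/eps) ~ {np.sqrt(L / eps):.0f} rounds vs L/eps ~ {L / eps:.0f} rounds")
```

In the regime highlighted by the abstract (large $n$ or moderate $\omega$), the $\sqrt{\omega^3/n}$ term in CANITA's rate is dominated, which is why the round count drops to the accelerated $O\big(\sqrt{L/\epsilon}\big)$ order compared in the last line.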
Related papers
- Asynchronous Approximate Agreement with Quadratic Communication [23.27199615640474]
We consider an asynchronous network of $n$ message-sending parties, up to $t$ of which are byzantine.
We study approximate agreement, where the parties obtain approximately equal outputs in the convex hull of their inputs.
This takes $\Theta(n^2)$ messages per reliable broadcast, or $\Theta(n^3)$ messages per iteration.
arXiv Detail & Related papers (2024-08-10T09:03:06Z) - Near-Optimal Distributed Minimax Optimization under the Second-Order Similarity [22.615156512223763]
We propose a variance-reduced optimistic gradient sliding (SVOGS) method, which takes advantage of the finite-sum structure in the objective.
We prove a communication round complexity of $\mathcal{O}(\delta D^2/\varepsilon)$, a communication complexity of $\mathcal{O}\big(n+\sqrt{n}\,\delta D^2/\varepsilon\big)$, and local calls of $\tilde{\mathcal{O}}\big((n+\sqrt{n}\,\delta+L)D^2/\varepsilon\big)$.
arXiv Detail & Related papers (2024-05-25T08:34:49Z) - Nearly Optimal Regret for Decentralized Online Convex Optimization [53.433398074919]
Decentralized online convex optimization (D-OCO) aims to minimize a sequence of global loss functions using only local computations and communications.
We develop novel D-OCO algorithms that can respectively reduce the regret bounds for convex and strongly convex functions.
Our algorithms are nearly optimal in terms of $T$, $n$, and $\rho$.
arXiv Detail & Related papers (2024-02-14T13:44:16Z) - On the $O(\frac{\sqrt{d}}{T^{1/4}})$ Convergence Rate of RMSProp and Its Momentum Extension Measured by $\ell_1$ Norm [59.65871549878937]
This paper considers RMSProp and its momentum extension and establishes the convergence rate of $\frac{1}{T}\sum_{k=1}^{T} E\big[\|\nabla f(x^k)\|_1\big] \le O\big(\frac{\sqrt{d}}{T^{1/4}}\big)$.
Our convergence rate matches the lower bound with respect to all the coefficients except the dimension $d$.
Our convergence rate can be considered to be analogous to the $\frac{1}{T}\sum_{k=1}^{T} E\big[\|\nabla f(x^k)\|_2\big] \le O\big(\frac{1}{T^{1/4}}\big)$ rate of SGD.
arXiv Detail & Related papers (2024-02-01T07:21:32Z) - The Approximate Degree of DNF and CNF Formulas [95.94432031144716]
For every $\delta>0$, we construct CNF and DNF formulas of polynomial size with approximate degree $\Omega(n^{1-\delta})$, essentially matching the trivial upper bound of $n$.
We show that for every $\delta>0$, these models require $\Omega(n^{1-\delta})$, $\Omega\big((n/4^k k^2)^{1-\delta}\big)$, and $\Omega\big((n/4^k k^2)^{1-\delta}\big)$, respectively.
arXiv Detail & Related papers (2022-09-04T10:01:39Z) - Faster Rates for Compressed Federated Learning with Client-Variance
Reduction [23.169998435268504]
We show that COFIG and FRECON converge within $O\big(\frac{(1+\omega)\sqrt{N}}{S\epsilon^2}\big)$ communication rounds.
In the convex setting, COFIG converges within $O\big(\frac{(1+\omega)\sqrt{N}}{S\epsilon^2}\big)$ communication rounds, which, to the best of our knowledge, is the first result.
arXiv Detail & Related papers (2021-12-24T16:28:18Z) - Accelerated Gradient Tracking over Time-varying Graphs for Decentralized Optimization [59.65871549878937]
We prove that the practical single loop accelerated gradient tracking needs $O\big(\big(\frac{\gamma}{1-\sigma_{\gamma}}\big)^2\sqrt{\frac{L}{\epsilon}}\big)$.
Our convergence rates improve significantly over the ones of $O\big(\frac{1}{\epsilon^{5/7}}\big)$ and $O\big(\big(\frac{L}{\mu}\big)^{5/7}\frac{1}{(1-\sigma)^{1.5}}\log\frac{1}{\epsilon}\big)$.
arXiv Detail & Related papers (2021-04-06T15:34:14Z) - ANITA: An Optimal Loopless Accelerated Variance-Reduced Gradient Method [17.259824817932294]
We propose a novel accelerated variance-reduced method called ANITA for finite-sum optimization.
In the general convex setting, ANITA achieves the convergence result $O\big(n\min\big\{1+\log\frac{1}{\epsilon\sqrt{n}},\,\log\sqrt{n}\big\}+\sqrt{\frac{nL}{\epsilon}}\big)$.
In the strongly convex setting, we also show that ANITA can achieve the optimal convergence result $O\Big(\big(n+\sqrt{\frac{nL}{\mu}}\big)\log\frac{1}{\epsilon}\Big)$, matching the lower bound.
arXiv Detail & Related papers (2021-03-21T08:14:40Z) - An Optimal Separation of Randomized and Quantum Query Complexity [67.19751155411075]
We prove that for every decision tree, the absolute values of the Fourier coefficients of a given order $\ell$ sum to at most $c^{\ell}\sqrt{\binom{d}{\ell}(1+\log n)^{\ell-1}}$, where $n$ is the number of variables, $d$ is the tree depth, and $c>0$ is an absolute constant.
arXiv Detail & Related papers (2020-08-24T06:50:57Z) - Acceleration for Compressed Gradient Descent in Distributed and
Federated Optimization [31.066505042748826]
We propose the first accelerated compressed gradient descent (ACGD) methods.
We show that ACGD enjoys the rate $O\Big((1+\omega)\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\Big)$ for $\mu$-strongly convex problems.
We also propose a distributed variant of ACGD (called ADIANA) and prove the convergence rate $\widetilde{O}\Big(\omega+\sqrt{\frac{L}{\mu}}+\sqrt{\big(\frac{\omega}{n}+\sqrt{\frac{\omega}{n}}\big)\frac{\omega L}{\mu}}\Big)$.
arXiv Detail & Related papers (2020-02-26T09:03:23Z) - Curse of Dimensionality on Randomized Smoothing for Certifiable
Robustness [151.67113334248464]
We show that extending the smoothing technique to defend against other attack models can be challenging.
We present experimental results on CIFAR to validate our theory.
arXiv Detail & Related papers (2020-02-08T22:02:14Z)