CANITA: Faster Rates for Distributed Convex Optimization with
Communication Compression
- URL: http://arxiv.org/abs/2107.09461v1
- Date: Tue, 20 Jul 2021 13:01:56 GMT
- Title: CANITA: Faster Rates for Distributed Convex Optimization with
Communication Compression
- Authors: Zhize Li, Peter Richtárik
- Abstract summary: We propose a \emph{compressed and accelerated} gradient method for distributed optimization, which we call CANITA.
Our results show that as long as the number of devices $n$ is large, CANITA achieves the faster convergence rate $O\Big(\sqrt{\frac{L}{\epsilon}}\Big)$, i.e., the number of communication rounds is $O\Big(\sqrt{\frac{L}{\epsilon}}\Big)$.
- Score: 17.259824817932294
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the high communication cost in distributed and federated learning,
methods relying on compressed communication are becoming increasingly popular.
Besides, the best theoretically and practically performing gradient-type
methods invariably rely on some form of acceleration/momentum to reduce the
number of communications (faster convergence), e.g., Nesterov's accelerated
gradient descent (Nesterov, 2004) and Adam (Kingma and Ba, 2014). In order to
combine the benefits of communication compression and convergence acceleration,
we propose a \emph{compressed and accelerated} gradient method for distributed
optimization, which we call CANITA. Our CANITA achieves the \emph{first
accelerated rate}
$O\bigg(\sqrt{\Big(1+\sqrt{\frac{\omega^3}{n}}\Big)\frac{L}{\epsilon}} +
\omega\big(\frac{1}{\epsilon}\big)^{\frac{1}{3}}\bigg)$, which improves upon
the state-of-the-art non-accelerated rate
$O\left((1+\frac{\omega}{n})\frac{L}{\epsilon} +
\frac{\omega^2+n}{\omega+n}\frac{1}{\epsilon}\right)$ of DIANA (Khaled et al.,
2020b) for distributed general convex problems, where $\epsilon$ is the target
error, $L$ is the smoothness parameter of the objective, $n$ is the number of
machines/devices, and $\omega$ is the compression parameter (larger $\omega$
means more compression can be applied, and no compression implies $\omega=0$).
Our results show that as long as the number of devices $n$ is large (often true
in distributed/federated learning), or the compression $\omega$ is not very
high, CANITA achieves the faster convergence rate
$O\Big(\sqrt{\frac{L}{\epsilon}}\Big)$, i.e., the number of communication
rounds is $O\Big(\sqrt{\frac{L}{\epsilon}}\Big)$ (vs.
$O\big(\frac{L}{\epsilon}\big)$ achieved by previous works). As a result,
CANITA enjoys the advantages of both compression (compressed communication in
each round) and acceleration (much fewer communication rounds).
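To make the compression parameter $\omega$ concrete, here is a minimal sketch (not from the paper) of a standard unbiased rand-$k$ sparsification compressor, for which $\omega = d/k - 1$, together with an illustrative comparison of the $O\big(\sqrt{L/\epsilon}\big)$ and $O\big(L/\epsilon\big)$ round counts. All function names and numerical values below are our own illustration, assuming only the standard $\omega$-compressor definition $\mathbb{E}\|\mathcal{C}(x)-x\|^2 \le \omega\|x\|^2$.

```python
import numpy as np

def rand_k_compress(x: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Unbiased rand-k sparsification: keep k random coordinates, scale by d/k.

    Satisfies E[C(x)] = x and E||C(x) - x||^2 = (d/k - 1) ||x||^2,
    i.e. it is an omega-compressor with omega = d/k - 1.
    """
    d = x.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(x)
    out[idx] = (d / k) * x[idx]  # rescaling makes the compressor unbiased
    return out

rng = np.random.default_rng(0)
d, k = 1000, 10
omega = d / k - 1  # omega = 99: only 1% of the coordinates are communicated
x = rng.standard_normal(d)

# Empirical check of unbiasedness and of the variance parameter omega.
samples = np.stack([rand_k_compress(x, k, rng) for _ in range(5000)])
bias = np.linalg.norm(samples.mean(axis=0) - x) / np.linalg.norm(x)
var_ratio = ((samples - x) ** 2).sum(axis=1).mean() / (x @ x)
print(f"relative bias ~ {bias:.3f} (shrinks as the number of samples grows)")
print(f"empirical variance ratio ~ {var_ratio:.1f}, omega = {omega:.0f}")

# Illustrative round counts (constants ignored):
# accelerated sqrt(L/eps) vs non-accelerated L/eps.
L, eps = 10.0, 1e-4
print(f"sqrt(L/eps) ~ {np.sqrt(L / eps):.0f} rounds vs L/eps ~ {L / eps:.0f} rounds")
```

In the regime highlighted by the abstract (large $n$ or moderate $\omega$), the $\sqrt{\omega^3/n}$ term in CANITA's rate is dominated, which is why the round count drops to the accelerated $O\big(\sqrt{L/\epsilon}\big)$ order compared in the last line.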
Related papers
- Asynchronous Approximate Agreement with Quadratic Communication [23.27199615640474]
We consider an asynchronous network of $n$ message-sending parties, up to $t$ of which are byzantine.
We study approximate agreement, where the parties obtain approximately equal outputs in the convex hull of their inputs.
This takes $\Theta(n^2)$ messages per reliable broadcast, or $\Theta(n^3)$ messages per iteration.
arXiv Detail & Related papers (2024-08-10T09:03:06Z) - Near-Optimal Distributed Minimax Optimization under the Second-Order Similarity [22.615156512223763]
We propose a variance-reduced optimistic gradient sliding (SVOGS) method, which takes advantage of the finite-sum structure in the objective.
We prove a communication round complexity of $\mathcal{O}(\delta D^2/\varepsilon)$, a communication complexity of $\mathcal{O}\big(n+\sqrt{n}\,\delta D^2/\varepsilon\big)$, and local calls of $\tilde{\mathcal{O}}\big((n+\sqrt{n}\,\delta+L)D^2/\varepsilon\big)$.
arXiv Detail & Related papers (2024-05-25T08:34:49Z) - Nearly Optimal Regret for Decentralized Online Convex Optimization [53.433398074919]
Decentralized online convex optimization (D-OCO) aims to minimize a sequence of global loss functions using only local computations and communications.
We develop novel D-OCO algorithms that can respectively reduce the regret bounds for convex and strongly convex functions.
Our algorithms are nearly optimal in terms of $T$, $n$, and $\rho$.
arXiv Detail & Related papers (2024-02-14T13:44:16Z) - On the $O(\frac{\sqrt{d}}{T^{1/4}})$ Convergence Rate of RMSProp and Its Momentum Extension Measured by $\ell_1$ Norm [59.65871549878937]
This paper considers RMSProp and its momentum extension and establishes the convergence rate of $\frac{1}{T}\sum_{k=1}^{T} E\big[\|\nabla f(x^k)\|_1\big] \le O\big(\frac{\sqrt{d}}{T^{1/4}}\big)$.
Our convergence rate matches the lower bound with respect to all the coefficients except the dimension $d$.
Our convergence rate can be considered to be analogous to the $\frac{1}{T}\sum_{k=1}^{T} E\big[\|\nabla f(x^k)\|_2\big] \le O\big(\frac{1}{T^{1/4}}\big)$ rate of SGD.
arXiv Detail & Related papers (2024-02-01T07:21:32Z) - The Approximate Degree of DNF and CNF Formulas [95.94432031144716]
For every $\delta>0$, we construct CNF and DNF formulas of polynomial size with approximate degree $\Omega(n^{1-\delta})$, essentially matching the trivial upper bound of $n$.
We show that for every $\delta>0$, these models require $\Omega(n^{1-\delta})$, $\Omega\big((n/4^k k^2)^{1-\delta}\big)$, and $\Omega\big((n/4^k k^2)^{1-\delta}\big)$, respectively.
arXiv Detail & Related papers (2022-09-04T10:01:39Z) - Faster Rates for Compressed Federated Learning with Client-Variance
Reduction [23.169998435268504]
We show that COFIG and FRECON converge within $O\big(\frac{(1+\omega)\sqrt{N}}{S\epsilon^2}\big)$ communication rounds.
In the convex setting, COFIG converges within $O\big(\frac{(1+\omega)\sqrt{N}}{S\epsilon^2}\big)$ communication rounds, which, to the best of our knowledge, is the first result.
arXiv Detail & Related papers (2021-12-24T16:28:18Z) - Accelerated Gradient Tracking over Time-varying Graphs for Decentralized Optimization [59.65871549878937]
We prove that the practical single loop accelerated gradient tracking needs $O\big(\big(\frac{\gamma}{1-\sigma_{\gamma}}\big)^2\sqrt{\frac{L}{\epsilon}}\big)$.
Our convergence rates improve significantly over the ones of $O\big(\frac{1}{\epsilon^{5/7}}\big)$ and $O\big(\big(\frac{L}{\mu}\big)^{5/7}\frac{1}{(1-\sigma)^{1.5}}\log\frac{1}{\epsilon}\big)$.
arXiv Detail & Related papers (2021-04-06T15:34:14Z) - ANITA: An Optimal Loopless Accelerated Variance-Reduced Gradient Method [17.259824817932294]
We propose a novel accelerated variance-reduced method called ANITA for finite-sum optimization.
In the general convex setting, ANITA achieves the convergence result $O\big(n\min\big\{1+\log\frac{1}{\epsilon\sqrt{n}},\,\log\sqrt{n}\big\}+\sqrt{\frac{nL}{\epsilon}}\big)$.
In the strongly convex setting, we also show that ANITA can achieve the optimal convergence result $O\Big(\big(n+\sqrt{\frac{nL}{\mu}}\big)\log\frac{1}{\epsilon}\Big)$, matching the lower bound.
arXiv Detail & Related papers (2021-03-21T08:14:40Z) - An Optimal Separation of Randomized and Quantum Query Complexity [67.19751155411075]
We prove that for every decision tree, the absolute values of the Fourier coefficients of a given order $\ell$ sum to at most $c^{\ell}\sqrt{\binom{d}{\ell}(1+\log n)^{\ell-1}}$, where $n$ is the number of variables, $d$ is the tree depth, and $c>0$ is an absolute constant.
arXiv Detail & Related papers (2020-08-24T06:50:57Z) - Acceleration for Compressed Gradient Descent in Distributed and
Federated Optimization [31.066505042748826]
We propose the first accelerated compressed gradient descent (ACGD) methods.
We show that ACGD enjoys the rate $O\Big((1+\omega)\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\Big)$ for $\mu$-strongly convex problems.
We also propose a distributed variant of ACGD (called ADIANA) and prove the convergence rate $\widetilde{O}\Big(\omega+\sqrt{\frac{L}{\mu}}+\sqrt{\big(\frac{\omega}{n}+\sqrt{\frac{\omega}{n}}\big)\frac{\omega L}{\mu}}\Big)$.
arXiv Detail & Related papers (2020-02-26T09:03:23Z) - Curse of Dimensionality on Randomized Smoothing for Certifiable
Robustness [151.67113334248464]
We show that extending the smoothing technique to defend against other attack models can be challenging.
We present experimental results on CIFAR to validate our theory.
arXiv Detail & Related papers (2020-02-08T22:02:14Z)