ProxSkip: Yes! Local Gradient Steps Provably Lead to Communication
Acceleration! Finally!
- URL: http://arxiv.org/abs/2202.09357v2
- Date: Fri, 24 Mar 2023 11:39:47 GMT
- Title: ProxSkip: Yes! Local Gradient Steps Provably Lead to Communication
Acceleration! Finally!
- Authors: Konstantin Mishchenko, Grigory Malinovsky, Sebastian Stich and Peter
Richtárik
- Abstract summary: We introduce ProxSkip, a surprisingly simple and provably efficient method for minimizing the sum of a smooth ($f$) and an expensive nonsmooth ($\psi$) function.
Our main motivation comes from federated learning, where evaluation of the gradient operator corresponds to taking a local GD step independently on all devices.
- Score: 6.170898159041277
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce ProxSkip -- a surprisingly simple and provably efficient method
for minimizing the sum of a smooth ($f$) and an expensive nonsmooth proximable
($\psi$) function. The canonical approach to solving such problems is via the
proximal gradient descent (ProxGD) algorithm, which is based on the evaluation
of the gradient of $f$ and the prox operator of $\psi$ in each iteration. In
this work we are specifically interested in the regime in which the evaluation
of prox is costly relative to the evaluation of the gradient, which is the case
in many applications. ProxSkip allows for the expensive prox operator to be
skipped in most iterations: while its iteration complexity is
$\mathcal{O}\left(\kappa \log \frac{1}{\varepsilon}\right)$, where $\kappa$ is
the condition number of $f$, the number of prox evaluations is
$\mathcal{O}\left(\sqrt{\kappa} \log \frac{1}{\varepsilon}\right)$ only. Our
main motivation comes from federated learning, where evaluation of the gradient
operator corresponds to taking a local GD step independently on all devices,
and evaluation of prox corresponds to (expensive) communication in the form of
gradient averaging. In this context, ProxSkip offers an effective acceleration
of communication complexity. Unlike other local gradient-type methods, such as
FedAvg, SCAFFOLD, S-Local-GD and FedLin, whose theoretical communication
complexity is worse than, or at best matching, that of vanilla GD in the
heterogeneous data regime, we obtain a provable and large improvement without
any heterogeneity-bounding assumptions.
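The abstract describes the mechanism only at a high level; the sketch below illustrates a ProxSkip-style loop in which the expensive prox is evaluated only with probability $p$, with a control-variate shift compensating for the skipped steps. The function names (grad_f, prox_psi), the exact shift update, and the parameter choices are illustrative assumptions based on the description above, not a verbatim reproduction of the paper's Algorithm 1.

```python
import numpy as np

def proxskip_sketch(grad_f, prox_psi, x0, gamma, p, n_iters, seed=0):
    """Illustrative ProxSkip-style loop (hypothetical simplification).

    grad_f(x)      -- gradient of the smooth part f
    prox_psi(v, s) -- prox operator of psi with step size s
    gamma          -- step size for the local gradient steps
    p              -- probability of evaluating the (expensive) prox
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    h = np.zeros_like(x)                      # control-variate shift
    for _ in range(n_iters):
        x_hat = x - gamma * (grad_f(x) - h)   # cheap gradient step; prox skipped
        if rng.random() < p:                  # rare, expensive prox step
            x = prox_psi(x_hat - (gamma / p) * h, gamma / p)
        else:
            x = x_hat
        h = h + (p / gamma) * (x - x_hat)     # shift is unchanged when the prox is skipped
    return x
```

With $p \approx 1/\sqrt{\kappa}$ and roughly $\kappa \log \frac{1}{\varepsilon}$ iterations, the expected number of prox evaluations (i.e., communication rounds in the federated setting) is about $p \cdot \kappa \log \frac{1}{\varepsilon} = \sqrt{\kappa} \log \frac{1}{\varepsilon}$, matching the accelerated communication complexity quoted in the abstract.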
Related papers
- ReSQueing Parallel and Private Stochastic Convex Optimization [59.53297063174519]
We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density.
We develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings.
arXiv Detail & Related papers (2023-01-01T18:51:29Z) - Decomposable Non-Smooth Convex Optimization with Nearly-Linear Gradient
Oracle Complexity [15.18055488087588]
We give an algorithm that minimizes the above convex formulation to $\epsilon$-accuracy in $\widetilde{O}\left(\sum_{i=1}^{n} d_i \log(1/\epsilon)\right)$ gradient computations.
Our main technical contribution is an adaptive procedure to select an $f_i$ term at every iteration via a novel combination of cutting-plane and interior-point methods.
arXiv Detail & Related papers (2022-08-07T20:53:42Z) - DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization [0.0]
DADAO is the first decentralized, accelerated, asynchronous, primal, first-order algorithm to minimize a sum of $L$-smooth and $\mu$-strongly convex functions distributed over a given network of size $n$.
We show that our algorithm requires $\mathcal{O}\left(n\sqrt{\chi}\sqrt{\frac{L}{\mu}}\log\left(\frac{1}{\epsilon}\right)\right)$ local and only $\mathcal{O}\left(n\sqrt{\chi}\sqrt{\frac{L}{\mu}}\log\right.$(
arXiv Detail & Related papers (2022-07-26T08:47:54Z) - Optimal Gradient Sliding and its Application to Distributed Optimization
Under Similarity [121.83085611327654]
We study structured convex optimization problems with additive objective $r := p + q$, where $r$ is $\mu$-strongly convex, under a similarity condition.
We propose a method whose complexity is measured in terms of master-to-agents communication and local calls.
The proposed method is much sharper than the $\mathcal{O}\left(\sqrt{L_q/\mu}\right)$ rate of existing methods.
arXiv Detail & Related papers (2022-05-30T14:28:02Z) - The First Optimal Algorithm for Smooth and
Strongly-Convex-Strongly-Concave Minimax Optimization [88.91190483500932]
In this paper, we revisit the smooth and strongly-convex-strongly-concave minimax optimization problem.
Existing state-of-the-art methods do not match the lower bound $\Omega\left(\sqrt{\kappa_x\kappa_y}\log^3(\kappa_x\kappa_y)\log\frac{1}{\epsilon}\right)$.
We fix this fundamental issue by providing the first algorithm with $\mathcal{O}\left(\sqrt{\kappa_x\kappa_y}\log\right.$
arXiv Detail & Related papers (2022-05-11T17:33:07Z) - Restarted Nonconvex Accelerated Gradient Descent: No More
Polylogarithmic Factor in the $O(\epsilon^{-7/4})$ Complexity [70.65867695317633]
We propose two simple accelerated gradient methods: restarted accelerated gradient descent (AGD) and the restarted heavy-ball (HB) method.
We establish that our methods achieve an $\mathcal{O}\left(\frac{1}{\epsilon^{7/4}}\right)$ number of gradient evaluations.
Our algorithms are simple in the sense that they only consist of Nesterov's classical AGD or Polyak's HB, as well as a restart mechanism.
arXiv Detail & Related papers (2022-01-27T10:04:04Z) - A One-bit, Comparison-Based Gradient Estimator [29.600975900977343]
We show how one can use tools from one-bit compressed sensing to construct a robust and reliable estimator of the normalized gradient.
We propose an algorithm, coined SCOBO, that uses this estimator within a gradient descent scheme.
arXiv Detail & Related papers (2020-10-06T05:01:38Z) - Variance Reduced EXTRA and DIGing and Their Optimal Acceleration for
Strongly Convex Decentralized Optimization [69.49313819343992]
We extend the widely used EXTRA and DIGing methods with variance reduction (VR), and propose two methods: VR-EXTRA and VR-DIGing.
The proposed VR-EXTRA requires $O\left((\kappa_s+n)\log\frac{1}{\epsilon}\right)$ gradient evaluations and $O\left((\kappa_b+\kappa_c)\log\frac{1}{\epsilon}\right)$ communication rounds.
The proposed VR-DIGing has a slightly higher communication cost of $O\left((\kappa_b+\kappa\right.$
arXiv Detail & Related papers (2020-09-09T15:48:44Z) - Gradient Free Minimax Optimization: Variance Reduction and Faster
Convergence [120.9336529957224]
In this paper, we consider the nonconvex-strongly-concave setting of the gradient-free minimax optimization problem.
We show that a novel zeroth-order variance reduced descent algorithm achieves the best known query complexity.
arXiv Detail & Related papers (2020-06-16T17:55:46Z)