ProxSkip: Yes! Local Gradient Steps Provably Lead to Communication
Acceleration! Finally!
- URL: http://arxiv.org/abs/2202.09357v2
- Date: Fri, 24 Mar 2023 11:39:47 GMT
- Title: ProxSkip: Yes! Local Gradient Steps Provably Lead to Communication
Acceleration! Finally!
- Authors: Konstantin Mishchenko, Grigory Malinovsky, Sebastian Stich and Peter
Richtárik
- Abstract summary: We introduce ProxSkip, a surprisingly simple and provably efficient method for minimizing the sum of a smooth ($f$) and an expensive nonsmooth ($\psi$) function.
Our main motivation comes from federated learning, where evaluation of the gradient operator corresponds to taking a local GD step independently on all devices.
- Score: 6.170898159041277
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce ProxSkip -- a surprisingly simple and provably efficient method
for minimizing the sum of a smooth ($f$) and an expensive nonsmooth proximable
($\psi$) function. The canonical approach to solving such problems is via the
proximal gradient descent (ProxGD) algorithm, which is based on the evaluation
of the gradient of $f$ and the prox operator of $\psi$ in each iteration. In
this work we are specifically interested in the regime in which the evaluation
of prox is costly relative to the evaluation of the gradient, which is the case
in many applications. ProxSkip allows for the expensive prox operator to be
skipped in most iterations: while its iteration complexity is
$\mathcal{O}\left(\kappa \log \frac{1}{\varepsilon}\right)$, where $\kappa$ is
the condition number of $f$, the number of prox evaluations is
$\mathcal{O}\left(\sqrt{\kappa} \log \frac{1}{\varepsilon}\right)$ only. Our
main motivation comes from federated learning, where evaluation of the gradient
operator corresponds to taking a local GD step independently on all devices,
and evaluation of prox corresponds to (expensive) communication in the form of
gradient averaging. In this context, ProxSkip offers an effective acceleration
of communication complexity. Unlike other local gradient-type methods, such as
FedAvg, SCAFFOLD, S-Local-GD and FedLin, whose theoretical communication
complexity is worse than, or at best matching, that of vanilla GD in the
heterogeneous data regime, we obtain a provable and large improvement without
any heterogeneity-bounding assumptions.
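The abstract describes the mechanism only at a high level; the sketch below illustrates a ProxSkip-style loop in which the expensive prox is evaluated only with probability $p$, with a control-variate shift compensating for the skipped steps. The function names (grad_f, prox_psi), the exact shift update, and the parameter choices are illustrative assumptions based on the description above, not a verbatim reproduction of the paper's Algorithm 1.

```python
import numpy as np

def proxskip_sketch(grad_f, prox_psi, x0, gamma, p, n_iters, seed=0):
    """Illustrative ProxSkip-style loop (hypothetical simplification).

    grad_f(x)      -- gradient of the smooth part f
    prox_psi(v, s) -- prox operator of psi with step size s
    gamma          -- step size for the local gradient steps
    p              -- probability of evaluating the (expensive) prox
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    h = np.zeros_like(x)                      # control-variate shift
    for _ in range(n_iters):
        x_hat = x - gamma * (grad_f(x) - h)   # cheap gradient step; prox skipped
        if rng.random() < p:                  # rare, expensive prox step
            x = prox_psi(x_hat - (gamma / p) * h, gamma / p)
        else:
            x = x_hat
        h = h + (p / gamma) * (x - x_hat)     # shift is unchanged when the prox is skipped
    return x
```

With $p \approx 1/\sqrt{\kappa}$ and roughly $\kappa \log \frac{1}{\varepsilon}$ iterations, the expected number of prox evaluations (i.e., communication rounds in the federated setting) is about $p \cdot \kappa \log \frac{1}{\varepsilon} = \sqrt{\kappa} \log \frac{1}{\varepsilon}$, matching the accelerated communication complexity quoted in the abstract.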
Related papers
- ReSQueing Parallel and Private Stochastic Convex Optimization [59.53297063174519]
We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density.
We develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings.
arXiv Detail & Related papers (2023-01-01T18:51:29Z) - Decomposable Non-Smooth Convex Optimization with Nearly-Linear Gradient
Oracle Complexity [15.18055488087588]
We give an algorithm that minimizes the above convex formulation to $\epsilon$-accuracy in $\widetilde{O}\left(\sum_{i=1}^{n} d_i \log(1/\epsilon)\right)$ gradient computations.
Our main technical contribution is an adaptive procedure to select an $f_i$ term at every iteration via a novel combination of cutting-plane and interior-point methods.
arXiv Detail & Related papers (2022-08-07T20:53:42Z) - DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization [0.0]
DADAO is the first decentralized, accelerated, asynchronous, primal, first-order algorithm to minimize a sum of $L$-smooth and $\mu$-strongly convex functions distributed over a given network of size $n$.
We show that our algorithm requires $\mathcal{O}\left(n\sqrt{\chi}\sqrt{\frac{L}{\mu}}\log\left(\frac{1}{\epsilon}\right)\right)$ local and only $\mathcal{O}\left(n\sqrt{\chi}\sqrt{\frac{L}{\mu}}\log\right.$(
arXiv Detail & Related papers (2022-07-26T08:47:54Z) - Optimal Gradient Sliding and its Application to Distributed Optimization
Under Similarity [121.83085611327654]
We study structured convex optimization problems with additive objective $r := p + q$, where $r$ is $\mu$-strongly convex, under a similarity condition.
We propose a method whose complexity is measured in terms of master-to-agents communication and local calls.
The proposed method is much sharper than the $\mathcal{O}\left(\sqrt{L_q/\mu}\right)$ rate of existing methods.
arXiv Detail & Related papers (2022-05-30T14:28:02Z) - The First Optimal Algorithm for Smooth and
Strongly-Convex-Strongly-Concave Minimax Optimization [88.91190483500932]
In this paper, we revisit the smooth and strongly-convex-strongly-concave minimax optimization problem.
Existing state-of-the-art methods do not match the lower bound $\Omega\left(\sqrt{\kappa_x\kappa_y}\log^3(\kappa_x\kappa_y)\log\frac{1}{\epsilon}\right)$.
We fix this fundamental issue by providing the first algorithm with $\mathcal{O}\left(\sqrt{\kappa_x\kappa_y}\log\right.$
arXiv Detail & Related papers (2022-05-11T17:33:07Z) - Restarted Nonconvex Accelerated Gradient Descent: No More
Polylogarithmic Factor in the $O(\epsilon^{-7/4})$ Complexity [70.65867695317633]
We propose two simple accelerated gradient methods: restarted accelerated gradient descent (AGD) and the restarted heavy-ball (HB) method.
We establish that our methods achieve an $\mathcal{O}\left(\frac{1}{\epsilon^{7/4}}\right)$ number of gradient evaluations.
Our algorithms are simple in the sense that they only consist of Nesterov's classical AGD or Polyak's HB, as well as a restart mechanism.
arXiv Detail & Related papers (2022-01-27T10:04:04Z) - A One-bit, Comparison-Based Gradient Estimator [29.600975900977343]
We show how one can use tools from one-bit compressed sensing to construct a robust and reliable estimator of the normalized gradient.
We propose an algorithm, coined SCOBO, that uses this estimator within a gradient descent scheme.
arXiv Detail & Related papers (2020-10-06T05:01:38Z) - Variance Reduced EXTRA and DIGing and Their Optimal Acceleration for
Strongly Convex Decentralized Optimization [69.49313819343992]
We extend the widely used EXTRA and DIGing methods with variance reduction (VR), and propose two methods: VR-EXTRA and VR-DIGing.
The proposed VR-EXTRA requires $O\left((\kappa_s+n)\log\frac{1}{\epsilon}\right)$ gradient evaluations and $O\left((\kappa_b+\kappa_c)\log\frac{1}{\epsilon}\right)$ communication rounds.
The proposed VR-DIGing has a slightly higher communication cost of $O\left((\kappa_b+\kappa\right.$
arXiv Detail & Related papers (2020-09-09T15:48:44Z) - Gradient Free Minimax Optimization: Variance Reduction and Faster
Convergence [120.9336529957224]
In this paper, we consider the nonconvex-strongly-concave setting of the gradient-free minimax optimization problem.
We show that a novel zeroth-order variance reduced descent algorithm achieves the best known query complexity.
arXiv Detail & Related papers (2020-06-16T17:55:46Z)