FedPAGE: A Fast Local Stochastic Gradient Method for
Communication-Efficient Federated Learning
- URL: http://arxiv.org/abs/2108.04755v1
- Date: Tue, 10 Aug 2021 15:41:27 GMT
- Title: FedPAGE: A Fast Local Stochastic Gradient Method for
Communication-Efficient Federated Learning
- Authors: Haoyu Zhao, Zhize Li, Peter Richt\'arik
- Abstract summary: Federated Averaging (FedAvg, also known as Local-SGD) is a classical federated learning algorithm in which clients run multiple local SGD steps before communicating their update to an orchestrating server.
We propose a new federated learning algorithm, FedPAGE, able to reduce the communication complexity by utilizing the recent optimal PAGE method.
- Score: 21.055643409860743
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Federated Averaging (FedAvg, also known as Local-SGD) (McMahan et al., 2017)
is a classical federated learning algorithm in which clients run multiple local
SGD steps before communicating their update to an orchestrating server. We
propose a new federated learning algorithm, FedPAGE, able to further reduce the
communication complexity by utilizing the recent optimal PAGE method (Li et
al., 2021) instead of plain SGD in FedAvg. We show that FedPAGE uses much fewer
communication rounds than previous local methods for both federated convex and
nonconvex optimization. Concretely, 1) in the convex setting, the number of
communication rounds of FedPAGE is $O(\frac{N^{3/4}}{S\epsilon})$, improving
the best-known result $O(\frac{N}{S\epsilon})$ of SCAFFOLD (Karimireddy et
al., 2020) by a factor of $N^{1/4}$, where $N$ is the total number of clients
(usually very large in federated learning), $S$ is the size of the subset of
clients sampled in each communication round, and $\epsilon$ is the target error; 2) in
the nonconvex setting, the number of communication rounds of FedPAGE is
$O(\frac{\sqrt{N}+S}{S\epsilon^2})$, improving the best-known result
$O(\frac{N^{2/3}}{S^{2/3}\epsilon^2})$ of SCAFFOLD (Karimireddy et al., 2020) by
a factor of $N^{1/6}S^{1/3}$, provided the number of sampled clients satisfies $S\leq \sqrt{N}$. Note
that in both settings, the communication cost for each round is the same for
both FedPAGE and SCAFFOLD. As a result, FedPAGE achieves new state-of-the-art
results in terms of communication complexity for both federated convex and
nonconvex optimization.
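
As a quick check on the claimed gains, the improvement factors quoted above follow from dividing SCAFFOLD's round complexity by FedPAGE's, using only the bounds stated in this abstract (the nonconvex case is simplified under the stated assumption $S\leq\sqrt{N}$, so that $\sqrt{N}+S=\Theta(\sqrt{N})$):

\[
\text{convex:}\quad \frac{N/(S\epsilon)}{N^{3/4}/(S\epsilon)} = N^{1/4},
\qquad
\text{nonconvex:}\quad \frac{N^{2/3}/(S^{2/3}\epsilon^2)}{\sqrt{N}/(S\epsilon^2)} = N^{1/6}S^{1/3}.
\]

The abstract also fixes the round structure shared by FedAvg, SCAFFOLD, and FedPAGE: in each round the server samples $S$ of the $N$ clients, each sampled client runs several local steps starting from the current server model, and the server aggregates the returned updates. The Python sketch below is a minimal, hypothetical illustration of that FedAvg-style skeleton with plain local SGD; FedPAGE keeps the same communication pattern but replaces the SGD step with a PAGE-style variance-reduced gradient estimator, which is not reproduced here because its details are not given in this listing. All names and hyperparameters (grad_fn, local_steps, lr) are illustrative, not taken from the paper.

import numpy as np

def local_sgd_update(w, data, grad_fn, local_steps=10, lr=0.1):
    """Run a few local SGD steps on one client's data; return the model delta."""
    w_local = w.copy()
    for _ in range(local_steps):
        sample = data[np.random.randint(len(data))]  # one stochastic sample
        w_local -= lr * grad_fn(w_local, sample)     # plain SGD step, as in FedAvg
    return w_local - w                               # update sent back to the server

def fedavg_round(w, client_datasets, grad_fn, S=10, rng=np.random.default_rng(0)):
    """One communication round: sample S clients, average their local updates."""
    sampled = rng.choice(len(client_datasets), size=S, replace=False)
    deltas = [local_sgd_update(w, client_datasets[i], grad_fn) for i in sampled]
    return w + np.mean(deltas, axis=0)               # server-side aggregation

if __name__ == "__main__":
    # Toy usage: N = 100 least-squares clients with d = 5 parameters each.
    d, N = 5, 100
    grad_fn = lambda w, xy: 2 * (w @ xy[0] - xy[1]) * xy[0]  # gradient of (w.x - y)^2
    clients = [[(np.random.randn(d), np.random.randn()) for _ in range(20)]
               for _ in range(N)]
    w = np.zeros(d)
    for _ in range(50):                              # 50 communication rounds
        w = fedavg_round(w, clients, grad_fn, S=10)
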
Related papers
- Federated Aggregation of Mallows Rankings: A Comparative Analysis of Borda and Lehmer Coding [27.138031759281745]
We present the first known federated rank aggregation methods using Borda scoring and Lehmer codes.
Our results represent the first rigorous analysis of Borda's method in centralized and distributed settings under the Mallows model.
arXiv Detail & Related papers (2024-09-01T21:36:09Z) - Federated Linear Bandits with Finite Adversarial Actions [20.1041278044797]
We study a federated linear bandits model, where $M$ clients communicate with a central server to solve a linear contextual bandits problem.
To address the unique challenges of adversarial finite action sets, we propose the FedSupLinUCB algorithm.
We prove that FedSupLinUCB achieves a total regret of $\tilde{O}(\sqrt{dT})$, where $T$ is the total number of arm pulls from all clients, and $d$ is the ambient dimension of the linear model.
arXiv Detail & Related papers (2023-11-02T03:41:58Z) - Solving a Class of Non-Convex Minimax Optimization in Federated Learning [84.98927714326908]
Minimax problems arise throughout machine learning applications, from model training to large-scale learning.
We propose a class of algorithms for nonconvex minimax problems that reduces the complexity to $O(\varepsilon^{-6})$.
We prove that FedSGDA-M has the best-known sample complexity of $O(\kappa^{2}\varepsilon^{-3})$ and the best-known communication complexity of $O(\kappa^{2}\varepsilon^{-3})$.
arXiv Detail & Related papers (2023-10-05T15:48:41Z) - An Efficient Stochastic Algorithm for Decentralized Nonconvex-Strongly-Concave Minimax Optimization [25.00475462213752]
We propose DREAM, a decentralized recursive gradient method for nonconvex-strongly-concave minimax optimization.
Concretely, it requires $\mathcal{O}\big(\min(\kappa^{3}\epsilon^{-3},\,\kappa^{2}\sqrt{N}\epsilon^{-2})\big)$ stochastic first-order oracle (SFO) calls and $\tilde{\mathcal{O}}(\kappa^{2}\epsilon^{-2})$ communication rounds.
Our numerical experiments validate the superiority of the proposed method over previous ones.
arXiv Detail & Related papers (2022-12-05T16:09:39Z) - DASHA: Distributed Nonconvex Optimization with Communication
Compression, Optimal Oracle Complexity, and No Client Synchronization [77.34726150561087]
We develop and analyze DASHA: a new family of methods for nonconvex distributed optimization problems.
Unlike MARINA, the new methods DASHA and DASHA-MVR send compressed vectors only and never synchronize the nodes, which makes them more practical for federated learning (a minimal sketch of such an unbiased compression operator appears after this list).
arXiv Detail & Related papers (2022-02-02T20:10:40Z) - Federated Optimization of Smooth Loss Functions [35.944772011421264]
In this work, we study empirical risk minimization (ERM) within a federated learning framework.
We propose the Federated Low Rank Gradient Descent (FedLRGD) algorithm.
Our method solves the ERM problem at the server using inexact gradient descent.
arXiv Detail & Related papers (2022-01-06T07:40:51Z) - Faster Rates for Compressed Federated Learning with Client-Variance
Reduction [23.169998435268504]
We show that COFIG and FRECON converge within $O(\frac{(1+\omega)\sqrt{N}}{S\epsilon^2})$ communication rounds.
In the convex setting, COFIG converges within $O(\frac{(1+\omega)\sqrt{N}}{S\epsilon^2})$ communication rounds, which, to the best of our knowledge, is the first such result.
arXiv Detail & Related papers (2021-12-24T16:28:18Z) - Distributed Saddle-Point Problems Under Similarity [173.19083235638104]
We derive lower bounds on the number of communication rounds needed to reach a given suboptimality $\epsilon>0$ over master/workers networks.
We then propose algorithms matching the lower bounds for both types of networks (up to logarithmic factors).
We assess effectiveness of the proposed algorithms on a robust logistic regression problem.
arXiv Detail & Related papers (2021-07-22T14:25:16Z) - STEM: A Stochastic Two-Sided Momentum Algorithm Achieving Near-Optimal
Sample and Communication Complexities for Federated Learning [58.6792963686231]
Federated Learning (FL) refers to the paradigm where multiple worker nodes (WNs) build a joint model by using local data.
It is not clear how to choose the WNs' and the server's update directions, the minibatch sizes, and the local update frequency.
We show that there is a trade-off curve between local update frequencies and local minibatch sizes, on which the above complexities can be maintained.
arXiv Detail & Related papers (2021-06-19T06:13:45Z) - Private Stochastic Convex Optimization: Optimal Rates in $\ell_1$
Geometry [69.24618367447101]
Up to logarithmic factors, the optimal excess population loss of any $(\varepsilon,\delta)$-differentially private optimizer is $\sqrt{\log(d)/n} + \sqrt{d}/(\varepsilon n)$.
We show that when the loss functions satisfy additional smoothness assumptions, the excess loss is upper bounded (up to logarithmic factors) by $\sqrt{\log(d)/n} + (\log(d)/(\varepsilon n))^{2/3}$.
arXiv Detail & Related papers (2021-03-02T06:53:44Z) - Global Convergence of Frank Wolfe on One Hidden Layer Networks [121.96696298666014]
We derive global convergence bounds for the Frank-Wolfe algorithm when training one-hidden-layer neural networks.
When using the ReLU activation function, and under tractable preconditioning assumptions on the sample data set, the linear minimization oracle used to incrementally form the solution can be solved explicitly as a second order cone program.
arXiv Detail & Related papers (2020-02-06T11:58:43Z)
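
The compressed-communication entries above (DASHA and COFIG/FRECON) quantify compression through a constant $\omega$. As a point of reference only, and not taken from either paper's code, the sketch below implements the standard unbiased rand-$k$ sparsifier, for which $\mathbb{E}[C(x)] = x$ and $\mathbb{E}\|C(x)-x\|^2 \leq \omega\|x\|^2$ with $\omega = d/k - 1$ on $d$-dimensional vectors; smaller $k$ means cheaper communication but a larger $\omega$ in bounds such as COFIG's.

import numpy as np

def rand_k(x: np.ndarray, k: int, rng=np.random.default_rng()) -> np.ndarray:
    """Keep k random coordinates of x, rescaled by d/k so the estimate is unbiased."""
    d = x.size
    idx = rng.choice(d, size=k, replace=False)  # coordinates actually transmitted
    out = np.zeros_like(x)
    out[idx] = x[idx] * (d / k)                 # rescaling makes E[C(x)] = x
    return out

if __name__ == "__main__":
    # Empirical unbiasedness check: the average of many compressions approaches x.
    x = np.arange(10, dtype=float)
    est = np.mean([rand_k(x, k=2) for _ in range(20000)], axis=0)
    print(np.max(np.abs(est - x)))              # small for large sample counts
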