Sharp Bounds for Federated Averaging (Local SGD) and Continuous
Perspective
- URL: http://arxiv.org/abs/2111.03741v1
- Date: Fri, 5 Nov 2021 22:16:11 GMT
- Title: Sharp Bounds for Federated Averaging (Local SGD) and Continuous
Perspective
- Authors: Margalit Glasgow, Honglin Yuan, Tengyu Ma
- Abstract summary: Federated AveragingFedAvg, also known as Local SGD, is one of the most popular algorithms in Federated Learning (FL)
We show how to analyze this quantity from the Differential Equation (SDE) perspective.
- Score: 49.17352150219212
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Federated Averaging (FedAvg), also known as Local SGD, is one of the most
popular algorithms in Federated Learning (FL). Despite its simplicity and
popularity, the convergence rate of FedAvg has thus far been undetermined. Even
under the simplest assumptions (convex, smooth, homogeneous, and bounded
covariance), the best-known upper and lower bounds do not match, and it is not
clear whether the existing analysis captures the capacity of the algorithm. In
this work, we first resolve this question by providing a lower bound for FedAvg
that matches the existing upper bound, which shows the existing FedAvg upper
bound analysis is not improvable. Additionally, we establish a lower bound in a
heterogeneous setting that nearly matches the existing upper bound. While our
lower bounds show the limitations of FedAvg, under an additional assumption of
third-order smoothness, we prove more optimistic state-of-the-art convergence
results in both convex and non-convex settings. Our analysis stems from a
notion we call iterate bias, which is defined by the deviation of the
expectation of the SGD trajectory from the noiseless gradient descent
trajectory with the same initialization. We prove novel sharp bounds on this
quantity, and show intuitively how to analyze this quantity from a Stochastic
Differential Equation (SDE) perspective.
Related papers
- A Unified Analysis for Finite Weight Averaging [50.75116992029417]
Averaging iterations of Gradient Descent (SGD) have achieved empirical success in training deep learning models, such as Weight Averaging (SWA), Exponential Moving Average (EMA), and LAtest Weight Averaging (LAWA)
In this paper, we generalize LAWA as Finite Weight Averaging (FWA) and explain their advantages compared to SGD from the perspective of optimization and generalization.
arXiv Detail & Related papers (2024-11-20T10:08:22Z) - Closing the Gap Between the Upper Bound and the Lower Bound of Adam's
Iteration Complexity [51.96093077151991]
We derive a new convergence guarantee of Adam, with only an $L$-smooth condition and a bounded noise variance assumption.
Our proof utilizes novel techniques to handle the entanglement between momentum and adaptive learning rate.
arXiv Detail & Related papers (2023-10-27T09:16:58Z) - Convex and Non-convex Optimization Under Generalized Smoothness [69.69521650503431]
An analysis of convex and non- optimization methods often requires the Lipsitzness gradient, which limits the analysis by this trajectorys.
Recent work generalizes the gradient setting via the non-uniform smoothness condition.
arXiv Detail & Related papers (2023-06-02T04:21:59Z) - A Convergence Theory for Federated Average: Beyond Smoothness [28.074273047592065]
Federated learning enables a large amount of edge computing devices to learn a model without data sharing jointly.
As a leading algorithm in this setting, Federated Average FedAvg, which runs Gradient Descent (SGD) in parallel on local devices, has been widely used.
This paper provides a theoretical convergence study on Federated Learning.
arXiv Detail & Related papers (2022-11-03T04:50:49Z) - Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and
Beyond [63.59034509960994]
We study shuffling-based variants: minibatch and local Random Reshuffling, which draw gradients without replacement.
For smooth functions satisfying the Polyak-Lojasiewicz condition, we obtain convergence bounds which show that these shuffling-based variants converge faster than their with-replacement counterparts.
We propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.
arXiv Detail & Related papers (2021-10-20T02:25:25Z) - Stochastic Gradient Descent-Ascent and Consensus Optimization for Smooth
Games: Convergence Analysis under Expected Co-coercivity [49.66890309455787]
We introduce the expected co-coercivity condition, explain its benefits, and provide the first last-iterate convergence guarantees of SGDA and SCO.
We prove linear convergence of both methods to a neighborhood of the solution when they use constant step-size.
Our convergence guarantees hold under the arbitrary sampling paradigm, and we give insights into the complexity of minibatching.
arXiv Detail & Related papers (2021-06-30T18:32:46Z) - On the Oracle Complexity of Higher-Order Smooth Non-Convex Finite-Sum
Optimization [1.6973426830397942]
We prove lower bounds for higher-order methods in smooth non finite-sum optimization.
We show that a pth-order regularized method cannot profit from the finitesum objective.
We show that a new second-order smoothness assumption can be seen as an analogue to the first-order mean-squared smoothness.
arXiv Detail & Related papers (2021-03-08T23:33:58Z) - Stochastic optimization with momentum: convergence, fluctuations, and
traps avoidance [0.0]
In this paper, a general optimization procedure is studied, unifying several variants of the gradient descent such as, among others, the heavy ball method, the Nesterov Accelerated Gradient (S-NAG), and the widely used Adam method.
The avoidance is studied as a noisy discretization of a non-autonomous ordinary differential equation.
arXiv Detail & Related papers (2020-12-07T19:14:49Z) - Convergence of Gradient Algorithms for Nonconvex $C^{1+\alpha}$ Cost
Functions [2.538209532048867]
We show a byproduct that a gradient converges and provide an explicit upper bound on the convergence rate.
This allows us to prove the almost sure convergence by Doobmartin's supergale convergence theorem.
arXiv Detail & Related papers (2020-12-01T16:48:59Z) - Linear Last-iterate Convergence in Constrained Saddle-point Optimization [48.44657553192801]
We significantly expand the understanding of last-rate uniqueness for Optimistic Gradient Descent Ascent (OGDA) and Optimistic Multiplicative Weights Update (OMWU)
We show that when the equilibrium is unique, linear lastiterate convergence is achieved with a learning rate whose value is set to a universal constant.
We show that bilinear games over any polytope satisfy this condition and OGDA converges exponentially fast even without the unique equilibrium assumption.
arXiv Detail & Related papers (2020-06-16T20:53:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.