Why (and When) does Local SGD Generalize Better than SGD?
- URL: http://arxiv.org/abs/2303.01215v1
- Date: Thu, 2 Mar 2023 12:56:52 GMT
- Title: Why (and When) does Local SGD Generalize Better than SGD?
- Authors: Xinran Gu, Kaifeng Lyu, Longbo Huang, Sanjeev Arora
- Abstract summary: Local SGD is a communication-efficient variant of SGD for large-scale training.
This paper aims to understand why (and when) Local SGD generalizes better based on Stochastic Differential Equation (SDE) approximation.
- Score: 46.993699881100454
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Local SGD is a communication-efficient variant of SGD for large-scale
training, where multiple GPUs perform SGD independently and average the model
parameters periodically. It has been recently observed that Local SGD can not
only achieve the design goal of reducing the communication overhead but also
lead to higher test accuracy than the corresponding SGD baseline (Lin et al.,
2020b), though the training regimes for this to happen are still in debate
(Ortiz et al., 2021). This paper aims to understand why (and when) Local SGD
generalizes better based on Stochastic Differential Equation (SDE)
approximation. The main contributions of this paper include (i) the derivation
of an SDE that captures the long-term behavior of Local SGD in the small
learning rate regime, showing how noise drives the iterate to drift and diffuse
after it has reached close to the manifold of local minima, (ii) a comparison
between the SDEs of Local SGD and SGD, showing that Local SGD induces a
stronger drift term that can result in a stronger effect of regularization,
e.g., a faster reduction of sharpness, and (iii) empirical evidence validating
that having a small learning rate and long enough training time enables the
generalization improvement over SGD but removing either of the two conditions
leads to no improvement.
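To make the procedure described in the abstract concrete, the following is a minimal sketch of Local SGD on a toy least-squares problem. The loss function, worker count K, synchronization period H, learning rate, and batch size are illustrative assumptions, not settings from the paper.

```python
# Hedged sketch of Local SGD: K workers run SGD independently for H steps,
# then average their parameters. All constants and the toy loss are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, n_per_worker, K, H, eta, rounds = 10, 256, 4, 8, 0.05, 50

# Each worker holds its own shard of a synthetic least-squares dataset.
w_true = rng.normal(size=d)
data = []
for _ in range(K):
    X = rng.normal(size=(n_per_worker, d))
    y = X @ w_true + 0.1 * rng.normal(size=n_per_worker)
    data.append((X, y))

def sgd_step(w, X, y, batch_size=32):
    """One stochastic gradient step on a random mini-batch of (X, y)."""
    idx = rng.choice(len(y), size=batch_size, replace=False)
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
    return w - eta * grad

w_global = np.zeros(d)
for _ in range(rounds):                  # one round = H local steps + one sync
    local = []
    for X, y in data:                    # in practice these run in parallel on K GPUs
        w = w_global.copy()
        for _ in range(H):               # independent local SGD steps
            w = sgd_step(w, X, y)
        local.append(w)
    w_global = np.mean(local, axis=0)    # periodic parameter averaging

print("distance to w_true:", np.linalg.norm(w_global - w_true))
```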
Related papers
- Stability and Generalization for Distributed SGDA [70.97400503482353]
We propose a stability-based framework for analyzing the generalization of Distributed-SGDA.
We conduct a comprehensive analysis of stability error, generalization gap, and population risk across different metrics.
Our theoretical results reveal the trade-off between the generalization gap and optimization error.
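For readers unfamiliar with the algorithm being analyzed, below is a hedged sketch of stochastic gradient descent-ascent (SGDA) for a minimax problem, run on several workers with parameter averaging. The toy objective, the averaging schedule, and all constants are illustrative assumptions; the summary above does not specify the exact distributed protocol.

```python
# Hedged sketch of SGDA for min_x max_y f(x, y) on K workers with averaging.
# Toy objective: f(x, y) = 0.5||x||^2 + x.y - 0.5||y||^2 (saddle point at 0, 0).
import numpy as np

rng = np.random.default_rng(1)
d, K, eta = 5, 4, 0.1

def stoch_grads(x, y):
    """Noisy gradients of the toy objective."""
    noise = rng.normal(scale=0.01, size=(2, d))
    gx = x + y + noise[0]        # gradient w.r.t. the min variable x
    gy = x - y + noise[1]        # gradient w.r.t. the max variable y
    return gx, gy

x = [rng.normal(size=d) for _ in range(K)]
y = [rng.normal(size=d) for _ in range(K)]
for _ in range(200):
    for k in range(K):
        gx, gy = stoch_grads(x[k], y[k])
        x[k] = x[k] - eta * gx   # descent on x
        y[k] = y[k] + eta * gy   # ascent on y
    # averaging across workers after every step (assumed communication pattern)
    x = [np.mean(x, axis=0)] * K
    y = [np.mean(y, axis=0)] * K

print("||x||, ||y|| near the saddle point:", np.linalg.norm(x[0]), np.linalg.norm(y[0]))
```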
arXiv Detail & Related papers (2024-11-14T11:16:32Z)
- The Limits and Potentials of Local SGD for Distributed Heterogeneous Learning with Intermittent Communication [37.210933391984014]
Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms in practice.
We provide new lower bounds for local SGD under existing first-order data heterogeneity assumptions.
We also show the min-max optimality of accelerated mini-batch SGD for several problem classes.
arXiv Detail & Related papers (2024-05-19T20:20:03Z)
- Decentralized SGD and Average-direction SAM are Asymptotically Equivalent [101.37242096601315]
Decentralized stochastic gradient descent (D-SGD) allows collaborative learning on massive devices simultaneously without the control of a central server.
Existing theories claim that decentralization invariably undermines generalization.
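A minimal sketch of the D-SGD update follows, assuming a ring topology and a doubly stochastic mixing matrix; the topology, step size, and toy objective are illustrative assumptions rather than details from the paper.

```python
# Hedged sketch of decentralized SGD (D-SGD): each worker takes a local
# stochastic gradient step and then gossip-averages with its ring neighbors
# through a doubly stochastic mixing matrix W. All constants are illustrative.
import numpy as np

rng = np.random.default_rng(2)
d, K, eta, steps = 10, 5, 0.05, 300

# Ring topology: each worker mixes with itself and its two neighbors.
W = np.zeros((K, K))
for i in range(K):
    W[i, i] = W[i, (i - 1) % K] = W[i, (i + 1) % K] = 1.0 / 3.0

targets = rng.normal(size=(K, d))        # worker i minimizes 0.5||x - targets[i]||^2
x = rng.normal(size=(K, d))              # one parameter vector per worker

for _ in range(steps):
    grads = (x - targets) + 0.05 * rng.normal(size=(K, d))   # noisy local gradients
    x = W @ (x - eta * grads)            # local step followed by neighbor averaging

consensus = x.mean(axis=0)
print("consensus distance to global optimum:", np.linalg.norm(consensus - targets.mean(axis=0)))
```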
arXiv Detail & Related papers (2023-06-05T14:19:52Z)
- Local SGD Accelerates Convergence by Exploiting Second Order Information of the Loss Function [1.7767466724342065]
Local stochastic gradient descent (L-SGD) has been proven to be very effective in distributed machine learning schemes.
In this paper, we offer a new perspective to understand the strength of L-SGD.
arXiv Detail & Related papers (2023-05-24T10:54:45Z)
- Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime [127.21287240963859]
Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization.
This paper aims to sharply characterize the generalization of multi-pass SGD.
We show that although SGD needs more iterations than GD to achieve the same level of excess risk, it requires fewer gradient evaluations.
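For intuition on this trade-off, a back-of-the-envelope accounting (the symbols below are generic, not quantities from the paper): with n training samples, one full-batch GD step costs n stochastic gradient evaluations while one single-sample SGD step costs one.

```latex
% Illustrative accounting of gradient evaluations with n training samples,
% T_GD full-batch GD iterations, and T_SGD single-sample SGD iterations:
\[
  \text{GD: } T_{\mathrm{GD}} \cdot n
  \qquad \text{vs.} \qquad
  \text{SGD: } T_{\mathrm{SGD}} \cdot 1 ,
\]
% so even if T_SGD > T_GD, SGD uses fewer gradient evaluations
% whenever T_SGD < n * T_GD.
```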
arXiv Detail & Related papers (2022-03-07T06:34:53Z)
- Trade-offs of Local SGD at Scale: An Empirical Study [24.961068070560344]
We study a technique known as local SGD to reduce communication overhead.
We find that performing local SGD comes at a price: lower communication costs (and thereby faster training) are accompanied by lower accuracy.
We also show that incorporating the slow momentum framework consistently improves accuracy without requiring additional communication.
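A hedged sketch of the "slow momentum" idea mentioned above, applied on top of a toy local-SGD round: an outer momentum step is taken on the averaged model. The pseudo-gradient construction and the coefficients alpha and beta follow one common formulation and are assumptions, not the paper's exact recipe.

```python
# Hedged sketch of an outer "slow momentum" update wrapped around local SGD.
import numpy as np

rng = np.random.default_rng(3)
d, eta = 10, 0.05
w_true = rng.normal(size=d)

def run_local_sgd_round(w_start, K=4, H=8):
    """Toy inner round: K workers take H noisy gradient steps on 0.5||w - w_true||^2, then average."""
    locals_ = []
    for _ in range(K):
        w = w_start.copy()
        for _ in range(H):
            w -= eta * ((w - w_true) + 0.1 * rng.normal(size=d))
        locals_.append(w)
    return np.mean(locals_, axis=0)

# Outer loop with slow momentum on the averaged model.
alpha, beta = 1.0, 0.7
w_slow, u = np.zeros(d), np.zeros(d)
for _ in range(30):
    w_avg = run_local_sgd_round(w_slow)
    pseudo_grad = (w_slow - w_avg) / eta   # treat the round's progress as a pseudo-gradient
    u = beta * u + pseudo_grad             # slow momentum buffer
    w_slow = w_slow - alpha * eta * u      # slow weight update

print("distance to optimum:", np.linalg.norm(w_slow - w_true))
```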
arXiv Detail & Related papers (2021-10-15T15:00:42Z)
- SGD with a Constant Large Learning Rate Can Converge to Local Maxima [4.014524824655106]
We construct worst-case optimization problems illustrating that gradient descent can exhibit strange and potentially undesirable behaviors.
Specifically, we construct landscapes and data distributions such that SGD converges to local maxima.
Our results highlight the importance of simultaneously analyzing the minibatch sampling, discrete-time update rules, and realistic landscapes.
arXiv Detail & Related papers (2021-07-25T10:12:18Z)
- Understanding Long Range Memory Effects in Deep Neural Networks [10.616643031188248]
Stochastic gradient descent (SGD) is of fundamental importance in deep learning.
In this study, we argue that stochastic gradient noise (SGN) is neither Gaussian nor stable. Instead, we propose that SGD can be viewed as a discretization of an SDE driven by fractional Brownian motion (FBM).
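One generic form of an SDE driven by fractional Brownian motion is written out below for reference; the specific drift and diffusion used in that paper may differ.

```latex
% Generic SDE driven by fractional Brownian motion B^H_t with Hurst index H
% (illustrative form; the paper's drift and diffusion may differ):
\[
  \mathrm{d}X_t = -\nabla L(X_t)\,\mathrm{d}t + \sigma\,\mathrm{d}B^{H}_t,
  \qquad 0 < H < 1 .
\]
% H = 1/2 recovers standard Brownian motion (the usual SDE view of SGD);
% H \neq 1/2 introduces the long-range memory referred to above.
```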
arXiv Detail & Related papers (2021-05-05T13:54:26Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- Is Local SGD Better than Minibatch SGD? [60.42437186984968]
We show how all existing error guarantees in the convex setting are dominated by a simple baseline, minibatch SGD.
We show that indeed local SGD does not dominate minibatch SGD by presenting a lower bound on the performance of local SGD that is worse than the minibatch SGD guarantee.
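To make the comparison concrete, a small sketch of the per-round gradient budget that is typically equalized between the two methods; the constants K, H, and b are illustrative assumptions.

```python
# Hedged sketch of the budget accounting behind the comparison above: with K
# workers, H local steps per round, and per-step batch size b, both methods use
# K * H * b stochastic gradients per communication round. Local SGD spends them
# on independent local trajectories; minibatch SGD spends them on H synchronized
# steps with an effective batch of K * b. All constants are illustrative.
K, H, b = 8, 16, 32

grads_per_round_local     = K * H * b    # K workers, H local steps each, batch b
grads_per_round_minibatch = H * (K * b)  # H synchronized steps, batch K*b

assert grads_per_round_local == grads_per_round_minibatch
print("stochastic gradients per round:", grads_per_round_local)
```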
arXiv Detail & Related papers (2020-02-18T19:22:43Z)