Decentralized SGD and Average-direction SAM are Asymptotically
Equivalent
- URL: http://arxiv.org/abs/2306.02913v5
- Date: Thu, 9 Nov 2023 13:15:02 GMT
- Title: Decentralized SGD and Average-direction SAM are Asymptotically
Equivalent
- Authors: Tongtian Zhu, Fengxiang He, Kaixuan Chen, Mingli Song, Dacheng Tao
- Abstract summary: Decentralized stochastic gradient descent (D-SGD) allows collaborative learning on massive devices simultaneously without the control of a central server.
Existing theories claim that decentralization invariably undermines generalization.
- Score: 101.37242096601315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Decentralized stochastic gradient descent (D-SGD) allows collaborative
learning on massive devices simultaneously without the control of a central
server. However, existing theories claim that decentralization invariably
undermines generalization. In this paper, we challenge the conventional belief
and present a completely new perspective for understanding decentralized
learning. We prove that D-SGD implicitly minimizes the loss function of an
average-direction Sharpness-aware minimization (SAM) algorithm under general
non-convex non-$\beta$-smooth settings. This surprising asymptotic equivalence
reveals an intrinsic regularization-optimization trade-off and three advantages
of decentralization: (1) there exists a free uncertainty evaluation mechanism
in D-SGD to improve posterior estimation; (2) D-SGD exhibits a gradient
smoothing effect; and (3) the sharpness regularization effect of D-SGD does not
decrease as total batch size increases, which justifies the potential
generalization benefit of D-SGD over centralized SGD (C-SGD) in large-batch
scenarios. The code is available at
https://github.com/Raiden-Zhu/ICML-2023-DSGD-and-SAM.
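For readers unfamiliar with the algorithm being analyzed, the following is a minimal NumPy sketch of the vanilla D-SGD update (gossip averaging with a doubly stochastic mixing matrix, followed by a local stochastic gradient step). The ring topology, quadratic local losses, and all function names are illustrative assumptions, not code from the paper's repository.

```python
import numpy as np

def dsgd_step(X, W, grad_fn, lr, rng):
    """One D-SGD round: x_i <- sum_j W_ij * x_j - lr * g_i(x_i).

    X       : (n, d) array, row i holds node i's current parameters.
    W       : (n, n) doubly stochastic mixing (gossip) matrix.
    grad_fn : grad_fn(i, x, rng) -> stochastic gradient of node i's local loss at x.
    """
    grads = np.stack([grad_fn(i, X[i], rng) for i in range(X.shape[0])])
    return W @ X - lr * grads  # gossip-average with neighbors, then step on the local gradient

# Toy run: 4 nodes on a ring, noisy gradients of quadratic local losses 0.5*||x - t_i||^2.
rng = np.random.default_rng(0)
n, d = 4, 3
P = np.roll(np.eye(n), 1, axis=0)            # cyclic shift (ring neighbors)
W = 0.5 * np.eye(n) + 0.25 * (P + P.T)       # symmetric, doubly stochastic mixing matrix
targets = rng.normal(size=(n, d))
grad_fn = lambda i, x, rng: (x - targets[i]) + 0.01 * rng.normal(size=d)
X = np.zeros((n, d))
for _ in range(500):
    X = dsgd_step(X, W, grad_fn, lr=0.1, rng=rng)
print(X.mean(axis=0))        # consensus average of the local models ...
print(targets.mean(axis=0))  # ... approaches the minimizer of the global (average) loss
```

The equivalence result above is a statement about this update in expectation; the sketch only shows the base algorithm, not the average-direction SAM analysis.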
Related papers
- Stability and Generalization of the Decentralized Stochastic Gradient
Descent Ascent Algorithm [80.94861441583275]
We investigate the generalization bound of the decentralized stochastic gradient descent ascent (D-SGDA) algorithm.
Our results analyze the impact of different factors on the generalization of D-SGDA.
We also balance the generalization with the optimization error to obtain the optimal population risk in the convex-concave setting.
arXiv Detail & Related papers (2023-10-31T11:27:01Z)
- DR-DSGD: A Distributionally Robust Decentralized Learning Algorithm over Graphs [54.08445874064361]
We propose to solve a regularized distributionally robust learning problem in the decentralized setting.
By adding a Kullback-Leibler regularization function to the robust min-max optimization problem, the learning problem can be reduced to a modified robust problem.
We show that our proposed algorithm can improve the worst distribution test accuracy by up to $10\%$.
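To make the reduction above concrete, here is a generic sketch (not the paper's DR-DSGD algorithm) of how a Kullback-Leibler penalty turns the robust max over mixture weights into a smooth log-sum-exp surrogate that can then be minimized with ordinary (decentralized) gradient methods. The function names, the regularization weight `lam`, and the toy numbers are illustrative.

```python
import numpy as np

def kl_robust_loss(local_losses, lam):
    """Value of max_{p in simplex} sum_i p_i * f_i - lam * KL(p || uniform).

    The KL penalty gives the inner maximization a closed form,
        lam * log( (1/n) * sum_i exp(f_i / lam) ),
    so the robust min-max reduces to minimizing this smooth surrogate.
    """
    f = np.asarray(local_losses, dtype=float) / lam
    m = f.max()                               # stabilized log-sum-exp
    return lam * (m + np.log(np.mean(np.exp(f - m))))

def worst_case_weights(local_losses, lam):
    """The maximizing distribution: a softmax of the local losses with temperature lam."""
    f = np.asarray(local_losses, dtype=float) / lam
    w = np.exp(f - f.max())
    return w / w.sum()

losses = [0.2, 0.9, 0.4, 1.5]                 # toy per-node empirical losses
print(kl_robust_loss(losses, lam=0.5))        # smooth proxy for the worst-case loss
print(worst_case_weights(losses, lam=0.5))    # nodes with larger loss receive larger weight
```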
arXiv Detail & Related papers (2022-08-29T18:01:42Z)
- Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions [26.782342518986503]
Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems.
We show how to adapt the HSGD formalism to include streaming SGD, which allows us to produce an exact prediction for the excess risk of multi-pass SGD relative to that of streaming SGD.
arXiv Detail & Related papers (2022-06-15T02:32:26Z)
- Heavy-Tail Phenomenon in Decentralized SGD [33.63000461985398]
We study the emergence of heavy tails in decentralized stochastic gradient descent (DE-SGD).
We also investigate the effect of decentralization on the tail behavior.
Our theory uncovers an interesting interplay between the tails and the network structure.
arXiv Detail & Related papers (2022-05-13T14:47:04Z)
- Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime [127.21287240963859]
Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization.
This paper aims to sharply characterize the generalization of multi-pass SGD.
We show that although multi-pass SGD needs more iterations than GD to achieve the same level of excess risk, it requires fewer gradient evaluations.
arXiv Detail & Related papers (2022-03-07T06:34:53Z)
- Decentralized Stochastic Proximal Gradient Descent with Variance Reduction over Time-varying Networks [30.231314171218994]
In decentralized learning, a network of nodes cooperate to minimize an overall objective function that is usually the finite-sum of their local objectives.
We propose a novel algorithm, namely DPSVRG, to accelerate the decentralized training by leveraging the variance reduction technique.
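As background on the variance-reduction technique mentioned above, here is a minimal single-node SVRG-style gradient estimator; the decentralized, proximal, and time-varying-network machinery of DPSVRG is omitted, and all names are illustrative.

```python
import numpy as np

def svrg_gradient(grad_i, x, snapshot, full_grad_at_snapshot, i):
    """Variance-reduced estimate: grad_i(x) - grad_i(snapshot) + full_grad(snapshot).

    Unbiased for the full gradient at x, with variance that shrinks
    as x approaches the snapshot point.
    """
    return grad_i(i, x) - grad_i(i, snapshot) + full_grad_at_snapshot

# Toy finite sum: f_i(x) = 0.5 * a_i * ||x - t_i||^2.
rng = np.random.default_rng(1)
targets = rng.normal(size=(8, 3))
scales = rng.uniform(0.5, 2.0, size=8)
grad_i = lambda i, x: scales[i] * (x - targets[i])
full_grad = lambda x: np.mean(scales[:, None] * (x - targets), axis=0)

x, lr = np.zeros(3), 0.1
for epoch in range(30):
    snapshot, mu = x.copy(), full_grad(x)     # refresh snapshot and its full gradient
    for _ in range(len(targets)):
        i = rng.integers(len(targets))
        x = x - lr * svrg_gradient(grad_i, x, snapshot, mu, i)
print(x)
print(np.average(targets, axis=0, weights=scales))  # closed-form minimizer, for comparison
```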
arXiv Detail & Related papers (2021-12-20T08:23:36Z)
- Detached Error Feedback for Distributed SGD with Random Sparsification [98.98236187442258]
Communication bottleneck has been a critical problem in large-scale deep learning.
We propose a new detached error feedback (DEF) algorithm, which shows better convergence than error feedback for non-convex distributed problems.
We also propose DEFA to accelerate the generalization of DEF, which shows better generalization bounds than DEF.
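For context on the mechanism discussed above, here is a minimal sketch of plain error feedback combined with a rand-k sparsifier. It illustrates the generic error-feedback idea that DEF builds on, not the detached variant proposed in the paper; a single worker is shown and all names are illustrative.

```python
import numpy as np

def rand_k_sparsify(v, k, rng):
    """Rand-k compressor: keep k randomly chosen coordinates, zero out the rest."""
    out = np.zeros_like(v)
    idx = rng.choice(v.size, size=k, replace=False)
    out[idx] = v[idx]
    return out

def error_feedback_step(grad, memory, k, rng):
    """Classic error feedback: compress gradient + carried-over error, keep the residual."""
    corrected = grad + memory
    message = rand_k_sparsify(corrected, k, rng)
    return message, corrected - message       # residual is re-injected at the next step

# Toy run: one worker compressing exact gradients of f(x) = 0.5 * ||x - t||^2.
rng = np.random.default_rng(2)
t = rng.normal(size=10)
x, memory, lr = np.zeros(10), np.zeros(10), 0.1
for _ in range(300):
    message, memory = error_feedback_step(x - t, memory, k=2, rng=rng)
    x = x - lr * message                      # only the sparse message is applied
print(np.linalg.norm(x - t))                  # small: the memory compensates for dropped coordinates
```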
arXiv Detail & Related papers (2020-04-11T03:50:59Z)
- A Unified Theory of Decentralized SGD with Changing Topology and Local Updates [70.9701218475002]
We introduce a unified convergence analysis of decentralized SGD methods.
We derive universal convergence rates for several applications.
Our proofs rely on weak assumptions.
arXiv Detail & Related papers (2020-03-23T17:49:15Z)