Topology-aware Generalization of Decentralized SGD
- URL: http://arxiv.org/abs/2206.12680v2
- Date: Tue, 28 Jun 2022 12:40:49 GMT
- Title: Topology-aware Generalization of Decentralized SGD
- Authors: Tongtian Zhu, Fengxiang He, Lan Zhang, Zhengyang Niu, Mingli Song,
Dacheng Tao
- Abstract summary: This paper studies the algorithmic stability and generalizability of decentralized stochastic gradient descent (D-SGD).
We prove that the generalizability of D-SGD has a positive correlation with the connectivity of the communication topology, which explains why consensus control in the initial training phase can ensure better generalization.
- Score: 89.25765221779288
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies the algorithmic stability and generalizability of
decentralized stochastic gradient descent (D-SGD). We prove that the consensus
model learned by D-SGD is $\mathcal{O}{(m/N+1/m+\lambda^2)}$-stable in
expectation in the non-convex non-smooth setting, where $N$ is the total sample
size of the whole system, $m$ is the worker number, and $1-\lambda$ is the
spectral gap that measures the connectivity of the communication topology.
These results then deliver an
$\mathcal{O}\big(1/N+\big((m^{-1}\lambda^2)^{\frac{\alpha}{2}}+m^{-\alpha}\big)/N^{1-\frac{\alpha}{2}}\big)$ in-average generalization bound,
which is non-vacuous even when $\lambda$ is close to $1$, in contrast to the
vacuous bounds suggested by existing literature on the projected version of D-SGD.
Our theory indicates that the generalizability of D-SGD has a positive
correlation with the spectral gap, and can explain why consensus control in
the initial training phase can ensure better generalization. Experiments with VGG-11
and ResNet-18 on CIFAR-10, CIFAR-100 and Tiny-ImageNet justify our theory. To
the best of our knowledge, this is the first work on the topology-aware generalization
of vanilla D-SGD. Code is available at
https://github.com/Raiden-Zhu/Generalization-of-DSGD.
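To make the quantities in the bound above concrete, here is a minimal Python sketch (illustrative only, not taken from the linked repository) that builds doubly-stochastic gossip matrices for a ring and a fully connected topology of $m$ workers, computes $\lambda$ and the spectral gap $1-\lambda$, and evaluates the shape of the $\mathcal{O}(m/N+1/m+\lambda^2)$ stability term. The helper names and the values of $m$ and $N$ below are assumptions chosen for illustration.

```python
# Illustrative sketch (not the paper's code): spectral gap of two common
# communication topologies and the resulting topology-dependent term of the
# O(m/N + 1/m + lambda^2) stability bound from the abstract.
import numpy as np


def ring_mixing_matrix(m: int) -> np.ndarray:
    """Doubly-stochastic gossip matrix for a ring of m workers
    (each worker averages equally with itself and its two neighbours)."""
    W = np.zeros((m, m))
    for i in range(m):
        W[i, i] = 1 / 3
        W[i, (i - 1) % m] = 1 / 3
        W[i, (i + 1) % m] = 1 / 3
    return W


def fully_connected_mixing_matrix(m: int) -> np.ndarray:
    """Gossip matrix for a fully connected topology (plain averaging)."""
    return np.full((m, m), 1 / m)


def spectral_quantities(W: np.ndarray):
    """Return lambda (second-largest eigenvalue magnitude) and the spectral gap 1 - lambda."""
    eigvals = np.sort(np.abs(np.linalg.eigvals(W)))[::-1]
    lam = eigvals[1]  # the largest eigenvalue of a doubly-stochastic W is 1
    return lam, 1 - lam


if __name__ == "__main__":
    m, N = 16, 50_000  # assumed worker count and total sample size
    for name, W in [("ring", ring_mixing_matrix(m)),
                    ("fully connected", fully_connected_mixing_matrix(m))]:
        lam, gap = spectral_quantities(W)
        stability_term = m / N + 1 / m + lam ** 2  # shape of the bound, constants omitted
        print(f"{name:16s} lambda={lam:.3f}  spectral gap={gap:.3f}  "
              f"m/N + 1/m + lambda^2 ~ {stability_term:.3f}")
```

In this sketch the ring ($\lambda$ close to $1$, small spectral gap) produces a much larger $\lambda^2$ term than the fully connected topology ($\lambda=0$), which is consistent with the abstract's claim that better connectivity correlates with better generalizability.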
Related papers
- On the $O(\frac{\sqrt{d}}{T^{1/4}})$ Convergence Rate of RMSProp and Its Momentum Extension Measured by $\ell_1$ Norm [59.65871549878937]
This paper considers RMSProp and its momentum extension and establishes a convergence rate of $\frac{1}{T}\sum_{k=1}^{T}\mathbb{E}\|\nabla f(x_k)\|_1\le O\big(\frac{\sqrt{d}}{T^{1/4}}\big)$ measured by the $\ell_1$ norm.
Our convergence rate matches the lower bound with respect to all the coefficients except the dimension $d$.
Our convergence rate can be considered analogous to the $\frac{1}{T}\sum_{k=1}^{T}\mathbb{E}\|\nabla f(x_k)\|_2\le O\big(\frac{1}{T^{1/4}}\big)$ rate of SGD measured by the $\ell_2$ norm.
arXiv Detail & Related papers (2024-02-01T07:21:32Z) - A Unified Framework for Uniform Signal Recovery in Nonlinear Generative
Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - Convergence Analysis of Decentralized ASGD [1.8710230264817358]
We present a novel convergence-rate analysis for decentralized asynchronous SGD (DASGD) which requires neither partial synchronization among nodes nor restrictive network topologies.
Our convergence proof holds for a fixed stepsize and any nonsmooth, homogeneous, L-shaped objective function.
arXiv Detail & Related papers (2023-09-07T14:50:31Z) - Sharper Convergence Guarantees for Asynchronous SGD for Distributed and
Federated Learning [77.22019100456595]
We study a training algorithm for distributed computation in which workers communicate at varying frequencies.
In this work, we obtain a tighter convergence rate of $\mathcal{O}\!\big(\sigma^2\epsilon^{-2}+\tau_{\mathrm{avg}}\epsilon^{-1}\big)$.
We also show that the heterogeneity term in the rate is affected by the average delay within each worker.
arXiv Detail & Related papers (2022-06-16T17:10:57Z) - Generalization Bounds for Gradient Methods via Discrete and Continuous
Prior [8.76346911214414]
We show a new high-probability generalization bound of order $O\big(\frac{1}{n}+\frac{L^2}{n^2}\sum_{t=1}^{T}(\gamma_t/\varepsilon_t)^2\big)$ for Gradient Langevin Dynamics (GLD).
We can also obtain new bounds for certain variants of SGD.
arXiv Detail & Related papers (2022-05-27T07:23:01Z) - What Happens after SGD Reaches Zero Loss? --A Mathematical Framework [35.31946061894308]
Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning.
This paper gives a general framework for such analysis by adapting ideas from Katzenberger (1991).
It yields some new results: (1) a global analysis of the implicit bias valid for $\eta^{-2}$ steps, in contrast to the local analysis of Blanc et al. (2020) that is only valid for $\eta^{-1.6}$ steps, and (2) allowing arbitrary noise covariance.
arXiv Detail & Related papers (2021-10-13T17:50:46Z) - Removing Data Heterogeneity Influence Enhances Network Topology
Dependence of Decentralized SGD [15.112499553818953]
We study the non-asymptotic convergence properties of the D$^2$/Exact-diffusion algorithm.
Compared with existing decentralized algorithms, D$^2$/Exact-diffusion is least sensitive to network topology.
arXiv Detail & Related papers (2021-05-17T17:16:52Z) - SGD Generalizes Better Than GD (And Regularization Doesn't Help) [39.588906680621825]
We give a new separation result between the generalization performance of stochastic gradient descent (SGD) and of full-batch gradient descent (GD).
We show that with the same number of steps, GD may overfit and emit a solution with $\Omega(1)$ generalization error.
We discuss how regularizing the empirical risk minimized by GD essentially does not change the above result, and revisit the concepts of stability, implicit bias and the role of the learning algorithm in generalization.
arXiv Detail & Related papers (2021-02-01T19:18:40Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2})+\epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z) - Curse of Dimensionality on Randomized Smoothing for Certifiable
Robustness [151.67113334248464]
We show that extending the smoothing technique to defend against other attack models can be challenging.
We present experimental results on CIFAR to validate our theory.
arXiv Detail & Related papers (2020-02-08T22:02:14Z)