Heavy-Tail Phenomenon in Decentralized SGD
- URL: http://arxiv.org/abs/2205.06689v2
- Date: Mon, 16 May 2022 14:31:35 GMT
- Title: Heavy-Tail Phenomenon in Decentralized SGD
- Authors: Mert Gurbuzbalaban, Yuanhan Hu, Umut Simsekli, Kun Yuan, Lingjiong Zhu
- Abstract summary: We study the emergence of heavy tails in decentralized stochastic gradient descent (DE-SGD).
We also investigate the effect of decentralization on the tail behavior.
Our theory uncovers an interesting interplay between the tails and the network structure.
- Score: 33.63000461985398
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent theoretical studies have shown that heavy tails can emerge in
stochastic optimization due to 'multiplicative noise', even under surprisingly
simple settings, such as linear regression with Gaussian data. While these
studies have uncovered several interesting phenomena, they consider
conventional stochastic optimization problems, which exclude decentralized
settings that naturally arise in modern machine learning applications. In this
paper, we study the emergence of heavy tails in decentralized stochastic
gradient descent (DE-SGD), and investigate the effect of decentralization on
the tail behavior. We first show that, when the loss function at each
computational node is twice continuously differentiable and strongly convex
outside a compact region, the law of the DE-SGD iterates converges to a
distribution with polynomially decaying (heavy) tails. To have a more explicit
control on the tail exponent, we then consider the case where the loss at each
node is a quadratic, and show that the tail-index can be estimated as a
function of the step-size, batch-size, and the topological properties of the
network of the computational nodes. Then, we provide theoretical and empirical
results showing that DE-SGD has heavier tails than centralized SGD. We also
compare DE-SGD to disconnected SGD where nodes distribute the data but do not
communicate. Our theory uncovers an interesting interplay between the tails and
the network structure: we identify two regimes of parameters (stepsize and
network size), where DE-SGD can have lighter or heavier tails than disconnected
SGD depending on the regime. Finally, to support our theoretical results, we
provide numerical experiments conducted on both synthetic data and neural
networks.
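To make the abstract's setting concrete, below is a minimal simulation sketch in Python/NumPy, not the authors' code: DE-SGD on per-node least-squares (quadratic) losses with Gaussian data, where the network enters through a doubly stochastic mixing matrix W. Setting W to the identity recovers the disconnected-SGD baseline, and setting W to the uniform averaging matrix mimics the centralized case. The ring topology, the step-size and batch-size values, and the use of a Hill estimator for the tail index are illustrative assumptions, not values or methods taken from the paper.

```python
# Minimal sketch (assumptions noted above): DE-SGD on node-local quadratic losses,
# followed by a Hill estimate of the tail index of the stationary iterate norms.
import numpy as np

rng = np.random.default_rng(0)


def ring_gossip_matrix(n):
    """Doubly stochastic mixing matrix for a ring topology (illustrative choice)."""
    W = np.eye(n) / 2.0
    for i in range(n):
        W[i, (i - 1) % n] += 0.25
        W[i, (i + 1) % n] += 0.25
    return W


def de_sgd(W, d=3, eta=0.1, batch=1, iters=20000, burn_in=5000):
    """Run DE-SGD on node-local least-squares losses with Gaussian data.

    Update: x_i <- sum_j W[i, j] * x_j - eta * g_i, where g_i is a minibatch
    gradient of (1/(2*batch)) * ||A x_i - y||^2 at node i.
    """
    n = W.shape[0]
    X = np.zeros((n, d))                        # one parameter vector per node
    norms = []
    for t in range(iters):
        grads = np.zeros((n, d))
        for i in range(n):
            A = rng.normal(size=(batch, d))     # Gaussian features -> multiplicative noise
            y = rng.normal(size=batch)          # Gaussian targets (true parameter is 0)
            grads[i] = A.T @ (A @ X[i] - y) / batch
        X = W @ X - eta * grads                 # mix with neighbours, then local SGD step
        if t >= burn_in:
            norms.append(np.linalg.norm(X[0]))  # track node 0's iterate norm
    return np.array(norms)


def hill_estimator(samples, k=500):
    """Hill estimator of the tail index from the k largest sample norms."""
    s = np.sort(samples)[::-1][: k + 1]
    return 1.0 / np.mean(np.log(s[:-1] / s[-1]))


n = 8
settings = [
    ("DE-SGD (ring)", ring_gossip_matrix(n)),
    ("disconnected SGD", np.eye(n)),                   # nodes never communicate
    ("centralized-style", np.full((n, n), 1.0 / n)),   # full averaging every step
]
for name, W in settings:
    alpha_hat = hill_estimator(de_sgd(W))
    print(f"{name:18s} estimated tail index ~ {alpha_hat:.2f}")
```

The comparison mirrors the abstract's discussion of DE-SGD versus centralized and disconnected SGD; the numbers produced by this toy run only illustrate the kind of measurement involved and should not be read as reproducing the paper's quantitative results.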
Related papers
- On the Convergence of (Stochastic) Gradient Descent for Kolmogorov--Arnold Networks [56.78271181959529]
Kolmogorov--Arnold Networks (KANs) have gained significant attention in the deep learning community.
Empirical investigations demonstrate that KANs optimized via stochastic gradient descent (SGD) are capable of achieving near-zero training loss.
arXiv Detail & Related papers (2024-10-10T15:34:10Z) - Approximate Heavy Tails in Offline (Multi-Pass) Stochastic Gradient Descent [33.9917975060585]
We show that the stationary distribution of offline (also called multi-pass) SGD exhibits 'approximate' power-law tails.
Our main takeaway is that, as the number of data points increases, offline SGD will behave increasingly 'power-law-like'.
arXiv Detail & Related papers (2023-10-27T20:06:03Z) - Differentially Private Non-convex Learning for Multi-layer Neural Networks [35.24835396398768]
This paper focuses on the problem of Differentially Private Stochastic Optimization for (multi-layer) fully connected neural networks with a single output node.
By utilizing recent advances in Neural Tangent Kernel (NTK) theory, we provide the first excess population risk bound when both the sample size and the width of the network are sufficiently large.
arXiv Detail & Related papers (2023-10-12T15:48:14Z) - Law of Balance and Stationary Distribution of Stochastic Gradient Descent [11.937085301750288]
We prove that the minibatch noise of stochastic gradient descent (SGD) regularizes the solution towards a balanced solution whenever the loss function contains a rescaling symmetry.
We then derive the stationary distribution of stochastic gradient flow for a diagonal linear network with arbitrary depth and width.
These phenomena are shown to exist uniquely in deep networks, implying a fundamental difference between deep and shallow models.
arXiv Detail & Related papers (2023-08-13T03:13:03Z) - Machine learning in and out of equilibrium [58.88325379746631]
Our study uses a Fokker-Planck approach, adapted from statistical physics, to explore these parallels.
We focus in particular on the stationary state of the system in the long-time limit, which in conventional SGD is out of equilibrium.
We propose a new variation of stochastic gradient Langevin dynamics (SGLD) that harnesses without-replacement minibatching.
arXiv Detail & Related papers (2023-06-06T09:12:49Z) - Decentralized SGD and Average-direction SAM are Asymptotically Equivalent [101.37242096601315]
Decentralized stochastic gradient descent (D-SGD) allows collaborative learning on massive devices simultaneously without the control of a central server.
Existing theories claim that decentralization invariably undermines generalization.
arXiv Detail & Related papers (2023-06-05T14:19:52Z) - BCD Nets: Scalable Variational Approaches for Bayesian Causal Discovery [97.79015388276483]
A structural equation model (SEM) is an effective framework to reason over causal relationships represented via a directed acyclic graph (DAG).
Recent advances enabled effective maximum-likelihood point estimation of DAGs from observational data.
We propose BCD Nets, a variational framework for estimating a distribution over DAGs characterizing a linear-Gaussian SEM.
arXiv Detail & Related papers (2021-12-06T03:35:21Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - The Heavy-Tail Phenomenon in SGD [7.366405857677226]
We show that depending on the structure of the Hessian of the loss at the minimum, the SGD iterates will converge to a heavy-tailed stationary distribution.
We translate our results into insights about the behavior of SGD in deep learning.
arXiv Detail & Related papers (2020-06-08T16:43:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.