$\alpha$-Stable convergence of heavy-tailed infinitely-wide neural
networks
- URL: http://arxiv.org/abs/2106.11064v1
- Date: Fri, 18 Jun 2021 01:36:41 GMT
- Title: $\alpha$-Stable convergence of heavy-tailed infinitely-wide neural
networks
- Authors: Paul Jung, Hoil Lee, Jiho Lee, and Hongseok Yang
- Abstract summary: Infinitely-wide multi-layer perceptrons (MLPs) are limits of standard deep feed-forward neural networks.
We show that the vector of pre-activation values at all nodes of a given hidden layer converges, under a suitable scaling, to a vector of i.i.d. symmetric $\alpha$-stable random variables.
- Score: 8.880921123362294
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider infinitely-wide multi-layer perceptrons (MLPs) which are limits
of standard deep feed-forward neural networks. We assume that, for each layer,
the weights of an MLP are initialized with i.i.d. samples from either a
light-tailed (finite variance) or heavy-tailed distribution in the domain of
attraction of a symmetric $\alpha$-stable distribution, where $\alpha\in(0,2]$
may depend on the layer. For the bias terms of each layer, we assume i.i.d.
initializations with a symmetric $\alpha$-stable distribution having the same
$\alpha$ parameter as that layer. We then extend a recent result of Favaro,
Fortini, and Peluchetti (2020) to show that the vector of pre-activation
values at all nodes of a given hidden layer converges in the limit, under a
suitable scaling, to a vector of i.i.d. random variables with symmetric
$\alpha$-stable distributions.
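As a rough numerical illustration of the limiting behaviour described above, the following sketch (my own construction, not the authors' code) initializes one hidden layer of width $n$ with i.i.d. symmetric $\alpha$-stable weights via scipy's levy_stable and forms a next-layer pre-activation under an $n^{-1/\alpha}$ scaling; the width, input dimension, and tanh nonlinearity are arbitrary choices.

```python
import numpy as np
from scipy.stats import levy_stable  # symmetric alpha-stable sampler (beta = 0)

# Minimal sketch, assuming weights drawn directly from a symmetric alpha-stable
# law (which lies in its own domain of attraction) and an n^{-1/alpha} scaling
# of the hidden-layer sum.
rng = np.random.default_rng(0)
alpha = 1.5          # stability index in (0, 2]
n = 10_000           # hidden-layer width (taken to infinity in the paper)
d = 5                # input dimension (fixed)
x = rng.normal(size=d)

# First-layer weights and biases (beta = 0 gives the symmetric case)
W1 = levy_stable.rvs(alpha, 0.0, size=(n, d), random_state=rng)
b1 = levy_stable.rvs(alpha, 0.0, size=n, random_state=rng)
h = np.tanh(W1 @ x + b1)                  # post-activations of the hidden layer

# Second-layer pre-activation at a single node, with the n^{-1/alpha} scaling
W2 = levy_stable.rvs(alpha, 0.0, size=n, random_state=rng)
b2 = levy_stable.rvs(alpha, 0.0, random_state=rng)
z = (W2 @ h) / n**(1 / alpha) + b2

# Repeating this over many independent draws of the weights, the empirical law
# of z should approach a symmetric alpha-stable distribution as n grows.
print(z)
```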
Related papers
- Multilayer Correlation Clustering [12.492037397168579]
We establish Multilayer Correlation Clustering, a novel generalization of Correlation Clustering (Bansal et al., FOCS '02) to the multilayer setting.
In this paper, we are given a series of inputs of Correlation Clustering (called layers) over a common set $V$.
The goal is then to find a clustering of $V$ that minimizes the $\ell_p$-norm ($p \geq 1$) of the disagreements vector; a toy version of this objective is sketched below.
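To make the objective concrete, here is a hedged Python sketch. It assumes, as one reading of the abstract, that the disagreements vector has one entry per layer counting the pairs on which the clustering contradicts that layer's similar/dissimilar labels; the paper's exact definition may differ.

```python
import numpy as np

def disagreements(layer_signs, clustering):
    """Count pairs on which a clustering disagrees with one layer's labels."""
    count = 0
    for (u, v), sign in layer_signs.items():
        same = clustering[u] == clustering[v]
        if (sign > 0 and not same) or (sign < 0 and same):
            count += 1
    return count

def multilayer_cost(layers, clustering, p=2):
    """l_p-norm of the per-layer disagreement counts (assumed objective)."""
    d = np.array([disagreements(layer, clustering) for layer in layers])
    return np.linalg.norm(d, ord=p)

# Toy instance: V = {0, 1, 2}, two layers over the same vertex set,
# +1 = similar pair, -1 = dissimilar pair.
layer1 = {(0, 1): +1, (0, 2): +1, (1, 2): -1}
layer2 = {(0, 1): -1, (0, 2): +1, (1, 2): +1}
print(multilayer_cost([layer1, layer2], clustering={0: 0, 1: 0, 2: 1}, p=1))
```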
arXiv Detail & Related papers (2024-04-25T15:25:30Z)
- Depth Dependence of $\mu$P Learning Rates in ReLU MLPs [72.14317069090407]
We study the dependence on $n$ and $L$ of the maximal update ($\mu$P) learning rate.
We find that it has a non-trivial dependence on $L$, scaling like $L^{-3/2}$.
arXiv Detail & Related papers (2023-05-13T01:10:49Z)
- Generalization and Stability of Interpolating Neural Networks with Minimal Width [37.908159361149835]
We investigate the generalization and optimization of shallow neural networks trained by gradient descent in the interpolating regime.
We prove convergence of the training loss with $m=\Omega(\log^4(n))$ neurons and $T\approx n$ iterations.
With $m=\Omega(\log^4(n))$ neurons and $T\approx n$, we bound the test loss by $\tilde{O}(1/)$.
arXiv Detail & Related papers (2023-02-18T05:06:15Z)
- On the Accuracy of Hotelling-Type Tensor Deflation: A Random Tensor Analysis [16.28927188636617]
A rank-$r$ asymmetric spiked model of the form $\sum_{i=1}^{r} \beta_i A_i + W$ is considered.
We provide a study of Hotelling-type deflation in the large-dimensional regime; a toy construction of this model is sketched below.
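Below is a minimal sketch of the spiked model $\sum_{i=1}^{r}\beta_i A_i + W$ with rank-one components $A_i$ and one Hotelling-type deflation step, using plain tensor power iteration as a stand-in estimator; the order-3 tensors, the $1/\sqrt{n}$ noise normalization, and the power-iteration routine are my own assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 50, 2
beta = [5.0, 3.0]

def unit(v):
    return v / np.linalg.norm(v)

# Rank-r spiked tensor: sum_i beta_i * (x_i outer y_i outer z_i) + noise
components = [tuple(unit(rng.normal(size=n)) for _ in range(3)) for _ in range(r)]
W = rng.normal(size=(n, n, n)) / np.sqrt(n)       # noise scaling is an assumption
T = W + sum(b * np.einsum('i,j,k->ijk', x, y, z)
            for b, (x, y, z) in zip(beta, components))

def rank1_power(T, iters=100):
    """Estimate the leading rank-1 component by alternating power iteration."""
    x, y, z = (unit(rng.normal(size=T.shape[0])) for _ in range(3))
    for _ in range(iters):
        x = unit(np.einsum('ijk,j,k->i', T, y, z))
        y = unit(np.einsum('ijk,i,k->j', T, x, z))
        z = unit(np.einsum('ijk,i,j->k', T, x, y))
    lam = np.einsum('ijk,i,j,k->', T, x, y, z)
    return lam, x, y, z

# One Hotelling-type deflation step: estimate, then subtract the rank-1 term.
lam, x, y, z = rank1_power(T)
T_deflated = T - lam * np.einsum('i,j,k->ijk', x, y, z)
print("estimated spike strength:", lam)
```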
arXiv Detail & Related papers (2022-11-16T16:01:56Z)
- Bounding the Width of Neural Networks via Coupled Initialization -- A Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z)
- Overparametrized linear dimensionality reductions: From projection pursuit to two-layer neural networks [10.368585938419619]
Given a cloud of $n$ data points in $\mathbb{R}^d$, consider all projections onto $m$-dimensional subspaces of $\mathbb{R}^d$.
What does this collection of probability distributions look like when $n,d$ grow large?
Denoting by $\mathscr{F}_{m,\alpha}$ the set of probability distributions in $\mathbb{R}^m$ that arise as low-dimensional projections in this limit, we establish new inner and outer bounds on $\mathscr{F}_{m,\alpha}$; a toy projection is sketched below.
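A small illustration of the object being studied, restricted (as a simplifying assumption) to a single randomly chosen $m$-dimensional subspace; the paper ranges over all subspaces and takes $n, d \to \infty$, presumably with $n/d$ tending to a constant $\alpha$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 2000, 500, 2
X = rng.normal(size=(n, d))                   # data cloud in R^d (toy choice)

Q, _ = np.linalg.qr(rng.normal(size=(d, m)))  # orthonormal basis of a random
                                              # m-dimensional subspace
proj = X @ Q                                  # projected cloud in R^m

# The empirical distribution of `proj` is one element of the collection whose
# limiting shape (the set F_{m,alpha}) the paper bounds.
print(proj.mean(axis=0), proj.std(axis=0))
```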
arXiv Detail & Related papers (2022-06-14T00:07:33Z)
- On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z)
- A Law of Robustness beyond Isoperimetry [84.33752026418045]
We prove a lower bound of $\Omega(\sqrt{n/p})$ on the Lipschitz constant of interpolating neural networks with $p$ parameters on arbitrary distributions.
We then show the potential benefit of overparametrization for smooth data when $n=\mathrm{poly}(d)$.
We disprove the potential existence of an $O(1)$-Lipschitz robust interpolating function when $n=\exp(\omega(d))$.
arXiv Detail & Related papers (2022-02-23T16:10:23Z)
- Polyak-Ruppert Averaged Q-Learning is Statistically Efficient [90.14768299744792]
We study synchronous Q-learning with Polyak-Ruppert averaging (a.k.a. averaged Q-learning) in a $\gamma$-discounted MDP.
We establish asymptotic normality for the iteration-averaged $\bar{\boldsymbol{Q}}_T$.
In short, our theoretical analysis shows averaged Q-learning is statistically efficient; a toy implementation of the averaged iterate appears below.
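For concreteness, here is a hedged sketch of synchronous Q-learning with Polyak-Ruppert (iterate) averaging on a small random tabular MDP; the toy MDP, the rewards, and the step-size schedule are my own choices rather than the paper's assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, T = 5, 3, 0.9, 5000
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P(s' | s, a)
R = rng.uniform(size=(S, A))                 # deterministic rewards (toy choice)

Q = np.zeros((S, A))
Q_bar = np.zeros((S, A))
for t in range(1, T + 1):
    eta = 1.0 / (1.0 + (1 - gamma) * t)      # step-size schedule (assumption)
    # Synchronous update: every (s, a) draws one next-state sample per step.
    s_next = np.array([[rng.choice(S, p=P[s, a]) for a in range(A)]
                       for s in range(S)])
    target = R + gamma * Q.max(axis=1)[s_next]
    Q = (1 - eta) * Q + eta * target
    Q_bar += (Q - Q_bar) / t                 # running Polyak-Ruppert average

# Q_bar plays the role of the averaged iterate \bar{Q}_T studied in the paper.
print(Q_bar)
```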
arXiv Detail & Related papers (2021-12-29T14:47:56Z)
- Fundamental tradeoffs between memorization and robustness in random features and neural tangent regimes [15.76663241036412]
We prove for a large class of activation functions that, if the model memorizes even a fraction of the training data, then its Sobolev seminorm is lower-bounded.
Experiments reveal, for the first time, a multiple-descent phenomenon in the robustness of the min-norm interpolator.
arXiv Detail & Related papers (2021-06-04T17:52:50Z)
- Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2})+\epsilon$; a toy gradient-descent run is sketched below.
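A toy sketch of the setting (my own construction): gradient descent on the empirical square loss of a single ReLU neuron with noisy labels, so the best-fitting neuron has nonzero risk $\mathsf{OPT}$, loosely matching the agnostic setting.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lr, steps = 10, 5000, 0.5, 500
relu = lambda z: np.maximum(z, 0.0)

w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)
X = rng.normal(size=(n, d))
y = relu(X @ w_star) + 0.1 * rng.normal(size=n)   # noisy labels (agnostic case)

w = 0.01 * rng.normal(size=d)                     # small random initialization
for _ in range(steps):
    pred = relu(X @ w)
    grad = X.T @ ((pred - y) * (X @ w > 0)) / n   # gradient of 0.5 * mean sq. loss
    w -= lr * grad

print("empirical risk:", 0.5 * np.mean((relu(X @ w) - y) ** 2))
```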
arXiv Detail & Related papers (2020-05-29T07:20:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.