The Neural Covariance SDE: Shaped Infinite Depth-and-Width Networks at
Initialization
- URL: http://arxiv.org/abs/2206.02768v3
- Date: Wed, 14 Jun 2023 19:07:07 GMT
- Title: The Neural Covariance SDE: Shaped Infinite Depth-and-Width Networks at
Initialization
- Authors: Mufan Bill Li, Mihai Nica, Daniel M. Roy
- Abstract summary: Recent work has shown that shaping the activation function as network depth grows large is necessary for the covariance matrix defined by the penultimate layer to remain non-degenerate.
We identify the precise scaling of the activation function necessary to arrive at a nontrivial limit.
We recover an if-and-only-if condition for exploding and vanishing norms of large shaped networks based on the activation function.
- Score: 13.872374586700767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The logit outputs of a feedforward neural network at initialization are
conditionally Gaussian, given a random covariance matrix defined by the
penultimate layer. In this work, we study the distribution of this random
matrix. Recent work has shown that shaping the activation function as network
depth grows large is necessary for this covariance matrix to be non-degenerate.
However, the current infinite-width-style understanding of this shaping method
is unsatisfactory for large depth: infinite-width analyses ignore the
microscopic fluctuations from layer to layer, but these fluctuations accumulate
over many layers.
To overcome this shortcoming, we study the random covariance matrix in the
shaped infinite-depth-and-width limit. We identify the precise scaling of the
activation function necessary to arrive at a non-trivial limit, and show that
the random covariance matrix is governed by a stochastic differential equation
(SDE) that we call the Neural Covariance SDE. Using simulations, we show that
the SDE closely matches the distribution of the random covariance matrix of
finite networks. Additionally, we recover an if-and-only-if condition for
exploding and vanishing norms of large shaped networks based on the activation
function.
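The simulation claim in the abstract is easy to reproduce in outline. Below is a minimal NumPy sketch under assumed choices (shaped ReLU with slopes 1 ± c/√n, weight variance 1/n, and depth proportional to width) that follow the spirit but not necessarily the exact setup of the paper: it samples the penultimate-layer Gram matrix of a finite shaped network at initialization, whose distribution over random draws is what the Neural Covariance SDE is meant to describe.

```python
# Minimal sketch: empirical distribution of the penultimate-layer Gram
# (covariance) matrix of a finite *shaped* MLP at initialization.
# Assumptions (illustrative, not the paper's exact setup): shaped ReLU with
# slopes 1 +/- c/sqrt(n), weight variance 1/n, and depth d = ratio * n.
import numpy as np

def shaped_relu(x, n, c=1.0):
    """Shaped ReLU: slope 1 + c/sqrt(n) for x > 0 and 1 - c/sqrt(n) for x < 0."""
    s_plus, s_minus = 1.0 + c / np.sqrt(n), 1.0 - c / np.sqrt(n)
    return np.where(x > 0, s_plus * x, s_minus * x)

def penultimate_gram(x, n=256, depth_to_width=1.0, c=1.0, rng=None):
    """Propagate a batch of inputs through a random shaped MLP and return
    the normalized Gram matrix of the penultimate layer."""
    rng = np.random.default_rng(rng)
    d = int(depth_to_width * n)  # depth proportional to width
    h = x @ rng.normal(0.0, 1.0 / np.sqrt(x.shape[1]), size=(x.shape[1], n))
    for _ in range(d):
        h = shaped_relu(h, n, c) @ rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
    return h @ h.T / n  # the random covariance matrix for this draw

# Two unit-norm inputs; the 2x2 Gram matrix fluctuates from draw to draw,
# and its law over draws is what the Neural Covariance SDE describes.
x = np.array([[1.0, 0.0], [np.sqrt(0.5), np.sqrt(0.5)]])
samples = np.stack([penultimate_gram(x, rng=s) for s in range(200)])
print("mean Gram:\n", samples.mean(axis=0))
print("std  Gram:\n", samples.std(axis=0))
```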
Related papers
- Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z)
- Differentially Private Non-convex Learning for Multi-layer Neural Networks [35.24835396398768]
This paper focuses on the problem of Differentially Private Tangent Optimization for (multi-layer) fully connected neural networks with a single output node.
By utilizing recent advances in Neural Tangent Kernel theory, we provide the first excess population risk bound for the regime where both the sample size and the width of the network are sufficiently large.
arXiv Detail & Related papers (2023-10-12T15:48:14Z)
- The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit [38.89510345229949]
We study the covariance matrix of a modified Softmax-based attention model with skip connections in the proportional limit of infinite-depth-and-width.
To achieve a well-defined limit, the Transformer's attention mechanism is modified by centering the Softmax output at identity (a minimal sketch of this centering appears after this list).
We show, through simulations, that the stochastic differential equation (SDE) indexed by the depth-to-width ratio provides a surprisingly good description of the corresponding finite-size model.
arXiv Detail & Related papers (2023-06-30T16:10:36Z)
- The Implicit Bias of Minima Stability in Multivariate Shallow ReLU Networks [53.95175206863992]
We study the type of solutions to which gradient descent converges when used to train a single hidden-layer multivariate ReLU network with the quadratic loss.
We prove that although shallow ReLU networks are universal approximators, stable shallow networks are not.
arXiv Detail & Related papers (2023-06-30T09:17:39Z)
- Decomposed Diffusion Sampler for Accelerating Large-Scale Inverse Problems [64.29491112653905]
We propose a novel and efficient diffusion sampling strategy that synergistically combines the diffusion sampling and Krylov subspace methods.
Specifically, we prove that if the tangent space at a sample denoised by Tweedie's formula forms a Krylov subspace, then conjugate gradient (CG) iterations with the denoised data keep the data-consistency update within that tangent space.
Our proposed method achieves more than 80 times faster inference time than the previous state-of-the-art method.
arXiv Detail & Related papers (2023-03-10T07:42:49Z)
- High-dimensional limit theorems for SGD: Effective dynamics and critical scaling [6.950316788263433]
We prove limit theorems for the trajectories of summary statistics of stochastic gradient descent (SGD).
We show a critical scaling regime for the step-size, below which the effective ballistic dynamics matches gradient flow for the population loss.
Around the fixed points of these effective dynamics, the corresponding diffusive limits can be quite complex and even degenerate.
arXiv Detail & Related papers (2022-06-08T17:42:18Z)
- On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z)
- Deep neural networks with dependent weights: Gaussian Process mixture limit, heavy tails, sparsity and compressibility [18.531464406721412]
This article studies the infinite-width limit of deep feedforward neural networks whose weights are dependent.
Each hidden node of the network is assigned a nonnegative random variable that controls the variance of the outgoing weights of that node.
arXiv Detail & Related papers (2022-05-17T09:14:32Z)
- Global convergence of ResNets: From finite to infinite width using linear parameterization [0.0]
We study Residual Networks (ResNets) in which the residual block has linear parametrization while still being nonlinear.
In this limit, we prove a local Polyak-Lojasiewicz inequality, retrieving the lazy regime.
Our analysis leads to a practical and quantified recipe.
arXiv Detail & Related papers (2021-12-10T13:38:08Z)
- Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z)
- Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
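As a companion to the Shaped Transformer entry above, here is a minimal sketch of one way to center a Softmax attention matrix at the identity so that attention becomes a small perturbation of a skip connection. The 1/√n shaping factor `tau` and the uniform centering term are illustrative assumptions, not necessarily the exact parameterization used in that paper.

```python
# Illustrative sketch of "centering the Softmax output at identity":
# the attention matrix becomes identity plus a small, centered perturbation.
# The shaping factor tau ~ 1/sqrt(n) and the uniform centering term are
# assumptions for illustration, not necessarily that paper's exact choice.
import numpy as np

def shaped_attention(Q, K, n_width):
    """Return a 'shaped' attention matrix: I + tau * (Softmax - uniform)."""
    T, d_k = Q.shape                       # sequence length, key dimension
    logits = Q @ K.T / np.sqrt(d_k)
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)      # row-wise Softmax
    tau = 1.0 / np.sqrt(n_width)           # assumed shaping scale
    return np.eye(T) + tau * (A - np.full((T, T), 1.0 / T))

# Usage: as the width grows, the shaped attention approaches the identity,
# which is the kind of scaling that keeps a depth-and-width limit well defined.
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
for n in (16, 256, 4096):
    dev = np.linalg.norm(shaped_attention(Q, K, n) - np.eye(8))
    print(f"n = {n:5d}  ||A_shaped - I||_F = {dev:.4f}")
```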