Infinite-width limit of deep linear neural networks
- URL: http://arxiv.org/abs/2211.16980v1
- Date: Tue, 29 Nov 2022 18:37:14 GMT
- Title: Infinite-width limit of deep linear neural networks
- Authors: Lénaïc Chizat, Maria Colombo, Xavier Fernández-Real, Alessio Figalli
- Abstract summary: We study the infinite-width limit of deep linear neural networks with random parameters.
We show that the linear predictors converge at an exponential rate to the minimal $\ell_2$-norm minimizer of the risk.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies the infinite-width limit of deep linear neural networks
initialized with random parameters. We obtain that, when the number of neurons
diverges, the training dynamics converge (in a precise sense) to the dynamics
obtained from a gradient descent on an infinitely wide deterministic linear
neural network. Moreover, even if the weights remain random, we get their
precise law along the training dynamics, and prove a quantitative convergence
result of the linear predictor in terms of the number of neurons.
We finally study the continuous-time limit obtained for infinitely wide
linear neural networks and show that the linear predictors of the neural
network converge at an exponential rate to the minimal $\ell_2$-norm minimizer
of the risk.
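As a quick numerical illustration of the abstract's last claim, the sketch below trains a three-layer linear network of increasing width by plain gradient descent on a toy underdetermined least-squares problem and measures the distance between its end-to-end linear predictor and the minimal $\ell_2$-norm minimizer computed via the pseudoinverse. The problem sizes, $1/\sqrt{m}$ initialization, step size, and number of steps are illustrative assumptions rather than the paper's exact construction and scaling; the printed gap should shrink as the width $m$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                                 # fewer samples than features: many interpolating predictors
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
beta_star = np.linalg.pinv(X) @ y             # minimal l2-norm minimizer of the risk

def train_deep_linear(m, steps=10000, lr=5e-3):
    """Plain gradient descent on 0.5/n * ||X @ beta - y||^2 with beta = W1.T @ W2.T @ w3."""
    W1 = rng.standard_normal((m, d)) / np.sqrt(m)
    W2 = rng.standard_normal((m, m)) / np.sqrt(m)
    w3 = rng.standard_normal(m) / np.sqrt(m)
    for _ in range(steps):
        u = W2.T @ w3                         # reused below in the chain rule
        beta = W1.T @ u                       # end-to-end linear predictor in R^d
        g = X.T @ (X @ beta - y) / n          # gradient of the risk w.r.t. beta
        v = W1 @ g
        gW1 = np.outer(u, g)                  # dL/dW1
        gW2 = np.outer(w3, v)                 # dL/dW2
        gw3 = W2 @ v                          # dL/dw3
        W1 -= lr * gW1
        W2 -= lr * gW2
        w3 -= lr * gw3
    return W1.T @ (W2.T @ w3)

for m in (10, 100, 400):
    gap = np.linalg.norm(train_deep_linear(m) - beta_star)
    print(f"width {m:4d}: distance to the min-norm solution = {gap:.3f}")
```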
Related papers
- Proportional infinite-width infinite-depth limit for deep linear neural networks [0.16385815610837165]
We study the distributional properties of linear neural networks with random parameters in the context of large networks, where the number of layers diverges in proportion to the number of neurons per layer.
We explore the joint proportional limit in which both depth and width diverge but maintain a constant ratio, yielding a non-Gaussian distribution that retains correlations between outputs.
arXiv Detail & Related papers (2024-11-22T11:25:52Z) - How many Neurons do we need? A refined Analysis for Shallow Networks trained with Gradient Descent [0.0]
We analyze the generalization properties of two-layer neural networks in the neural tangent kernel regime.
We derive fast rates of convergence that are known to be minimax optimal in the framework of non-parametric regression.
arXiv Detail & Related papers (2023-09-14T22:10:28Z) - Speed Limits for Deep Learning [67.69149326107103]
Recent advances in thermodynamics allow bounding the speed at which one can go from the initial weight distribution to the final distribution of the fully trained network.
We provide analytical expressions for these speed limits for linear and linearizable neural networks.
Remarkably, under some plausible scaling assumptions on the NTK spectra and the spectral decomposition of the labels, learning is optimal in a scaling sense.
arXiv Detail & Related papers (2023-07-27T06:59:46Z) - Addressing caveats of neural persistence with deep graph persistence [54.424983583720675]
We find that the variance of network weights and spatial concentration of large weights are the main factors that impact neural persistence.
We propose an extension of the filtration underlying neural persistence to the whole neural network instead of single layers.
This yields our deep graph persistence measure, which implicitly incorporates persistent paths through the network and alleviates variance-related issues.
arXiv Detail & Related papers (2023-07-20T13:34:11Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - Mean-Field Analysis of Two-Layer Neural Networks: Global Optimality with Linear Convergence Rates [7.094295642076582]
The mean-field regime is a theoretically attractive alternative to the NTK (lazy training) regime.
We establish a new linear convergence result for two-layer neural networks trained by continuous-time noisy descent in the mean-field regime; a toy discretized sketch of this training setup appears after this list.
arXiv Detail & Related papers (2022-05-19T21:05:40Z) - On the Neural Tangent Kernel Analysis of Randomly Pruned Neural Networks [91.3755431537592]
We study how random pruning of the weights affects a neural network's neural tangent kernel (NTK).
In particular, this work establishes an equivalence of the NTKs between a fully-connected neural network and its randomly pruned version.
arXiv Detail & Related papers (2022-03-27T15:22:19Z) - Depth induces scale-averaging in overparameterized linear Bayesian neural networks [20.864082353441685]
Inference in deep Bayesian neural networks is only fully understood in the infinite-width limit.
Here, we interpret finite deep linear Bayesian neural networks as data-dependent scale mixtures of Gaussian process predictors across output channels.
arXiv Detail & Related papers (2021-11-23T15:48:47Z) - Linear approximability of two-layer neural networks: A comprehensive analysis based on spectral decay [4.042159113348107]
We first consider the case of a single neuron and show that the linear approximability, quantified by the Kolmogorov width, is controlled by the eigenvalue decay of an associated kernel.
We show that similar results also hold for two-layer neural networks.
arXiv Detail & Related papers (2021-08-10T23:30:29Z) - Generalization bound of globally optimal non-convex neural network training: Transportation map estimation by infinite dimensional Langevin dynamics [50.83356836818667]
We introduce a new theoretical framework to analyze deep learning optimization with connection to its generalization error.
Existing frameworks such as mean-field theory and neural tangent kernel theory for neural network optimization analysis typically require taking the infinite-width limit of the network to show global convergence.
arXiv Detail & Related papers (2020-07-11T18:19:50Z) - Measuring Model Complexity of Neural Networks with Curve Activation Functions [100.98319505253797]
We propose the linear approximation neural network (LANN) to approximate a given deep model with curve activation functions.
We experimentally explore the training process of neural networks and detect overfitting.
We find that the $L_1$ and $L_2$ regularizations suppress the increase of model complexity.
arXiv Detail & Related papers (2020-06-16T07:38:06Z)
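The sketch below is referenced from the mean-field entry in the list above. It shows a two-layer network in the mean-field $1/m$ output scaling trained by a crude discretization of noisy (Langevin-type) gradient descent. The data, width, step size, noise temperature, and learning-rate scaling are illustrative assumptions rather than that paper's setup; the training error printed at the end should decrease over the run, up to a floor set by the injected noise.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 64, 5, 256
X = rng.standard_normal((n, d))
y = np.tanh(X @ rng.standard_normal(d))        # toy teacher targets

W = rng.standard_normal((m, d))                # first-layer weights, one row per neuron
a = rng.standard_normal(m)                     # output weights
lr, tau, steps = 0.2, 1e-4, 5000               # step size and noise temperature (illustrative)

def forward(W, a, X):
    return np.tanh(X @ W.T) @ a / m            # mean-field 1/m output scaling

for _ in range(steps):
    H = np.tanh(X @ W.T)                       # hidden activations, shape (n, m)
    r = (H @ a / m - y) / n                    # scaled residual of the squared loss
    ga = H.T @ r / m                           # gradient w.r.t. the output weights
    gW = ((1 - H**2) * np.outer(r, a / m)).T @ X   # gradient w.r.t. the first layer
    # noisy descent: gradient step (with the usual mean-field m-scaling of the rate)
    # plus Gaussian noise, an Euler-Maruyama step of continuous-time noisy descent
    a -= lr * m * ga + np.sqrt(2 * lr * tau) * rng.standard_normal(m)
    W -= lr * m * gW + np.sqrt(2 * lr * tau) * rng.standard_normal((m, d))

print("training MSE after noisy descent:", np.mean((forward(W, a, X) - y) ** 2))
```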