On the infinite width limit of neural networks with a standard parameterization
- URL: http://arxiv.org/abs/2001.07301v3
- Date: Sat, 18 Apr 2020 21:06:06 GMT
- Title: On the infinite width limit of neural networks with a standard parameterization
- Authors: Jascha Sohl-Dickstein, Roman Novak, Samuel S. Schoenholz, Jaehoon Lee
- Abstract summary: We propose an improved extrapolation of the standard parameterization that preserves all of these properties as width is taken to infinity.
We show experimentally that the resulting kernels typically achieve similar accuracy to those resulting from an NTK parameterization.
- Score: 52.07828272324366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There are currently two parameterizations used to derive fixed kernels
corresponding to infinite width neural networks, the NTK (Neural Tangent
Kernel) parameterization and the naive standard parameterization. However, the
extrapolation of both of these parameterizations to infinite width is
problematic. The standard parameterization leads to a divergent neural tangent
kernel while the NTK parameterization fails to capture crucial aspects of
finite width networks, such as the dependence of training dynamics on relative
layer widths, the relative training dynamics of weights and biases, and overall
learning rate scale. Here we propose an improved extrapolation of the standard
parameterization that preserves all of these properties as width is taken to
infinity and yields a well-defined neural tangent kernel. We show
experimentally that the resulting kernels typically achieve similar accuracy to
those resulting from an NTK parameterization, but with better correspondence to
the parameterization of typical finite width networks. Additionally, with
careful tuning of width parameters, the improved standard parameterization
kernels can outperform those stemming from an NTK parameterization. We release
code implementing this improved standard parameterization as part of the Neural
Tangents library at https://github.com/google/neural-tangents.
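For reference, below is a minimal usage sketch (not taken from the paper) of computing infinite-width kernels for a network defined in the standard parameterization with the released Neural Tangents library. The layer widths, initialization scales, and data shapes are illustrative assumptions, and the exact `stax` API may vary across library versions.
```python
# Minimal sketch: infinite-width NNGP/NTK kernels for a network defined in the
# (improved) standard parameterization via Neural Tangents. Widths, W_std,
# b_std, and data shapes are illustrative assumptions, not values from the paper.
from jax import random
from neural_tangents import stax

init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512, W_std=1.5, b_std=0.05, parameterization='standard'),
    stax.Relu(),
    stax.Dense(1, W_std=1.5, b_std=0.05, parameterization='standard'),
)

key1, key2 = random.split(random.PRNGKey(0))
x_train = random.normal(key1, (10, 32))  # 10 examples, 32 input features
x_test = random.normal(key2, (3, 32))

# Closed-form infinite-width kernels between test and train inputs:
# 'nngp' is the Bayesian network kernel, 'ntk' the gradient-descent kernel.
kernels = kernel_fn(x_test, x_train, ('nngp', 'ntk'))
print(kernels.ntk.shape)   # (3, 10)
print(kernels.nngp.shape)  # (3, 10)
```
Consistent with the abstract, under the standard parameterization the concrete widths passed to each layer are expected to enter the limiting kernel as relative widths, whereas under an NTK parameterization they would not; treat this sketch as an assumption about the library's behavior rather than a definitive recipe.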
Related papers
- Local Loss Optimization in the Infinite Width: Stable Parameterization of Predictive Coding Networks and Target Propagation [8.35644084613785]
We introduce the maximal update parameterization ($\mu$P) in the infinite-width limit for two representative designs of local targets.
By analyzing deep linear networks, we found that PC's gradients interpolate between first-order and Gauss-Newton-like gradients.
We demonstrate that, in specific standard settings, PC in the infinite-width limit behaves more similarly to the first-order gradient.
arXiv Detail & Related papers (2024-11-04T11:38:27Z) - Sparse deep neural networks for nonparametric estimation in high-dimensional sparse regression [4.983567824636051]
This study combines nonparametric estimation and parametric sparse deep neural networks for the first time.
As nonparametric estimation of partial derivatives is of great significance for nonlinear variable selection, these results point to a promising future for the interpretability of deep neural networks.
arXiv Detail & Related papers (2024-06-26T07:41:41Z) - Stochastic Marginal Likelihood Gradients using Neural Tangent Kernels [78.6096486885658]
We introduce lower bounds to the linearized Laplace approximation of the marginal likelihood.
These bounds are amenable to gradient-based optimization and allow trading off estimation accuracy against computational complexity.
arXiv Detail & Related papers (2023-06-06T19:02:57Z) - Over-parameterised Shallow Neural Networks with Asymmetrical Node Scaling: Global Convergence Guarantees and Feature Learning [23.47570704524471]
We consider optimisation of large and shallow neural networks via gradient flow, where the output of each hidden node is scaled by some positive parameter.
We prove that, for large neural networks, with high probability, gradient flow converges to a global minimum and can learn features, unlike in the NTK regime.
arXiv Detail & Related papers (2023-02-02T10:40:06Z) - Sample-Then-Optimize Batch Neural Thompson Sampling [50.800944138278474]
We introduce two algorithms for black-box optimization based on the Thompson sampling (TS) policy.
To choose an input query, we only need to train an NN and then choose the query by maximizing the output of the trained NN.
Our algorithms sidestep the need to invert the large parameter matrix yet still preserve the validity of the TS policy.
arXiv Detail & Related papers (2022-10-13T09:01:58Z) - Memorization and Optimization in Deep Neural Networks with Minimum Over-parameterization [14.186776881154127]
The Neural Tangent Kernel (NTK) has emerged as a powerful tool to provide memorization, optimization and generalization guarantees in deep neural networks.
We show that the NTK is well conditioned in a challenging sub-linear setup.
Our key technical contribution is a lower bound on the smallest NTK eigenvalue for deep networks.
arXiv Detail & Related papers (2022-05-20T14:50:24Z) - Random Features for the Neural Tangent Kernel [57.132634274795066]
We propose an efficient feature map construction for the Neural Tangent Kernel (NTK) of a fully-connected ReLU network.
We show that the dimension of the resulting features is much smaller than that of other baseline feature map constructions achieving comparable error bounds, both in theory and in practice.
arXiv Detail & Related papers (2021-04-03T09:08:12Z) - Feature Learning in Infinite-Width Neural Networks [17.309380337367536]
We show that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features.
We propose simple modifications to the standard parametrization to allow for feature learning in the limit.
arXiv Detail & Related papers (2020-11-30T03:21:05Z) - Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
We find that kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z) - On Random Kernels of Residual Architectures [93.94469470368988]
We derive finite width and depth corrections for the Neural Tangent Kernel (NTK) of ResNets and DenseNets.
Our findings show that in ResNets, convergence to the NTK may occur when depth and width simultaneously tend to infinity.
In DenseNets, however, convergence of the NTK to its limit as the width tends to infinity is guaranteed.
arXiv Detail & Related papers (2020-01-28T16:47:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.