Feature Learning in Infinite-Width Neural Networks
- URL: http://arxiv.org/abs/2011.14522v2
- Date: Tue, 11 May 2021 08:04:47 GMT
- Title: Feature Learning in Infinite-Width Neural Networks
- Authors: Greg Yang, Edward J. Hu
- Abstract summary: We show that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features.
We propose simple modifications to the standard parametrization to allow for feature learning in the limit.
- Score: 17.309380337367536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As its width tends to infinity, a deep neural network's behavior under
gradient descent can become simplified and predictable (e.g. given by the
Neural Tangent Kernel (NTK)), if it is parametrized appropriately (e.g. the NTK
parametrization). However, we show that the standard and NTK parametrizations
of a neural network do not admit infinite-width limits that can learn features,
which is crucial for pretraining and transfer learning such as with BERT. We
propose simple modifications to the standard parametrization to allow for
feature learning in the limit. Using the *Tensor Programs* technique, we derive
explicit formulas for such limits. On Word2Vec and few-shot learning on
Omniglot via MAML, two canonical tasks that rely crucially on feature learning,
we compute these limits exactly. We find that they outperform both NTK
baselines and finite-width networks, with the latter approaching the
infinite-width feature learning performance as width increases.
More generally, we classify a natural space of neural network
parametrizations that generalizes standard, NTK, and Mean Field
parametrizations. We show 1) any parametrization in this space either admits
feature learning or has an infinite-width training dynamics given by kernel
gradient descent, but not both; 2) any such infinite-width limit can be
computed using the Tensor Programs technique. Code for our experiments can be
found at github.com/edwardjhu/TP4.
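To make the parametrization space above concrete, here is a minimal sketch (not the authors' TP4 code) of the "abc" style of width scaling the paper studies: layer-$l$ weights carry a width-dependent multiplier $n^{-a_l}$, are initialized with variance $n^{-2b_l}$, and would be trained with a learning rate scaled as $n^{-c}$. The exponent values below are placeholders for illustration, not the paper's maximal-update ($\mu$P) settings.
```python
# Illustrative only: an MLP whose layer-l weights are W_l = n**(-a[l]) * w_l,
# with w_l initialized i.i.d. N(0, n**(-2*b[l])).  Exponents are placeholders.
import numpy as np

def init_abc_mlp(widths, a, b, seed=0):
    """widths = [d_in, n, ..., d_out]; a[l], b[l] are per-layer scaling exponents."""
    rng = np.random.default_rng(seed)
    layers = []
    for l, (fan_in, fan_out) in enumerate(zip(widths[:-1], widths[1:])):
        n = max(fan_in, fan_out)                       # width entering the scaling
        w = rng.normal(0.0, n ** (-b[l]), size=(fan_out, fan_in))  # Var = n**(-2*b[l])
        layers.append({"w": w, "mult": n ** (-a[l])})  # effective weight: n**(-a[l]) * w
    return layers

def forward(layers, x):
    h = x
    for l, layer in enumerate(layers):
        h = layer["mult"] * (layer["w"] @ h)
        if l < len(layers) - 1:
            h = np.tanh(h)                             # any smooth nonlinearity
    return h

# Placeholder exponents; under SGD one would also scale the learning rate by n**(-c).
layers = init_abc_mlp([10, 4096, 4096, 1], a=[0.0, 0.0, 1.0], b=[0.5, 0.5, 0.5])
print(forward(layers, np.random.default_rng(1).normal(size=10)))
```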
Related papers
- Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks [42.14352997147652]
We investigate the analogous classification for *depthwise parametrizations* of deep residual networks (resnets).
In resnets where each block has only one layer, we identify a unique optimal parametrization, called Depth-$\mu$P.
We find that Depth-$\mu$P can be characterized as maximizing both feature learning and feature diversity.
arXiv Detail & Related papers (2023-10-03T17:50:40Z)
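As a rough illustration of the depthwise scaling discussed in the entry above, the sketch below applies a $1/\sqrt{L}$ multiplier to each one-layer residual branch, the scaling commonly associated with Depth-$\mu$P; the exact constants and learning-rate rules from the paper are not reproduced, and the code is illustrative only.
```python
# Toy forward pass of a depth-L residual stack with branch outputs scaled by
# L**-0.5 (illustrative depth scaling; not the authors' implementation).
import numpy as np

def resnet_forward(x, blocks):
    scale = len(blocks) ** -0.5          # shrink each branch as depth grows
    h = x
    for W in blocks:
        h = h + scale * np.tanh(W @ h)   # one-layer residual block
    return h

rng = np.random.default_rng(0)
n, L = 256, 64
blocks = [rng.normal(0.0, n ** -0.5, size=(n, n)) for _ in range(L)]
print(np.linalg.norm(resnet_forward(rng.normal(size=n), blocks)))
```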
- The Onset of Variance-Limited Behavior for Networks in the Lazy and Rich Regimes [75.59720049837459]
We study the transition from infinite-width behavior to this variance-limited regime as a function of sample size $P$ and network width $N$.
We find that finite-size effects can become relevant for very small datasets on the order of $P^* \sim \sqrt{N}$ for regression with ReLU networks.
arXiv Detail & Related papers (2022-12-23T04:48:04Z)
- Sample-Then-Optimize Batch Neural Thompson Sampling [50.800944138278474]
We introduce two algorithms for black-box optimization based on the Thompson sampling (TS) policy.
To choose an input query, we only need to train an NN and then choose the query by maximizing the trained NN.
Our algorithms sidestep the need to invert the large parameter matrix yet still preserve the validity of the TS policy.
arXiv Detail & Related papers (2022-10-13T09:01:58Z)
- Fast Finite Width Neural Tangent Kernel [47.57136433797996]
The neural network Jacobian has emerged as a central object of study in deep learning.
The finite width NTK is notoriously expensive to compute.
We propose two novel algorithms that change the exponent of the compute and memory requirements of the finite width NTK.
arXiv Detail & Related papers (2022-06-17T12:18:22Z)
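For context on why the finite-width NTK in the entry above is expensive: the baseline way to form it contracts per-example parameter Jacobians, $\Theta(x_1, x_2) = J(x_1) J(x_2)^\top$. The sketch below (PyTorch, illustrative only) implements this naive Jacobian-contraction baseline, not the paper's faster algorithms.
```python
# Baseline finite-width empirical NTK by Jacobian contraction (the expensive
# reference computation; illustrative only).
import torch
from torch import nn
from torch.func import functional_call, jacrev, vmap

net = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 1))
params = dict(net.named_parameters())

def f_single(p, x):
    # forward pass on one example, written as a pure function of the parameters
    return functional_call(net, p, (x.unsqueeze(0),)).squeeze(0)

def empirical_ntk(x1, x2):
    # per-example Jacobians with respect to every parameter tensor
    j1 = vmap(jacrev(f_single), (None, 0))(params, x1)
    j2 = vmap(jacrev(f_single), (None, 0))(params, x2)
    # Theta[i, j] = sum over parameters of <dF(x1_i)/dtheta, dF(x2_j)/dtheta>
    ntk = 0.0
    for name in j1:
        a = j1[name].flatten(2)   # [N1, out_dim, params_in_tensor]
        b = j2[name].flatten(2)   # [N2, out_dim, params_in_tensor]
        ntk = ntk + torch.einsum("iap,jbp->ijab", a, b)
    return ntk                    # [N1, N2, out_dim, out_dim]

theta = empirical_ntk(torch.randn(4, 3), torch.randn(5, 3))
print(theta.shape)  # torch.Size([4, 5, 1, 1])
```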
- Memorization and Optimization in Deep Neural Networks with Minimum Over-parameterization [14.186776881154127]
The Neural Tangent Kernel (NTK) has emerged as a powerful tool to provide memorization, optimization and generalization guarantees in deep neural networks.
We show that the NTK is well conditioned in a challenging sub-linear setup.
Our key technical contribution is a lower bound on the smallest NTK eigenvalue for deep networks.
arXiv Detail & Related papers (2022-05-20T14:50:24Z)
- On Feature Learning in Neural Networks with Global Convergence Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF).
We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
arXiv Detail & Related papers (2022-04-22T15:56:43Z)
- Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z)
- On the infinite width limit of neural networks with a standard parameterization [52.07828272324366]
We propose an improved extrapolation of the standard parameterization that preserves all of these properties as width is taken to infinity.
We show experimentally that the resulting kernels typically achieve similar accuracy to those resulting from an NTK parameterization.
arXiv Detail & Related papers (2020-01-21T01:02:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.