Dynamically Stable Infinite-Width Limits of Neural Classifiers
- URL: http://arxiv.org/abs/2006.06574v2
- Date: Thu, 22 Oct 2020 12:56:20 GMT
- Title: Dynamically Stable Infinite-Width Limits of Neural Classifiers
- Authors: Eugene A. Golikov
- Abstract summary: We propose a general framework to study how the limit behavior of neural models depends on the scaling of hyperparameters with network width.
Existing MF and NTK limit models, as well as one novel limit model, satisfy most of the properties demonstrated by finite-width models.
- Score: 6.09170287691728
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research has focused on two different approaches to studying
neural network training in the limit of infinite width: (1) the mean-field (MF)
approximation and (2) the constant neural tangent kernel (NTK) approximation.
These two approaches use different scalings of hyperparameters with the width of a
network layer and, as a result, yield different infinite-width limit models. We
propose a general framework to study how the limit behavior of neural models
depends on the scaling of hyperparameters with network width. Our framework
allows us to derive scaling for existing MF and NTK limits, as well as an
uncountable number of other scalings that lead to a dynamically stable limit
behavior of corresponding models. However, only a finite number of distinct
limit models are induced by these scalings. Each distinct limit model
corresponds to a unique combination of such properties as boundedness of logits
and tangent kernels at initialization or stationarity of tangent kernels.
Existing MF and NTK limit models, as well as one novel limit model, satisfy
most of the properties demonstrated by finite-width models. We also propose a
novel initialization-corrected mean-field limit that satisfies all properties
noted above, and its corresponding model is a simple modification of a
finite-width model.
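To make the width scalings concrete, here is a minimal sketch (not taken from the paper; the function names are ad hoc, and the exponents 1/2 and 1 are the textbook output scalings usually associated with the NTK and MF parametrizations). It illustrates one of the properties listed above, boundedness of logits at initialization: under the NTK-style scaling the logits stay O(1) as width grows, while under the MF-style scaling they shrink.

```python
# Minimal sketch of width-dependent output scaling in a two-layer network
# f(x) = n**(-a) * sum_i v_i * tanh(w_i . x); a = 0.5 mimics the NTK
# parametrization, a = 1.0 the mean-field one (standard textbook choices,
# not the paper's exact setup).
import numpy as np

def init_two_layer(n, d, rng):
    """Width-n hidden layer, d-dimensional input, O(1) random weights."""
    w = rng.standard_normal((n, d))   # input-to-hidden weights
    v = rng.standard_normal(n)        # hidden-to-output weights
    return w, v

def logits(x, w, v, a):
    """Network output with the n**(-a) multiplier on the readout layer."""
    n = v.shape[0]
    h = np.tanh(w @ x)                # hidden activations, entries of order 1
    return v @ h / n**a

rng = np.random.default_rng(0)
x = rng.standard_normal(16) / 4.0
for n in (100, 10_000, 1_000_000):
    w, v = init_two_layer(n, x.size, rng)
    print(f"n={n:>9}  NTK-scaled logit {logits(x, w, v, 0.5):+.3f}"
          f"  MF-scaled logit {logits(x, w, v, 1.0):+.5f}")
```

Under the NTK-style scaling the printed logit stays of order 1 at every width, whereas the MF-scaled logit decays roughly like n**(-1/2); this kind of initialization behavior, together with the behavior of the tangent kernel during training, is what distinguishes the limit models described in the abstract.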
Related papers
- On the Neural Tangent Kernel of Equilibrium Models [72.29727250679477]
This work studies the neural tangent kernel (NTK) of the deep equilibrium (DEQ) model.
We show that, contrary to what one might expect, a DEQ model still enjoys a deterministic NTK despite its width and depth going to infinity simultaneously, under mild conditions.
arXiv Detail & Related papers (2023-10-21T16:47:18Z)
- A Functional-Space Mean-Field Theory of Partially-Trained Three-Layer Neural Networks [49.870593940818715]
We study the infinite-width limit of a type of three-layer NN model whose first layer is random and fixed.
Our theory accommodates different scaling choices of the model, resulting in two regimes of the MF limit that demonstrate distinctive behaviors.
arXiv Detail & Related papers (2022-10-28T17:26:27Z)
- High-dimensional limit theorems for SGD: Effective dynamics and critical scaling [6.950316788263433]
We prove limit theorems for the trajectories of summary statistics of stochastic gradient descent (SGD).
We show a critical scaling regime for the step-size, below which the effective ballistic dynamics matches gradient flow for the population loss.
Near the fixed points of this effective dynamics, the corresponding diffusive limits can be quite complex and even degenerate.
arXiv Detail & Related papers (2022-06-08T17:42:18Z)
- Training Integrable Parameterizations of Deep Neural Networks in the Infinite-Width Limit [0.0]
Large-width dynamics has emerged as a fruitful viewpoint and led to practical insights on real-world deep networks.
For two-layer neural networks, it has been understood that the nature of the trained model radically changes depending on the scale of the initial random weights.
We propose various methods to avoid this trivial behavior and analyze in detail the resulting dynamics.
arXiv Detail & Related papers (2021-10-29T07:53:35Z)
- On the Generalization Power of Overfitted Two-Layer Neural Tangent Kernel Models [42.72822331030195]
We study minimum $\ell$-norm overfitting solutions for the neural tangent kernel (NTK) model of a two-layer neural network.
We show that, depending on the ground-truth function, the test error of overfitted NTK models exhibits characteristics that differ from the "double-descent" phenomenon.
For functions outside of this class, we provide a lower bound on the generalization error that does not diminish to zero even when $n$ and $p$ are both large.
arXiv Detail & Related papers (2021-03-09T06:24:59Z)
- Provable Model-based Nonlinear Bandit and Reinforcement Learning: Shelve Optimism, Embrace Virtual Curvature [61.22680308681648]
We show that global convergence is statistically intractable even for a one-layer neural net bandit with a deterministic reward.
For both nonlinear bandit and RL, the paper presents a model-based algorithm, Virtual Ascent with Online Model Learner (ViOL).
arXiv Detail & Related papers (2021-02-08T12:41:56Z)
- Feature Learning in Infinite-Width Neural Networks [17.309380337367536]
We show that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features.
We propose simple modifications to the standard parametrization to allow for feature learning in the limit.
arXiv Detail & Related papers (2020-11-30T03:21:05Z)
- A Dynamical Central Limit Theorem for Shallow Neural Networks [48.66103132697071]
We prove that the fluctuations around the mean limit remain bounded in mean square throughout training.
If the mean-field dynamics converges to a measure that interpolates the training data, we prove that the deviation eventually vanishes in the CLT scaling.
arXiv Detail & Related papers (2020-08-21T18:00:50Z)
- Multipole Graph Neural Operator for Parametric Partial Differential Equations [57.90284928158383]
One of the main challenges in using deep learning-based methods for simulating physical systems is formulating physics-based data in a structure suitable for neural networks.
We propose a novel multi-level graph neural network framework that captures interaction at all ranges with only linear complexity.
Experiments confirm our multi-graph network learns discretization-invariant solution operators to PDEs and can be evaluated in linear time.
arXiv Detail & Related papers (2020-06-16T21:56:22Z)
- Measuring Model Complexity of Neural Networks with Curve Activation Functions [100.98319505253797]
We propose the linear approximation neural network (LANN) to approximate a given deep model with curve activation function.
We experimentally explore the training process of neural networks and detect overfitting.
We find that the $L_1$ and $L_2$ regularizations suppress the increase of model complexity.
arXiv Detail & Related papers (2020-06-16T07:38:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.