Training Integrable Parameterizations of Deep Neural Networks in the
Infinite-Width Limit
- URL: http://arxiv.org/abs/2110.15596v1
- Date: Fri, 29 Oct 2021 07:53:35 GMT
- Title: Training Integrable Parameterizations of Deep Neural Networks in the
Infinite-Width Limit
- Authors: Karl Hajjar (LMO, CELESTE), Lénaïc Chizat (LMO), Christophe Giraud (LMO)
- Abstract summary: Large-width dynamics has emerged as a fruitful viewpoint and led to practical insights on real-world deep networks.
For two-layer neural networks, it has been understood that the nature of the trained model radically changes depending on the scale of the initial random weights.
We propose various methods to avoid this trivial behavior and analyze in detail the resulting dynamics.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To theoretically understand the behavior of trained deep neural networks, it
is necessary to study the dynamics induced by gradient methods from a random
initialization. However, the nonlinear and compositional structure of these
models makes these dynamics difficult to analyze. To overcome these challenges,
large-width asymptotics have recently emerged as a fruitful viewpoint and led
to practical insights on real-world deep networks. For two-layer neural
networks, it has been understood via these asymptotics that the nature of the
trained model radically changes depending on the scale of the initial random
weights, ranging from a kernel regime (for large initial variance) to a feature
learning regime (for small initial variance). For deeper networks more regimes
are possible, and in this paper we study in detail a specific choice of "small"
initialization corresponding to "mean-field" limits of neural networks, which
we call integrable parameterizations (IPs). First, we show that under standard
i.i.d. zero-mean initialization, integrable parameterizations of neural
networks with more than four layers start at a stationary point in the
infinite-width limit and no learning occurs. We then propose various methods to
avoid this trivial behavior and analyze in detail the resulting dynamics. In
particular, one of these methods consists in using large initial learning
rates, and we show that it is equivalent to a modification of the recently
proposed maximal update parameterization $\mu$P. We confirm our results with
numerical experiments on image classification tasks, which additionally show a
strong difference in behavior between various choices of activation functions
that is not yet captured by theory.
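As a rough, self-contained illustration of the scaling issue described in the abstract, the NumPy sketch below compares the forward pass of a deep MLP under an NTK-style 1/sqrt(fan_in) prefactor with the same network under a mean-field-style 1/fan_in prefactor, using standard i.i.d. zero-mean Gaussian weights. The prefactors, depth, widths, and activation are illustrative assumptions and do not reproduce the paper's exact integrable parameterization or its modified $\mu$P; the point is only that the 1/fan_in scaling drives the output (and hence the gradients) toward zero as the width grows, consistent with the "no learning occurs" phenomenon stated above.
```python
# Minimal sketch (not the authors' code): contrasts an NTK-style 1/sqrt(fan_in)
# prefactor with a mean-field-style 1/fan_in prefactor, both with i.i.d.
# zero-mean Gaussian weights. All sizes and scalings are illustrative.
import numpy as np

def forward(x, widths, scale, rng, act=np.tanh):
    """MLP forward pass where every layer's pre-activation is rescaled
    by scale(fan_in); no activation is applied to the output layer."""
    h = x
    n_layers = len(widths) - 1
    for i, (fan_in, fan_out) in enumerate(zip(widths[:-1], widths[1:])):
        W = rng.standard_normal((fan_out, fan_in))   # i.i.d. N(0, 1) weights
        h = scale(fan_in) * (W @ h)                  # layer-dependent prefactor
        if i < n_layers - 1:
            h = act(h)
    return h

rng = np.random.default_rng(0)
x = rng.standard_normal(10)                          # one 10-dimensional input
for m in (128, 512, 2048):
    widths = [10, m, m, m, m, 1]                     # five weight layers (> 4)
    out_ntk = forward(x, widths, lambda fan_in: 1.0 / np.sqrt(fan_in), rng)
    out_ip  = forward(x, widths, lambda fan_in: 1.0 / fan_in, rng)
    print(f"width {m:5d}:  |NTK-style output| ~ {abs(out_ntk.item()):.2e}"
          f"   |1/fan_in output| ~ {abs(out_ip.item()):.2e}")
```
With this setup the NTK-style output stays of order one while the 1/fan_in output shrinks rapidly as the width grows.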
Related papers
- Topological obstruction to the training of shallow ReLU neural networks [0.0]
We study the interplay between the geometry of the loss landscape and the optimization trajectories of simple neural networks.
This paper reveals the presence of a topological obstruction in the loss landscape of shallow ReLU neural networks trained using gradient flow.
arXiv Detail & Related papers (2024-10-18T19:17:48Z) - Robust Weight Initialization for Tanh Neural Networks with Fixed Point Analysis [5.016205338484259]
The proposed method is more robust to network size variations than the existing method.
When applied to Physics-Informed Neural Networks, the method exhibits faster convergence and robustness to variations of the network size.
arXiv Detail & Related papers (2024-10-03T06:30:27Z) - Demystifying Lazy Training of Neural Networks from a Macroscopic Viewpoint [5.9954962391837885]
We study the gradient descent dynamics of neural networks through the lens of macroscopic limits.
Our study reveals that gradient descent can rapidly drive deep neural networks to zero training loss.
Our approach draws inspiration from the Neural Tangent Kernel (NTK) paradigm.
arXiv Detail & Related papers (2024-04-07T08:07:02Z) - Compelling ReLU Networks to Exhibit Exponentially Many Linear Regions at Initialization and During Training [1.7205106391379021]
A neural network with ReLU activations may be viewed as a composition of piecewise linear functions.
We introduce a novel training strategy that forces the network to exhibit a number of linear regions exponential in depth.
This approach allows us to learn approximations of convex, one-dimensional functions that are several orders of magnitude more accurate than their randomly initialized counterparts; see the linear-region counting sketch after this list.
arXiv Detail & Related papers (2023-11-29T19:09:48Z) - Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Simple initialization and parametrization of sinusoidal networks via
their kernel bandwidth [92.25666446274188]
Neural networks with sinusoidal activations have been proposed as an alternative to networks with traditional activation functions.
We first propose a simplified version of such sinusoidal neural networks, which allows both for easier practical implementation and simpler theoretical analysis.
We then analyze the behavior of these networks from the neural tangent kernel perspective and demonstrate that their kernel approximates a low-pass filter with an adjustable bandwidth.
arXiv Detail & Related papers (2022-11-26T07:41:48Z) - Dynamic Neural Diversification: Path to Computationally Sustainable
Neural Networks [68.8204255655161]
Small neural networks with a constrained number of trainable parameters can be suitable resource-efficient candidates for many simple tasks.
We explore the diversity of the neurons within the hidden layer during the learning process.
We analyze how the diversity of the neurons affects predictions of the model.
arXiv Detail & Related papers (2021-09-20T15:12:16Z) - Modeling from Features: a Mean-field Framework for Over-parameterized
Deep Neural Networks [54.27962244835622]
This paper proposes a new mean-field framework for over-parameterized deep neural networks (DNNs).
In this framework, a DNN is represented by probability measures and functions over its features in the continuous limit.
We illustrate the framework via the standard DNN and the Residual Network (Res-Net) architectures.
arXiv Detail & Related papers (2020-07-03T01:37:16Z) - A Shooting Formulation of Deep Learning [19.51427735087011]
We introduce a shooting formulation which shifts the perspective from parameterizing a network layer-by-layer to parameterizing over optimal networks.
For scalability, we propose a novel particle-ensemble parametrization which fully specifies the optimal weight trajectory of the continuous-depth neural network.
arXiv Detail & Related papers (2020-06-18T07:36:04Z) - Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z) - MSE-Optimal Neural Network Initialization via Layer Fusion [68.72356718879428]
Deep neural networks achieve state-of-the-art performance for a range of classification and inference tasks.
The use of gradient-based optimization combined with nonconvexity renders learning susceptible to the choice of initialization.
We propose fusing neighboring layers of deeper networks that are initialized with random variables.
arXiv Detail & Related papers (2020-01-28T18:25:15Z)
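For the "exponentially many linear regions" entry above: a ReLU network is a piecewise linear function, and its number of linear pieces can be estimated empirically. The sketch below is an illustration under assumed sizes, not that paper's training strategy; it counts the linear regions of a small, randomly initialized 1-D ReLU network by recording which hidden units are active on a fine grid, where each change of the joint activation pattern starts a new linear piece.
```python
# Minimal sketch (illustrative sizes, not from the paper): estimates the number
# of linear regions of a small ReLU network on [0, 1] via activation patterns.
import numpy as np

def activation_patterns(x_grid, weights, biases):
    """Return the on/off pattern of every hidden ReLU unit at each input point."""
    patterns = []
    h = x_grid.reshape(1, -1)                      # shape (1, n_points)
    for W, b in zip(weights, biases):
        pre = W @ h + b[:, None]                   # pre-activations
        patterns.append(pre > 0)                   # ReLU on/off pattern
        h = np.maximum(pre, 0.0)                   # ReLU activation
    return np.concatenate(patterns, axis=0)        # (total_hidden_units, n_points)

rng = np.random.default_rng(0)
widths = [1, 16, 16, 1]                            # small 1-D ReLU network
weights = [rng.standard_normal((m, n)) for n, m in zip(widths[:-1], widths[1:])]
biases  = [rng.standard_normal(m) for m in widths[1:]]

x_grid = np.linspace(0.0, 1.0, 20001)
# Hidden layers only: the final linear layer adds no new kinks.
pat = activation_patterns(x_grid, weights[:-1], biases[:-1])
# Each change of the joint activation pattern along the grid starts a new piece.
changes = np.any(pat[:, 1:] != pat[:, :-1], axis=0).sum()
print(f"approx. number of linear regions on [0, 1]: {changes + 1}")
```
At random initialization the count is typically far below the theoretical maximum for that depth and width, which is the kind of gap the training strategy in that paper aims to close.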
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information (including all listed content) and is not responsible for any consequences arising from its use.