Phase diagram for two-layer ReLU neural networks at infinite-width limit
- URL: http://arxiv.org/abs/2007.07497v2
- Date: Tue, 13 Oct 2020 05:35:06 GMT
- Title: Phase diagram for two-layer ReLU neural networks at infinite-width limit
- Authors: Tao Luo, Zhi-Qin John Xu, Zheng Ma, Yaoyu Zhang
- Abstract summary: We draw the phase diagram for the two-layer ReLU neural network at the infinite-width limit.
We identify three regimes in the phase diagram, i.e., linear regime, critical regime and condensed regime.
In the linear regime, the NN training dynamics is approximately linear, similar to a random feature model, with an exponential loss decay.
In the condensed regime, we demonstrate through experiments that active neurons are condensed at several discrete orientations.
- Score: 6.380166265263755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How a neural network behaves during training under different choices of hyperparameters is an important question in the study of neural networks. In this work, inspired by the phase diagram in statistical mechanics, we draw the phase diagram for the two-layer ReLU neural network at the infinite-width limit to completely characterize its dynamical regimes and their dependence on hyperparameters related to initialization. Through both experimental and theoretical approaches, we identify three regimes in the phase diagram, i.e., the linear regime, the critical regime and the condensed regime, based on the relative change of input weights as the width approaches infinity, which tends to $0$, $O(1)$ and $+\infty$, respectively. In the linear regime, the NN training dynamics is approximately linear, similar to a random feature model, with an exponential loss decay. In the condensed regime, we demonstrate through experiments that active neurons are condensed at several discrete orientations. The critical regime serves as the boundary between the above two regimes and exhibits an intermediate nonlinear behavior, with the mean-field model as a typical example. Overall, our phase diagram for the two-layer ReLU NN serves as a map for future studies and is a first step towards a more systematic investigation of the training behavior and implicit regularization of NNs of different structures.
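To make the regime classification concrete, the sketch below trains a two-layer ReLU network at several widths and records the relative change of the input weights, the quantity the phase diagram is organized around: it tends to $0$ in the linear regime, stays $O(1)$ in the critical regime, and grows without bound in the condensed regime. This is a minimal illustration assuming a simple 1/width output scaling and plain SGD; it does not reproduce the paper's exact scaling parameters or hyperparameters.

```python
# Minimal sketch (assumed 1/width output scaling, plain SGD; not the paper's exact setup).
import torch

torch.manual_seed(0)
X = torch.randn(64, 2)          # toy 2-d inputs
y = torch.sin(X[:, :1])         # toy 1-d targets

def relative_change(width, steps=2000, lr=0.05):
    """Train f(x) = (1/width) * sum_k a_k ReLU(w_k . x) and return
    ||W(T) - W(0)||_F / ||W(0)||_F for the input weights W."""
    W = torch.randn(width, 2, requires_grad=True)   # input weights
    a = torch.randn(width, 1, requires_grad=True)   # output weights
    W0 = W.detach().clone()
    opt = torch.optim.SGD([W, a], lr=lr)
    for _ in range(steps):
        out = torch.relu(X @ W.t()) @ a / width
        loss = ((out - y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (torch.norm(W.detach() - W0) / torch.norm(W0)).item()

for m in (50, 500, 5000):
    print(f"width={m:5d}  relative change of input weights: {relative_change(m):.4f}")
```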
Related papers
- Machine learning in and out of equilibrium [58.88325379746631]
Our study uses a Fokker-Planck approach, adapted from statistical physics, to explore the parallels between neural network training and physical systems in and out of equilibrium.
We focus in particular on the stationary state of the system in the long-time limit, which in conventional SGD is out of equilibrium.
We propose a new variation of stochastic gradient Langevin dynamics (SGLD) that harnesses without-replacement minibatching.
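For concreteness, here is a generic sketch of SGLD with without-replacement (epoch-wise shuffled) minibatches; it does not reproduce the specific variant proposed in the paper, and the step size and temperature are illustrative only.

```python
# Generic SGLD sketch with without-replacement minibatching (not the paper's exact variant).
import torch

def sgld_epoch(params, X, y, loss_fn, lr=1e-3, temperature=1e-4, batch_size=32):
    """One epoch of SGLD; minibatches are drawn without replacement
    (a single shuffle per epoch)."""
    perm = torch.randperm(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        loss = loss_fn(params, X[idx], y[idx])
        grads = torch.autograd.grad(loss, params)
        with torch.no_grad():
            for p, g in zip(params, grads):
                p -= lr * g                                                 # gradient step
                p += (2.0 * lr * temperature) ** 0.5 * torch.randn_like(p)  # Langevin noise
    return params
```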
arXiv Detail & Related papers (2023-06-06T09:12:49Z)
- Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z)
- A Functional-Space Mean-Field Theory of Partially-Trained Three-Layer Neural Networks [49.870593940818715]
We study the infinite-width limit of a type of three-layer NN model whose first layer is random and fixed.
Our theory accommodates different scaling choices of the model, resulting in two regimes of the MF limit that demonstrate distinctive behaviors.
arXiv Detail & Related papers (2022-10-28T17:26:27Z)
- Empirical Phase Diagram for Three-layer Neural Networks with Infinite Width [5.206156813130247]
We take a step towards drawing a phase diagram for three-layer ReLU NNs with infinite width.
For both synthetic and real datasets, we find that the dynamics of each layer can be divided into a linear regime and a condensed regime.
In the condensed regime, we also observe the condensation of weights in isolated orientations with low complexity.
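One simple way to visualize such condensation (an illustrative diagnostic, not necessarily the measurement used in the paper) is to compute pairwise cosine similarities between the first-layer weight vectors of a trained network; condensed weights show similarities clustering near +/-1, i.e. neurons aligned along a few shared orientations.

```python
# Illustrative condensation check via cosine similarities of neuron orientations.
import torch

def orientation_similarity(W):
    """W: (num_neurons, fan_in) first-layer weight matrix of a trained net.
    Entries of the returned matrix near +/-1 indicate neurons that share
    (or oppose) the same orientation."""
    directions = W / W.norm(dim=1, keepdim=True)   # unit vector per neuron
    return directions @ directions.t()

W_trained = torch.randn(8, 3)                      # stand-in for trained first-layer weights
print(orientation_similarity(W_trained))
```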
arXiv Detail & Related papers (2022-05-24T14:27:31Z)
- Training Integrable Parameterizations of Deep Neural Networks in the Infinite-Width Limit [0.0]
Large-width dynamics has emerged as a fruitful viewpoint and led to practical insights on real-world deep networks.
For two-layer neural networks, it has been understood that the nature of the trained model radically changes depending on the scale of the initial random weights.
We propose various methods to avoid this trivial behavior and analyze in detail the resulting dynamics.
arXiv Detail & Related papers (2021-10-29T07:53:35Z)
- Provably Efficient Neural Estimation of Structural Equation Model: An Adversarial Approach [144.21892195917758]
We study estimation in a class of generalized structural equation models (SEMs).
We formulate the linear operator equation as a min-max game, where both players are parameterized by neural networks (NNs), and learn the parameters of these neural networks using gradient descent.
For the first time, we provide a tractable estimation procedure for SEMs based on NNs with provable convergence and without the need for sample splitting.
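The sketch below shows the general shape of such a min-max procedure, with both players parameterized by small networks and updated by gradient descent-ascent; the objective used here is a placeholder, not the paper's structural-equation estimator.

```python
# Generic gradient descent-ascent on a min-max objective with two NN players
# (placeholder objective; not the paper's SEM estimator).
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(128, 4)
y = torch.randn(128, 1)

primal = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))      # minimizing player
adversary = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))   # maximizing player
opt_min = torch.optim.SGD(primal.parameters(), lr=1e-2)
opt_max = torch.optim.SGD(adversary.parameters(), lr=1e-2)

def game_value():
    # Placeholder: adversary tests the residual of the primal model, with a quadratic penalty.
    residual = y - primal(x)
    return (adversary(x) * residual).mean() - 0.5 * (adversary(x) ** 2).mean()

for _ in range(200):
    opt_max.zero_grad()
    (-game_value()).backward()   # ascent step for the adversary
    opt_max.step()
    opt_min.zero_grad()
    game_value().backward()      # descent step for the primal network
    opt_min.step()
```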
arXiv Detail & Related papers (2020-07-02T17:55:47Z)
- The Quenching-Activation Behavior of the Gradient Descent Dynamics for Two-layer Neural Network Models [12.865834066050427]
The gradient descent (GD) algorithm for training two-layer neural network models is studied.
Two distinctive phases in the dynamic behavior of GD in the under-parametrized regime are studied.
The quenching-activation process seems to provide a clear mechanism for "implicit regularization".
arXiv Detail & Related papers (2020-06-25T14:41:53Z)
- An Ode to an ODE [78.97367880223254]
We present a new paradigm for Neural ODE algorithms, called ODEtoODE, where time-dependent parameters of the main flow evolve according to a matrix flow on the group O(d).
This nested system of two flows provides stability and effectiveness of training and provably solves the gradient vanishing-explosion problem.
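As a rough illustration of the underlying idea (a discretized flow that stays on the orthogonal group, not the ODEtoODE architecture itself), multiplying an orthogonal matrix by the exponential of a skew-symmetric generator preserves orthogonality at every step:

```python
# Discretised flow on the orthogonal group O(d): Q_{k+1} = Q_k expm(h S),
# with S skew-symmetric, so Q stays orthogonal (illustration only).
import torch

d, steps, h = 4, 100, 0.01
Q = torch.eye(d)                       # start on the group
A = torch.randn(d, d)
S = A - A.t()                          # skew-symmetric generator of the flow
for _ in range(steps):
    Q = Q @ torch.matrix_exp(h * S)    # exact step of dQ/dt = Q S for constant S
print(torch.allclose(Q.t() @ Q, torch.eye(d), atol=1e-4))  # True: orthogonality preserved
```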
arXiv Detail & Related papers (2020-06-19T22:05:19Z)
- Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition between the kernel and rich regimes empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)