The Surprising Simplicity of the Early-Time Learning Dynamics of Neural
Networks
- URL: http://arxiv.org/abs/2006.14599v1
- Date: Thu, 25 Jun 2020 17:42:49 GMT
- Title: The Surprising Simplicity of the Early-Time Learning Dynamics of Neural
Networks
- Authors: Wei Hu, Lechao Xiao, Ben Adlam, Jeffrey Pennington
- Abstract summary: In this work, we show that these common perceptions can be completely false in the early phase of learning.
We argue that this surprising simplicity can persist in networks with more layers and with convolutional architectures.
- Score: 43.860358308049044
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern neural networks are often regarded as complex black-box functions
whose behavior is difficult to understand owing to their nonlinear dependence
on the data and the nonconvexity in their loss landscapes. In this work, we
show that these common perceptions can be completely false in the early phase
of learning. In particular, we formally prove that, for a class of well-behaved
input distributions, the early-time learning dynamics of a two-layer
fully-connected neural network can be mimicked by training a simple linear
model on the inputs. We additionally argue that this surprising simplicity can
persist in networks with more layers and with convolutional architecture, which
we verify empirically. Key to our analysis is to bound the spectral norm of the
difference between the Neural Tangent Kernel (NTK) at initialization and an
affine transform of the data kernel; however, unlike many previous results
utilizing the NTK, we do not require the network to have disproportionately
large width, and the network is allowed to escape the kernel regime later in
training.
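As a concrete check of this claim, the sketch below trains a small two-layer ReLU network and a plain linear model on the same data with gradient descent and prints both loss curves, whose early-time shapes should track each other before the network eventually departs from the linear trajectory. This is a minimal illustration, not the authors' code: the synthetic data, the sizes, the learning rate, and the symmetric initialization (paired neurons with opposite output weights, so the network output is exactly zero at initialization) are all assumptions, and the constants relating the two trajectories are not matched exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (assumed purely for illustration).
n, d = 256, 32
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = np.tanh(X @ rng.standard_normal(d))

# Two-layer ReLU network f(x) = a . relu(W x) / sqrt(m), with a symmetric
# initialization so that f(x) = 0 at initialization, like the linear model.
m = 256
W = rng.standard_normal((m, d))
W = np.vstack([W, W])
a = rng.standard_normal(m)
a = np.concatenate([a, -a])
m = 2 * m

# Simple linear model on the raw inputs: g(x) = beta . x + b.
beta, b = np.zeros(d), 0.0

lr, steps = 0.5, 100
for t in range(steps):
    # Network: forward pass and manual gradients of the MSE loss.
    H = np.maximum(X @ W.T, 0.0)                   # (n, m) hidden activations
    f = H @ a / np.sqrt(m)
    r_net = f - y
    mask = (H > 0).astype(X.dtype)                 # ReLU derivative
    grad_a = H.T @ r_net / (np.sqrt(m) * n)
    grad_W = ((r_net[:, None] * mask) * (a / np.sqrt(m))).T @ X / n
    a -= lr * grad_a
    W -= lr * grad_W

    # Linear model: one gradient-descent step on the same loss.
    r_lin = X @ beta + b - y
    beta -= lr * X.T @ r_lin / n
    b -= lr * r_lin.mean()

    if t % 10 == 0:
        print(f"step {t:3d}  net {np.mean(r_net**2):.4f}  linear {np.mean(r_lin**2):.4f}")
```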
Related papers
- Early learning of the optimal constant solution in neural networks and humans [4.016584525313835]
We show that learning of a target function is preceded by an early phase in which networks learn the optimal constant solution (OCS).
We show that learning of the OCS can emerge even in the absence of bias terms and is equivalently driven by generic correlations in the input data.
Our work suggests the OCS as a universal learning principle in supervised, error-corrective learning.
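Concretely, the optimal constant solution is the constant output that minimizes the expected loss: the mean of the targets under mean squared error, or the class base rates under cross-entropy. A minimal sketch (the toy targets are assumed):

```python
import numpy as np

y = np.array([0., 0., 1., 1., 1.])         # example regression targets (assumed)

# OCS under mean squared error: argmin_c E[(c - y)^2] is the target mean.
ocs_mse = y.mean()

# OCS under cross-entropy for class labels: the vector of class base rates.
labels = np.array([0, 0, 1, 1, 1])
ocs_ce = np.bincount(labels) / len(labels)

print(ocs_mse)   # 0.6
print(ocs_ce)    # [0.4 0.6]
```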
arXiv Detail & Related papers (2024-06-25T11:12:52Z) - How neural networks learn to classify chaotic time series [77.34726150561087]
We study the inner workings of neural networks trained to classify regular-versus-chaotic time series.
We find that the relation between input periodicity and activation periodicity is key to the performance of large-kernel convolutional neural network (LKCNN) models.
arXiv Detail & Related papers (2023-06-04T08:53:27Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study the training of one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that these neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of neural networks with this new kernel.
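For a finite-width network with parameters theta, the empirical NTK is Theta(x, x') = grad_theta f(x) . grad_theta f(x'). The sketch below computes it for a one-hidden-layer ReLU network with a bias term; the architecture and sizes are assumptions, and this is the standard empirical NTK rather than the paper's bias-generalized variant, though the strongly negative bias puts the network in the sparse-activation regime the paper studies.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 8, 5, 1000
X = rng.standard_normal((n, d))

# One-hidden-layer ReLU network f(x) = a . relu(W x + b) / sqrt(m).
W = rng.standard_normal((m, d))
b = rng.standard_normal(m) - 2.0     # large negative bias -> sparse activations
a = rng.standard_normal(m)

pre = X @ W.T + b                    # (n, m) preactivations
H = np.maximum(pre, 0.0)             # hidden activations (mostly zero here)
mask = (pre > 0).astype(X.dtype)     # ReLU derivative

# Per-example parameter gradients, flattened into rows of a Jacobian J.
grad_a = H / np.sqrt(m)                              # (n, m)
grad_b = mask * a / np.sqrt(m)                       # (n, m)
grad_W = grad_b[:, :, None] * X[:, None, :]          # (n, m, d)
J = np.concatenate([grad_a, grad_b, grad_W.reshape(n, -1)], axis=1)

ntk = J @ J.T                        # empirical NTK Gram matrix, (n, n)
print(ntk.shape, "activation sparsity:", 1.0 - mask.mean())
```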
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - Neural networks trained with SGD learn distributions of increasing
complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics.
They exploit higher-order statistics only later in training.
We discuss the relation of this distributional simplicity bias (DSB) to other simplicity biases and consider its implications for the principle of universality in learning.
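One way to make "lower-order input statistics" concrete (a sketch on assumed toy data, not the paper's protocol) is a nearest-class-mean rule, which uses only first moments; a network's accuracy early in training would be compared against this baseline before it learns to exploit covariance and higher-order structure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes with a small mean shift (first-order difference) and
# different covariance scales (second-order difference) -- assumed toy data.
n, d = 500, 10
X0 = rng.standard_normal((n, d)) * 1.0 + 0.2
X1 = rng.standard_normal((n, d)) * 2.0 - 0.2
X = np.vstack([X0, X1])
y = np.array([0] * n + [1] * n)

# "Lower-order" classifier: nearest class mean (first moments only).
mu0, mu1 = X0.mean(0), X1.mean(0)
pred_mean = (np.linalg.norm(X - mu1, axis=1)
             < np.linalg.norm(X - mu0, axis=1)).astype(int)
print("nearest-mean accuracy:", (pred_mean == y).mean())
# Early in training a network's accuracy should roughly match this
# baseline; only later does it exceed it by using covariance structure.
```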
arXiv Detail & Related papers (2022-11-21T15:27:22Z) - Data-driven emergence of convolutional structure in neural networks [83.4920717252233]
We show how fully-connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs.
By carefully designing data models, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs.
arXiv Detail & Related papers (2022-02-01T17:11:13Z) - What can linearized neural networks actually say about generalization? [67.83999394554621]
In certain infinitely-wide neural networks, the neural tangent kernel (NTK) theory fully characterizes generalization.
We show that the linear approximations can indeed rank the learning complexity of certain tasks for neural networks.
Our work provides concrete examples of novel deep learning phenomena which can inspire future theoretical research.
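The linearization in question is the first-order Taylor expansion of the network output in its parameters around initialization, f_lin(x) = f(x; theta_0) + grad_theta f(x; theta_0) . (theta - theta_0), whose training dynamics are governed by the NTK. A minimal sketch for a tiny ReLU network (the sizes and the parameter perturbation are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 64
W0 = rng.standard_normal((m, d))              # initialization
a0 = rng.standard_normal(m) / np.sqrt(m)

def f(x, W, a):
    """Two-layer ReLU network output."""
    return a @ np.maximum(W @ x, 0.0)

def f_lin(x, W, a):
    """First-order Taylor expansion of f around (W0, a0)."""
    h0 = np.maximum(W0 @ x, 0.0)
    mask = (W0 @ x > 0).astype(float)
    df_da = h0                                # gradient wrt a at init
    df_dW = (a0 * mask)[:, None] * x[None, :] # gradient wrt W at init
    return f(x, W0, a0) + df_da @ (a - a0) + np.sum(df_dW * (W - W0))

x = rng.standard_normal(d)
W = W0 + 0.01 * rng.standard_normal((m, d))   # small parameter step
a = a0 + 0.01 * rng.standard_normal(m)
print(f(x, W, a), f_lin(x, W, a))             # nearly equal for small steps
```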
arXiv Detail & Related papers (2021-06-12T13:05:11Z) - An analytic theory of shallow networks dynamics for hinge loss
classification [14.323962459195771]
We study the training dynamics of a simple type of neural network: a single hidden layer trained to perform a classification task.
We specialize our theory to the prototypical case of a linearly separable dataset and a linear hinge loss.
This allows us to address, in a simple setting, several phenomena that appear in modern networks, such as the slowing down of training dynamics, the crossover between rich and lazy learning, and overfitting.
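For reference, the hinge loss on an example (x, y) with y in {-1, +1} is max(0, 1 - y f(x)). The sketch below runs subgradient descent on this loss for a single-hidden-layer ReLU network on linearly separable toy data; the data, architecture, and step size are assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linearly separable toy data: labels from the sign of a fixed direction.
n, d, m = 200, 10, 100
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = np.sign(X @ w_star)

W = rng.standard_normal((m, d)) / np.sqrt(d)
a = rng.standard_normal(m) / np.sqrt(m)

lr = 0.05
for t in range(500):
    H = np.maximum(X @ W.T, 0.0)              # (n, m) hidden activations
    f = H @ a
    margin = y * f
    active = margin < 1.0                     # examples on the hinge's linear part
    g = np.where(active, -y, 0.0) / n         # subgradient dL/df per example
    mask = (H > 0).astype(X.dtype)
    grad_a = H.T @ g
    grad_W = ((g[:, None] * mask) * a).T @ X
    a -= lr * grad_a
    W -= lr * grad_W
    if t % 100 == 0:
        loss = np.maximum(0.0, 1.0 - margin).mean()
        print(f"step {t:3d}  hinge loss {loss:.4f}  acc {(np.sign(f) == y).mean():.3f}")
```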
arXiv Detail & Related papers (2020-06-19T16:25:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.