Deep learning versus kernel learning: an empirical study of loss
landscape geometry and the time evolution of the Neural Tangent Kernel
- URL: http://arxiv.org/abs/2010.15110v1
- Date: Wed, 28 Oct 2020 17:53:01 GMT
- Title: Deep learning versus kernel learning: an empirical study of loss
landscape geometry and the time evolution of the Neural Tangent Kernel
- Authors: Stanislav Fort, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh
Kharaghani, Daniel M. Roy, Surya Ganguli
- Abstract summary: We study the relationship between the training dynamics of nonlinear deep networks, the geometry of the loss landscape, and the time evolution of a data-dependent NTK.
In multiple neural architectures and datasets, we find these diverse measures evolve in a highly correlated manner, revealing a universal picture of the deep learning process.
- Score: 41.79250783277419
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In suitably initialized wide networks, small learning rates transform deep
neural networks (DNNs) into neural tangent kernel (NTK) machines, whose
training dynamics is well-approximated by a linear weight expansion of the
network at initialization. Standard training, however, diverges from its
linearization in ways that are poorly understood. We study the relationship
between the training dynamics of nonlinear deep networks, the geometry of the
loss landscape, and the time evolution of a data-dependent NTK. We do so
through a large-scale phenomenological analysis of training, synthesizing
diverse measures characterizing loss landscape geometry and NTK dynamics. In
multiple neural architectures and datasets, we find these diverse measures
evolve in a highly correlated manner, revealing a universal picture of the deep
learning process. In this picture, deep network training exhibits a highly
chaotic rapid initial transient that within 2 to 3 epochs determines the final
linearly connected basin of low loss containing the end point of training.
During this chaotic transient, the NTK changes rapidly, learning useful
features from the training data that enable it to outperform the standard
initial NTK by a factor of 3 in less than 3 to 4 epochs. After this rapid
chaotic transient, the NTK changes at constant velocity, and its performance
matches that of full network training in 15% to 45% of training time. Overall,
our analysis reveals a striking correlation between a diverse set of metrics
over training time, governed by a rapid chaotic-to-stable transition in the
first few epochs; together, these findings pose challenges and opportunities
for the development of more accurate theories of deep learning.
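To make the quantities in the abstract concrete: in the standard NTK framework (not specific to this paper's experiments), the linearized model around the initialization and the data-dependent empirical NTK are

```latex
% Linearized ("lazy") model around the initialization \theta_0
f_{\mathrm{lin}}(x;\theta) = f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^{\top}(\theta - \theta_0)

% Empirical, data-dependent NTK evaluated at the parameters \theta_t reached at time t
\Theta_t(x, x') = \nabla_\theta f(x;\theta_t)^{\top}\,\nabla_\theta f(x';\theta_t)
```

The sketch below computes this empirical NTK Gram matrix for a toy two-layer network in JAX, together with a simple cosine-based kernel distance between two parameter checkpoints. The architecture, sizes, and the particular distance are illustrative assumptions in the spirit of the abstract, not the paper's exact experimental setup.

```python
import jax
import jax.numpy as jnp

def init_params(key, d_in=10, d_hidden=64):
    """Flat parameter vector for a toy 2-layer tanh network (illustrative sizes)."""
    k1, k2 = jax.random.split(key)
    w1 = jax.random.normal(k1, (d_in, d_hidden)) / jnp.sqrt(d_in)
    w2 = jax.random.normal(k2, (d_hidden, 1)) / jnp.sqrt(d_hidden)
    return jnp.concatenate([w1.ravel(), w2.ravel()]), (d_in, d_hidden)

def forward(flat_params, x, shapes):
    """Scalar-output network f(x; theta) evaluated on a batch x of shape (n, d_in)."""
    d_in, d_hidden = shapes
    n1 = d_in * d_hidden
    w1 = flat_params[:n1].reshape(d_in, d_hidden)
    w2 = flat_params[n1:].reshape(d_hidden, 1)
    return (jnp.tanh(x @ w1) @ w2).squeeze(-1)

def empirical_ntk(flat_params, x, shapes):
    """Gram matrix Theta[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>."""
    jac = jax.jacobian(forward)(flat_params, x, shapes)   # shape (n, n_params)
    return jac @ jac.T                                    # shape (n, n)

def kernel_distance(k_a, k_b):
    """1 - cosine similarity between flattened kernels: one simple way to
    quantify how much the NTK has moved between two checkpoints."""
    num = jnp.sum(k_a * k_b)
    den = jnp.linalg.norm(k_a) * jnp.linalg.norm(k_b)
    return 1.0 - num / den

# Example: compare the NTK at initialization against a perturbed checkpoint.
key = jax.random.PRNGKey(0)
params0, shapes = init_params(key)
x = jax.random.normal(jax.random.PRNGKey(1), (32, 10))
k0 = empirical_ntk(params0, x, shapes)
params1 = params0 + 0.1 * jax.random.normal(jax.random.PRNGKey(2), params0.shape)
print(float(kernel_distance(k0, empirical_ntk(params1, x, shapes))))
```

A kernel-velocity-style quantity, in the spirit of the abstract's "constant velocity" observation, could then be estimated as the kernel distance between consecutive checkpoints divided by the training time separating them.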
Related papers
- Reactivation: Empirical NTK Dynamics Under Task Shifts [12.32540447018987]
The Neural Tangent Kernel (NTK) offers a powerful tool to study the functional dynamics of neural networks. In the so-called lazy, or kernel, regime, the NTK remains static during training and the network function is linear in the static neural tangent feature space. The study of NTK dynamics has led to several critical discoveries in recent years on generalization and scaling behaviour.
arXiv Detail & Related papers (2025-07-21T20:13:02Z) - Connecting NTK and NNGP: A Unified Theoretical Framework for Wide Neural Network Learning Dynamics [6.349503549199403]
We provide a comprehensive framework for the learning process of deep wide neural networks.
By characterizing the diffusive phase, our work sheds light on representational drift in the brain.
arXiv Detail & Related papers (2023-09-08T18:00:01Z) - Stabilizing RNN Gradients through Pre-training [3.335932527835653]
Learning theory proposes preventing the gradient from growing exponentially with depth or time in order to stabilize and improve training.
We extend known stability theories to encompass a broader family of deep recurrent networks, requiring minimal assumptions on data and parameter distribution.
We propose a new approach to mitigate this issue, which consists of giving a weight of one half to the time and depth contributions to the gradient.
arXiv Detail & Related papers (2023-08-23T11:48:35Z) - Critical Learning Periods for Multisensory Integration in Deep Networks [112.40005682521638]
We show that the ability of a neural network to integrate information from diverse sources hinges critically on being exposed to properly correlated signals during the early phases of training.
We show that critical periods arise from the complex and unstable early transient dynamics, which are decisive for the final performance of the trained system and its learned representations.
arXiv Detail & Related papers (2022-10-06T23:50:38Z) - Characterizing Learning Dynamics of Deep Neural Networks via Complex
Networks [1.0869257688521987]
Complex Network Theory (CNT) represents Deep Neural Networks (DNNs) as directed weighted graphs to study them as dynamical systems.
We introduce metrics for nodes/neurons and layers, namely Nodes Strength and Layers Fluctuation.
Our framework distills trends in the learning dynamics and separates low-accuracy from high-accuracy networks (a minimal sketch of such node-level metrics appears after this list).
arXiv Detail & Related papers (2021-10-06T10:03:32Z) - Dynamic Neural Diversification: Path to Computationally Sustainable
Neural Networks [68.8204255655161]
Small neural networks with a constrained number of trainable parameters can be suitable, resource-efficient candidates for many simple tasks.
We explore the diversity of the neurons within the hidden layer during the learning process.
We analyze how the diversity of the neurons affects predictions of the model.
arXiv Detail & Related papers (2021-09-20T15:12:16Z) - What can linearized neural networks actually say about generalization? [67.83999394554621]
In certain infinitely-wide neural networks, the neural tangent kernel (NTK) theory fully characterizes generalization.
We show that the linear approximations can indeed rank the learning complexity of certain tasks for neural networks.
Our work provides concrete examples of novel deep learning phenomena which can inspire future theoretical research.
arXiv Detail & Related papers (2021-06-12T13:05:11Z) - What training reveals about neural network complexity [80.87515604428346]
This work explores the hypothesis that the complexity of the function a deep neural network (NN) is learning can be deduced by how fast its weights change during training.
Our results support the hypothesis that good training behavior can be a useful bias towards good generalization.
arXiv Detail & Related papers (2021-06-08T08:58:00Z) - The Surprising Simplicity of the Early-Time Learning Dynamics of Neural
Networks [43.860358308049044]
In this work, we show that these common perceptions can be completely false in the early phase of learning.
We argue that this surprising simplicity can persist in networks with more layers and with convolutional architectures.
arXiv Detail & Related papers (2020-06-25T17:42:49Z) - An analytic theory of shallow networks dynamics for hinge loss
classification [14.323962459195771]
We study the training dynamics of a simple type of neural network: a single hidden layer trained to perform a classification task.
We specialize our theory to the prototypical case of a linearly separable dataset and a linear hinge loss.
This allows us to address, in a simple setting, several phenomena appearing in modern networks, such as the slowing down of training dynamics, the crossover between rich and lazy learning, and overfitting.
arXiv Detail & Related papers (2020-06-19T16:25:29Z) - On the Neural Tangent Kernel of Deep Networks with Orthogonal
Initialization [18.424756271923524]
We study the dynamics of ultra-wide networks across a range of architectures, including Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs).
arXiv Detail & Related papers (2020-04-13T11:12:53Z) - A Generalized Neural Tangent Kernel Analysis for Two-layer Neural
Networks [87.23360438947114]
We show that noisy gradient descent with weight decay can still exhibit a "kernel-like" behavior.
This implies that the training loss converges linearly up to a certain accuracy.
We also establish a novel generalization error bound for two-layer neural networks trained by noisy gradient descent with weight decay.
arXiv Detail & Related papers (2020-02-10T18:56:15Z)
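For the Complex Network Theory entry above, the following is a minimal sketch of node-level metrics, assuming the standard weighted-degree notion of node strength from network science; the paper's exact definitions of Nodes Strength and Layers Fluctuation may differ.

```python
import numpy as np

def node_strength(weight_matrix: np.ndarray) -> np.ndarray:
    """Strength of each output neuron: sum of absolute incoming weights
    (the weighted-degree notion; an assumption, not the paper's exact metric)."""
    return np.abs(weight_matrix).sum(axis=0)

def layer_fluctuation(weight_matrix: np.ndarray) -> float:
    """One plausible per-layer summary: the standard deviation of node
    strengths, tracking how unevenly weight mass spreads across neurons."""
    return float(np.std(node_strength(weight_matrix)))

# Example on a random dense layer with 10 inputs and 5 output neurons.
w = np.random.default_rng(0).normal(size=(10, 5))
print(node_strength(w), layer_fluctuation(w))
```

Tracking such per-layer summaries across training epochs is one simple way metrics of this kind could be used to distill trends in the learning dynamics.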