The Three Stages of Learning Dynamics in High-Dimensional Kernel Methods
- URL: http://arxiv.org/abs/2111.07167v1
- Date: Sat, 13 Nov 2021 18:29:36 GMT
- Title: The Three Stages of Learning Dynamics in High-Dimensional Kernel Methods
- Authors: Nikhil Ghosh, Song Mei, Bin Yu
- Abstract summary: We consider the training dynamics of gradient flow on kernel least-squares objectives, which is a limiting dynamics of SGD-trained neural networks.
We show that SGD progressively learns more complex functions and that there is a "deep bootstrap" phenomenon.
- Score: 14.294357100033954
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To understand how deep learning works, it is crucial to understand the
training dynamics of neural networks. Several interesting hypotheses about
these dynamics have been made based on empirically observed phenomena, but
there exists a limited theoretical understanding of when and why such phenomena
occur.
In this paper, we consider the training dynamics of gradient flow on kernel
least-squares objectives, which is a limiting dynamics of SGD trained neural
networks. Using precise high-dimensional asymptotics, we characterize the
dynamics of the fitted model in two "worlds": in the Oracle World the model is
trained on the population distribution and in the Empirical World the model is
trained on a sampled dataset. We show that under mild conditions on the kernel
and $L^2$ target regression function the training dynamics undergo three stages
characterized by the behaviors of the models in the two worlds. Our theoretical
results also mathematically formalize some interesting deep learning phenomena.
Specifically, in our setting we show that SGD progressively learns more complex
functions and that there is a "deep bootstrap" phenomenon: during the second
stage, the test errors of both worlds remain close despite the empirical
training error being much smaller. Finally, we give a concrete example
comparing the dynamics of two different kernels which shows that faster
training is not necessary for better generalization.
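The abstract lends itself to a small numerical illustration. Below is a minimal, self-contained sketch (not the authors' code) that discretizes the kernel gradient flow df/dt = (1/n) K (y - f) and compares an "Empirical World" model trained on a small sample with an "Oracle World" stand-in trained on a much larger sample; the RBF kernel, dimension, sample sizes, toy target, and step size are all illustrative assumptions rather than the paper's exact high-dimensional setting.

```python
# Minimal sketch (illustrative assumptions, not the authors' code) of the
# Empirical World vs. Oracle World comparison for kernel gradient flow.
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(X, Z, gamma):
    # K[i, j] = exp(-gamma * ||x_i - z_j||^2)
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def target(X):
    # Illustrative L^2 target: a linear term plus low-order nonlinearities.
    return X[:, 0] - 0.5 * X[:, 0] * X[:, 1] + 0.3 * np.sin(X[:, 2])

d = 5
gamma = 1.0 / d
n_emp, n_ora, n_test = 100, 2000, 1000    # 2000 samples stand in for the population
X_emp = rng.standard_normal((n_emp, d))
X_ora = rng.standard_normal((n_ora, d))
X_tst = rng.standard_normal((n_test, d))
y_emp, y_ora, y_tst = target(X_emp), target(X_ora), target(X_tst)

def kernel_gradient_flow(X_tr, y_tr, eta=0.5, steps=2000, record_every=100):
    # Euler discretization of df/dt = (1/n) K (y - f), with predictions
    # f(x) = sum_i a_i k(x, x_i); the 1/n keeps both worlds on one time scale.
    n = len(y_tr)
    K = rbf_kernel(X_tr, X_tr, gamma)
    K_tst = rbf_kernel(X_tst, X_tr, gamma)
    a = np.zeros(n)
    train_err, test_err = [], []
    for step in range(steps):
        resid = y_tr - K @ a
        if step % record_every == 0:
            train_err.append(np.mean(resid ** 2))
            test_err.append(np.mean((y_tst - K_tst @ a) ** 2))
        a += (eta / n) * resid            # functional gradient step in the RKHS
    return np.array(train_err), np.array(test_err)

tr_emp, te_emp = kernel_gradient_flow(X_emp, y_emp)
tr_ora, te_ora = kernel_gradient_flow(X_ora, y_ora)

# During an intermediate stretch of training the two test-error curves can
# stay close while the empirical training error is already much smaller --
# the "deep bootstrap"-style behavior described in the abstract.
for i, (te_e, te_o, tr_e) in enumerate(zip(te_emp, te_ora, tr_emp)):
    print(f"step={i * 100:5d}  test(emp)={te_e:.3f}  test(oracle)={te_o:.3f}  train(emp)={tr_e:.3f}")
```

Rerunning the same script with a different kernel bandwidth (gamma) gives a crude way to compare two kernels' training speed against their test error, in the spirit of the abstract's final example.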
Related papers
- High-order expansion of Neural Ordinary Differential Equations flows [4.4569182855550755]
We introduce Event Transitions, a framework based on high-order differentials that provides a rigorous mathematical description of neural ODE dynamics in event-triggered settings.
Our findings contribute to a deeper theoretical foundation for event-triggered neural differential equations and provide a mathematical construct for explaining complex system dynamics.
arXiv Detail & Related papers (2025-04-02T08:57:34Z) - On the Cone Effect in the Learning Dynamics [57.02319387815831]
We take an empirical perspective to study the learning dynamics of neural networks in real-world settings.
Our key findings reveal a two-phase learning process: i) in Phase I, the eNTK evolves significantly, signaling the rich regime, and ii) in Phase II, the eNTK keeps evolving but is constrained in a narrow space.
arXiv Detail & Related papers (2025-03-20T16:38:25Z) - Transformer Dynamics: A neuroscientific approach to interpretability of large language models [0.0]
We focus on the residual stream (RS) in transformer models, conceptualizing it as a dynamical system evolving across layers.
We find that activations of individual RS units exhibit strong continuity across layers, despite the RS being a non-privileged basis.
In reduced-dimensional spaces, the RS follows a curved trajectory with attractor-like dynamics in the lower layers.
arXiv Detail & Related papers (2025-02-17T18:49:40Z) - Unified View of Grokking, Double Descent and Emergent Abilities: A
Perspective from Circuits Competition [83.13280812128411]
Recent studies have uncovered intriguing phenomena in deep learning, such as grokking, double descent, and emergent abilities in large language models.
We present a comprehensive framework that provides a unified view of these three phenomena, focusing on the competition between memorization and generalization circuits.
arXiv Detail & Related papers (2024-02-23T08:14:36Z) - A Survey on Statistical Theory of Deep Learning: Approximation, Training Dynamics, and Generative Models [13.283281356356161]
We review the literature on statistical theories of neural networks from three perspectives.
Results on excess risks for neural networks are reviewed.
Papers that attempt to answer how neural networks find solutions that generalize well on unseen data are reviewed.
arXiv Detail & Related papers (2024-01-14T02:30:19Z) - A Functional-Space Mean-Field Theory of Partially-Trained Three-Layer
Neural Networks [49.870593940818715]
We study the infinite-width limit of a type of three-layer NN model whose first layer is random and fixed.
Our theory accommodates different scaling choices of the model, resulting in two regimes of the MF limit that demonstrate distinctive behaviors.
arXiv Detail & Related papers (2022-10-28T17:26:27Z) - Knowledge-based Deep Learning for Modeling Chaotic Systems [7.075125892721573]
This paper considers extreme events and their dynamics, and proposes models based on deep neural networks, called knowledge-based deep learning (KDL).
Our proposed KDL can learn the complex patterns governing chaotic systems by jointly training on real and simulated data.
We validate our model by assessing it on three real-world benchmark datasets: El Nino sea surface temperature, San Juan Dengue viral infection, and Bjornoya daily precipitation.
arXiv Detail & Related papers (2022-09-09T11:46:25Z) - Decomposed Linear Dynamical Systems (dLDS) for learning the latent
components of neural dynamics [6.829711787905569]
We propose a new decomposed dynamical system model that represents complex non-stationary and nonlinear dynamics of time series data.
Our model is trained through a dictionary learning procedure, where we leverage recent results in tracking sparse vectors over time.
In both continuous-time and discrete-time instructional examples we demonstrate that our model can well approximate the original system.
arXiv Detail & Related papers (2022-06-07T02:25:38Z) - Characterizing and overcoming the greedy nature of learning in
multi-modal deep neural networks [62.48782506095565]
We show that due to the greedy nature of learning in deep neural networks, models tend to rely on just one modality while under-fitting the other modalities.
We propose an algorithm to balance the conditional learning speeds between modalities during training and demonstrate that it indeed addresses the issue of greedy learning.
arXiv Detail & Related papers (2022-02-10T20:11:21Z) - Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z) - Deep Active Learning by Leveraging Training Dynamics [57.95155565319465]
We propose a theory-driven deep active learning method (dynamicAL) which selects samples to maximize training dynamics.
We show that dynamicAL not only outperforms other baselines consistently but also scales well on large deep learning models.
arXiv Detail & Related papers (2021-10-16T16:51:05Z) - SGD Distributional Dynamics of Three Layer Neural Networks [7.025709586759655]
In this paper, we seek to extend the mean-field results of Mei et al. from two-layer neural networks with one hidden layer to three-layer neural networks with two hidden layers.
We show that the SGD dynamics is captured by a set of non-linear differential equations, and prove that the distributions of the dynamics in the two hidden layers are independent.
arXiv Detail & Related papers (2020-12-30T04:37:09Z) - The large learning rate phase of deep learning: the catapult mechanism [50.23041928811575]
We present a class of neural networks with solvable training dynamics.
We find good agreement between our model's predictions and training dynamics in realistic deep learning settings.
We believe our results shed light on characteristics of models trained at different learning rates.
arXiv Detail & Related papers (2020-03-04T17:52:48Z)
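To make the last entry's "catapult" description concrete, here is a minimal sketch under illustrative assumptions (a two-layer linear network trained on a single example; not that paper's code). It runs gradient descent once with a step size below 2/lambda0 and once in the range (2/lambda0, 4/lambda0), where lambda0 is the initial tangent-kernel value on the example; in the second run the loss typically spikes before converging.

```python
# Minimal sketch (illustrative assumptions) of large-learning-rate "catapult"
# behavior for f(x) = u . (W x) / sqrt(m) on a single training example.
import numpy as np

rng = np.random.default_rng(1)
d, m = 20, 512
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                  # normalize the single training input
y = 1.0

W = rng.standard_normal((m, d))
u = rng.standard_normal(m)

def run(W, u, eta, steps=200):
    # Plain gradient descent on the squared loss 0.5 * (f - y)^2.
    W, u = W.copy(), u.copy()
    losses = []
    for _ in range(steps):
        h = W @ x / np.sqrt(m)          # hidden units (linear network)
        f = u @ h                       # scalar prediction
        err = f - y
        losses.append(0.5 * err ** 2)
        grad_u = err * h
        grad_W = err * np.outer(u, x) / np.sqrt(m)
        u -= eta * grad_u
        W -= eta * grad_W
    return np.array(losses)

# Initial tangent-kernel value on this example: ||df/dtheta||^2 at init.
h0 = W @ x / np.sqrt(m)
lam0 = h0 @ h0 + (u @ u) * (x @ x) / m

small = run(W, u, eta=1.0 / lam0)       # small-step regime: monotone decay
large = run(W, u, eta=3.0 / lam0)       # catapult regime: spike, then decay
print("lambda0 =", lam0)
print("small-lr loss:", small[[0, 5, 20, 100, -1]])
print("large-lr loss:", large[[0, 5, 20, 100, -1]])
```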
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the generated summaries (including all information) and is not responsible for any consequences of their use.