The lazy (NTK) and rich ($μ$P) regimes: a gentle tutorial
- URL: http://arxiv.org/abs/2404.19719v1
- Date: Tue, 30 Apr 2024 17:11:12 GMT
- Title: The lazy (NTK) and rich ($μ$P) regimes: a gentle tutorial
- Authors: Dhruva Karkada,
- Abstract summary: A central theme of the modern machine learning paradigm is that larger neural networks achieve better performance on a variety of metrics.
In this tutorial, we provide a nonrigorous but illustrative derivation of the following fact: in order to train wide networks effectively, there is only one degree of freedom.
This degree of freedom controls the richness of training behavior: at minimum, the wide network trains lazily like a kernel machine, and at maximum, it exhibits feature learning.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A central theme of the modern machine learning paradigm is that larger neural networks achieve better performance on a variety of metrics. Theoretical analyses of these overparameterized models have recently centered around studying very wide neural networks. In this tutorial, we provide a nonrigorous but illustrative derivation of the following fact: in order to train wide networks effectively, there is only one degree of freedom in choosing hyperparameters such as the learning rate and the size of the initial weights. This degree of freedom controls the richness of training behavior: at minimum, the wide network trains lazily like a kernel machine, and at maximum, it exhibits feature learning in the so-called $\mu$P regime. In this paper, we explain this richness scale, synthesize recent research results into a coherent whole, offer new perspectives and intuitions, and provide empirical evidence supporting our claims. In doing so, we hope to encourage further study of the richness scale, as it may be key to developing a scientific theory of feature learning in practical deep neural networks.
Related papers
- Towards Scalable and Versatile Weight Space Learning [51.78426981947659]
This paper introduces the SANE approach to weight-space learning.
Our method extends the idea of hyper-representations towards sequential processing of subsets of neural network weights.
arXiv Detail & Related papers (2024-06-14T13:12:07Z) - Demystifying Lazy Training of Neural Networks from a Macroscopic Viewpoint [5.9954962391837885]
We study the gradient descent dynamics of neural networks through the lens of macroscopic limits.
Our study reveals that gradient descent can rapidly drive deep neural networks to zero training loss.
Our approach draws inspiration from the Neural Tangent Kernel (NTK) paradigm.
arXiv Detail & Related papers (2024-04-07T08:07:02Z) - Why do Learning Rates Transfer? Reconciling Optimization and Scaling
Limits for Deep Learning [77.82908213345864]
We find empirical evidence that learning rate transfer can be attributed to the fact that under $mu$P and its depth extension, the largest eigenvalue of the training loss Hessian is largely independent of the width and depth of the network.
We show that under the neural tangent kernel (NTK) regime, the sharpness exhibits very different dynamics at different scales, thus preventing learning rate transfer.
arXiv Detail & Related papers (2024-02-27T12:28:01Z) - Joint Feature and Differentiable $ k $-NN Graph Learning using Dirichlet
Energy [103.74640329539389]
We propose a deep FS method that simultaneously conducts feature selection and differentiable $ k $-NN graph learning.
We employ Optimal Transport theory to address the non-differentiability issue of learning $ k $-NN graphs in neural networks.
We validate the effectiveness of our model with extensive experiments on both synthetic and real-world datasets.
arXiv Detail & Related papers (2023-05-21T08:15:55Z) - What Can the Neural Tangent Kernel Tell Us About Adversarial Robustness? [0.0]
We study adversarial examples of trained neural networks through analytical tools afforded by recent theory advances connecting neural networks and kernel methods.
We show how NTKs allow to generate adversarial examples in a training-free'' fashion, and demonstrate that they transfer to fool their finite-width neural net counterparts in the lazy'' regime.
arXiv Detail & Related papers (2022-10-11T16:11:48Z) - Neural Capacitance: A New Perspective of Neural Network Selection via
Edge Dynamics [85.31710759801705]
Current practice requires expensive computational costs in model training for performance prediction.
We propose a novel framework for neural network selection by analyzing the governing dynamics over synaptic connections (edges) during training.
Our framework is built on the fact that back-propagation during neural network training is equivalent to the dynamical evolution of synaptic connections.
arXiv Detail & Related papers (2022-01-11T20:53:15Z) - What can linearized neural networks actually say about generalization? [67.83999394554621]
In certain infinitely-wide neural networks, the neural tangent kernel (NTK) theory fully characterizes generalization.
We show that the linear approximations can indeed rank the learning complexity of certain tasks for neural networks.
Our work provides concrete examples of novel deep learning phenomena which can inspire future theoretical research.
arXiv Detail & Related papers (2021-06-12T13:05:11Z) - The large learning rate phase of deep learning: the catapult mechanism [50.23041928811575]
We present a class of neural networks with solvable training dynamics.
We find good agreement between our model's predictions and training dynamics in realistic deep learning settings.
We believe our results shed light on characteristics of models trained at different learning rates.
arXiv Detail & Related papers (2020-03-04T17:52:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.