MLPs at the EOC: Dynamics of Feature Learning
- URL: http://arxiv.org/abs/2502.13110v1
- Date: Tue, 18 Feb 2025 18:23:33 GMT
- Title: MLPs at the EOC: Dynamics of Feature Learning
- Authors: Dávid Terjék
- Abstract summary: We propose a theory to explain the convergence of gradient descent and the learning of features along the way.
Such a theory should also cover phenomena observed by practitioners, including the Edge of Stability (EOS) and the catapult mechanism.
- Score: 8.430481660019451
- License:
- Abstract: Since infinitely wide neural networks in the kernel regime are random feature models, the success of contemporary deep learning lies in the rich regime, where a satisfying theory should explain not only the convergence of gradient descent but the learning of features along the way. Such a theory should also cover phenomena observed by practitioners including the Edge of Stability (EOS) and the catapult mechanism. For a practically relevant theory in the limit, neural network parameterizations have to efficiently reproduce limiting behavior as width and depth are scaled up. While widthwise scaling is mostly settled, depthwise scaling is solved only at initialization by the Edge of Chaos (EOC). During training, scaling up depth is either done by inversely scaling the learning rate or adding residual connections. We propose $(1)$ the Normalized Update Parameterization ($\nu$P) to solve this issue by growing hidden layer sizes depthwise, inducing the regularized evolution of preactivations, $(2)$ a hypothetical explanation for feature learning via the cosine of new and cumulative parameter updates and $(3)$ a geometry-aware learning rate schedule that is able to prolong the catapult phase indefinitely. We support our hypotheses and demonstrate the usefulness of $\nu$P and the learning rate schedule by empirical evidence.
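As a rough illustration of proposal $(2)$, the sketch below tracks, for each parameter tensor of a small MLP, the cosine between the newest update and the cumulative update accumulated since initialization. This is only a hedged reading of the abstract: the paper's precise quantity, the $\nu$P parameterization and the geometry-aware schedule are not reproduced here, and all function names, layer sizes and hyperparameters are illustrative assumptions written against PyTorch.

```python
# Hypothetical sketch: cosine between the newest parameter update and the
# cumulative update since initialization, per parameter tensor. The exact
# quantity used in the paper may differ; all names here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

def update_cosines(model, init_params, prev_params):
    """Cosine of (new update) vs. (cumulative update) for each parameter tensor."""
    cosines = {}
    for name, p in model.named_parameters():
        new_update = (p.detach() - prev_params[name]).flatten()
        cumulative = (p.detach() - init_params[name]).flatten()
        if new_update.norm() > 0 and cumulative.norm() > 0:
            cosines[name] = F.cosine_similarity(new_update, cumulative, dim=0).item()
    return cosines

# Usage sketch: snapshot parameters before each step, then log the cosines.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
init = {n: p.detach().clone() for n, p in model.named_parameters()}
x, y = torch.randn(128, 32), torch.randn(128, 1)
for step in range(10):
    prev = {n: p.detach().clone() for n, p in model.named_parameters()}
    loss = F.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    cos = update_cosines(model, init, prev)
```

A cosine near $1$ would indicate that successive updates keep pushing the parameters in a consistent direction, which is one plausible way to read the abstract's feature-learning hypothesis.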
Related papers
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - Demystifying Lazy Training of Neural Networks from a Macroscopic Viewpoint [5.9954962391837885]
We study the gradient descent dynamics of neural networks through the lens of macroscopic limits.
Our study reveals that gradient descent can rapidly drive deep neural networks to zero training loss.
Our approach draws inspiration from the Neural Tangent Kernel (NTK) paradigm.
arXiv Detail & Related papers (2024-04-07T08:07:02Z) - A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z) - A Framework for Provably Stable and Consistent Training of Deep Feedforward Networks [4.21061712600981]
We present a novel algorithm for training deep neural networks in supervised (classification and regression) and unsupervised (reinforcement learning) scenarios.
This algorithm combines standard gradient descent with the gradient clipping method (a minimal sketch of this combination appears after this list).
We show, in theory and through experiments, that our algorithm updates have low variance, and the training loss reduces in a smooth manner.
arXiv Detail & Related papers (2023-05-20T07:18:06Z) - Understanding Deep Neural Function Approximation in Reinforcement Learning via $\epsilon$-Greedy Exploration [53.90873926758026]
This paper provides a theoretical study of deep neural function approximation in reinforcement learning (RL).
We focus on the value-based algorithm with $\epsilon$-greedy exploration via deep (and two-layer) neural networks endowed by Besov (and Barron) function spaces.
Our analysis reformulates the temporal difference error in an $L^2(\mathrm{d}\mu)$-integrable space over a certain averaged measure $\mu$, and transforms it to a generalization problem under the non-iid setting.
arXiv Detail & Related papers (2022-09-15T15:42:47Z) - Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit [36.17720004582283]
This work conducts such an exploration through the lens of learning $k$-sparse parities of $n$ bits.
We find that neural networks exhibit surprising phase transitions when scaling up dataset size and running time.
arXiv Detail & Related papers (2022-07-18T17:55:05Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Learning Rates as a Function of Batch Size: A Random Matrix Theory Approach to Neural Network Training [2.9649783577150837]
We study the effect of mini-batching on the loss landscape of deep neural networks using spiked, field-dependent random matrix theory.
We derive analytical expressions for the maximal descent and adaptive training regimens for smooth, non-Newton deep neural networks.
We validate our claims using VGG and ResNet architectures on the ImageNet dataset.
arXiv Detail & Related papers (2020-06-16T11:55:45Z) - AdaS: Adaptive Scheduling of Stochastic Gradients [50.80697760166045]
We introduce the notions of "knowledge gain" and "mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS).
Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training.
arXiv Detail & Related papers (2020-06-11T16:36:31Z) - Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
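For the entry above on provably stable and consistent training of deep feedforward networks, the referenced combination of gradient descent with gradient clipping can be sketched as follows. This is only the textbook pairing of plain SGD with global-norm gradient clipping, not that paper's actual algorithm; the model, learning rate and clipping threshold are illustrative assumptions.

```python
# Minimal sketch: plain gradient descent combined with global-norm gradient
# clipping. The referenced paper's algorithm likely differs; this only shows
# the two ingredients named in the summary above.
import torch
import torch.nn as nn
import torch.nn.functional as F

def clipped_sgd_step(model, loss, lr=1e-2, max_norm=1.0):
    model.zero_grad()
    loss.backward()
    # Rescale all gradients so their global L2 norm is at most max_norm.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
x, y = torch.randn(64, 16), torch.randn(64, 1)
for _ in range(5):
    clipped_sgd_step(model, F.mse_loss(model(x), y))
```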
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.