Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks
- URL: http://arxiv.org/abs/2502.21269v3
- Date: Wed, 29 Oct 2025 17:46:23 GMT
- Title: Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks
- Authors: Andrea Montanari, Pierfrancesco Urbani
- Abstract summary: We study the learning dynamics of large two-layer neural networks via dynamical mean field theory. For large network width $m$ and a large number of samples per input dimension $n/d$, the training dynamics exhibits a separation of timescales.
- Score: 10.591718074748895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding the inductive bias and generalization properties of large overparametrized machine learning models requires characterizing the dynamics of the training algorithm. We study the learning dynamics of large two-layer neural networks via dynamical mean field theory, a well-established technique from non-equilibrium statistical physics. We show that, for large network width $m$ and a large number of samples per input dimension $n/d$, the training dynamics exhibits a separation of timescales which implies: $(i)$~The emergence of a slow time scale associated with the growth in Gaussian/Rademacher complexity of the network; $(ii)$~Inductive bias towards small complexity if the initialization has small enough complexity; $(iii)$~A dynamical decoupling between feature learning and overfitting regimes; $(iv)$~A non-monotone behavior of the test error, associated with a `feature unlearning' regime at large times.
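As a concrete illustration of the quantities discussed in the abstract, the following is a minimal numpy sketch, not the authors' dynamical mean field theory analysis: it trains a wide two-layer network with plain full-batch gradient descent on a synthetic single-index task and periodically reports the test error alongside a crude complexity proxy, the path norm $\sum_j |a_j|\,\|w_j\|_2$, which controls Rademacher complexity up to constants. The width, data model, learning rate, and choice of proxy are illustrative assumptions rather than settings taken from the paper.

# Illustrative sketch only (assumptions: single-index teacher, path-norm proxy,
# full-batch gradient descent); NOT the paper's DMFT computation.
import numpy as np

rng = np.random.default_rng(0)
d, m, n, n_test = 20, 512, 400, 2000      # input dim, width, train/test sample sizes
lr, steps = 0.05, 5000                     # step size and number of GD iterations

def target(X):                             # synthetic single-index teacher (assumed)
    return np.tanh(X[:, 0])

X, Xt = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
y, yt = target(X), target(Xt)

W = rng.standard_normal((m, d)) / np.sqrt(d)   # first-layer weights
a = rng.standard_normal(m) / m                 # second-layer weights, mean-field-like scale

def forward(X, W, a):
    return np.tanh(X @ W.T) @ a

for t in range(steps):
    H = np.tanh(X @ W.T)                   # hidden activations, shape (n, m)
    r = H @ a - y                          # residuals on the training set
    grad_a = H.T @ r / n                   # gradient of 0.5*MSE w.r.t. second layer
    grad_W = ((r[:, None] * (1.0 - H**2)) * a).T @ X / n   # gradient w.r.t. first layer
    a -= lr * grad_a
    W -= lr * grad_W
    if t % 500 == 0:
        test_mse = np.mean((forward(Xt, W, a) - yt) ** 2)
        path_norm = np.sum(np.abs(a) * np.linalg.norm(W, axis=1))   # complexity proxy
        print(f"step {t:5d}  test MSE {test_mse:.4f}  path norm {path_norm:.3f}")

Plotting the two printed quantities against training time is one simple way to visualize the slow growth of complexity and the non-monotone test error referred to in items $(i)$ and $(iv)$ above.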
Related papers
- On the Generalization Behavior of Deep Residual Networks From a Dynamical System Perspective [1.0388986221727612]
Deep neural networks (DNNs) have significantly advanced machine learning, with model depth playing a central role in their successes. In this work, we establish generalization error bounds for both discrete- and continuous-time residual networks (ResNets) by combining Rademacher complexity, flow maps of dynamical systems, and the convergence behavior of ResNets in the deep-layer limit. These findings provide a unified understanding of generalization across both discrete- and continuous-time ResNets, helping to close the gap between the two settings in both the order of sample complexity and the underlying assumptions.
arXiv Detail & Related papers (2026-02-24T13:59:06Z) - Gradient Descent as a Perceptron Algorithm: Understanding Dynamics and Implicit Acceleration [67.12978375116599]
We show that the steps of gradient descent (GD) reduce to those of generalized perceptron algorithms. This helps explain the optimization dynamics and the implicit acceleration phenomenon observed in neural networks.
arXiv Detail & Related papers (2025-12-12T14:16:35Z) - The Sample Complexity of Online Reinforcement Learning: A Multi-model Perspective [55.15192437680943]
We study the sample complexity of online reinforcement learning in the general setting of nonlinear dynamical systems with continuous state and action spaces. Our algorithm achieves a policy regret of $\mathcal{O}(N\epsilon^2 + \mathrm{ln}(m(\epsilon))/\epsilon^2)$, where $N$ is the time horizon. In the special case where the dynamics are parametrized by a compact and real-valued set of parameters, we prove a policy regret of $\mathcal{O}(\sqrt{\dots})$.
arXiv Detail & Related papers (2025-01-27T10:01:28Z) - When are dynamical systems learned from time series data statistically accurate? [2.2577735334028923]
We propose an ergodic theoretic approach to generalization of complex dynamical models learned from time series data.
Our main contribution is to define and analyze generalization of neural representations of classes of ergodic systems, including chaotic systems.
arXiv Detail & Related papers (2024-11-09T23:44:17Z) - Learning Multi-Index Models with Neural Networks via Mean-Field Langevin Dynamics [21.55547541297847]
We study the problem of learning multi-index models in high dimensions using a two-layer neural network trained with the mean-field Langevin algorithm. Under mild distributional assumptions, we characterize the effective dimension $d_{\mathrm{eff}}$ that controls both sample and computational complexity.
arXiv Detail & Related papers (2024-08-14T02:13:35Z) - How neural networks learn to classify chaotic time series [77.34726150561087]
We study the inner workings of neural networks trained to classify regular-versus-chaotic time series.
We find that the relation between input periodicity and activation periodicity is key for the performance of LKCNN models.
arXiv Detail & Related papers (2023-06-04T08:53:27Z) - Capturing dynamical correlations using implicit neural representations [85.66456606776552]
We develop an artificial intelligence framework which combines a neural network trained to mimic simulated data from a model Hamiltonian with automatic differentiation to recover unknown parameters from experimental data.
In doing so, we illustrate the ability to build and train a differentiable model only once, which then can be applied in real-time to multi-dimensional scattering data.
arXiv Detail & Related papers (2023-04-08T07:55:36Z) - Stretched and measured neural predictions of complex network dynamics [2.1024950052120417]
Data-driven approximations of differential equations present a promising alternative to traditional methods for uncovering a model of dynamical systems.
Neural networks are a recently employed machine learning tool for studying dynamics; they can be used for data-driven solution finding or for the discovery of differential equations.
We show that extending the model's generalizability beyond traditional statistical learning theory limits is feasible.
arXiv Detail & Related papers (2023-01-12T09:44:59Z) - Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z) - Going Beyond Linear RL: Sample Efficient Neural Function Approximation [76.57464214864756]
We study function approximation with two-layer neural networks.
Our results significantly improve upon what can be attained with linear (or eluder dimension) methods.
arXiv Detail & Related papers (2021-07-14T03:03:56Z) - Fractal Structure and Generalization Properties of Stochastic
Optimization Algorithms [71.62575565990502]
We prove that the generalization error of an optimization algorithm can be bounded in terms of the `complexity' of the fractal structure that underlies its invariant measure.
We further specialize our results to specific problems (e.g., linear/logistic regression, one-hidden-layer neural networks) and algorithms.
arXiv Detail & Related papers (2021-06-09T08:05:36Z) - The Neural Tangent Kernel in High Dimensions: Triple Descent and a
Multi-Scale Theory of Generalization [34.235007566913396]
Modern deep learning models employ considerably more parameters than required to fit the training data. Whereas conventional statistical wisdom suggests such models should drastically overfit, in practice these models generalize remarkably well.
An emerging paradigm for describing this unexpected behavior is in terms of a `double descent' curve.
We provide a precise high-dimensional analysis of generalization with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks with gradient descent.
arXiv Detail & Related papers (2020-08-15T20:55:40Z) - Measuring Model Complexity of Neural Networks with Curve Activation
Functions [100.98319505253797]
We propose the linear approximation neural network (LANN) to approximate a given deep model with curve activation functions.
We experimentally explore the training process of neural networks and detect overfitting.
We find that the $L_1$ and $L_2$ regularizations suppress the increase of model complexity.
arXiv Detail & Related papers (2020-06-16T07:38:06Z) - Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Stochastic optimization is central to modern machine learning, but the precise role of the stochasticity in its success is still unclear.
We show that multiplicative noise commonly arises in discrete-time optimization due to variance and produces heavy tails in the parameters.
A detailed analysis describes how key factors, including the step size and the data, shape this behavior, with similar results observed on state-of-the-art neural network models.
arXiv Detail & Related papers (2020-06-11T09:58:01Z) - Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this kernel-to-rich transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)