A Two-Phase Perspective on Deep Learning Dynamics
- URL: http://arxiv.org/abs/2504.12700v1
- Date: Thu, 17 Apr 2025 06:57:37 GMT
- Title: A Two-Phase Perspective on Deep Learning Dynamics
- Authors: Robert de Mello Koch, Animik Ghosh
- Abstract summary: We propose that learning in deep neural networks proceeds in two phases: a rapid curve fitting phase followed by a slower compression or coarse graining phase. We empirically show that the associated timescales align in two rather different settings. We argue that the second phase is not actively optimized by standard training algorithms and may be unnecessarily prolonged.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose that learning in deep neural networks proceeds in two phases: a rapid curve fitting phase followed by a slower compression or coarse graining phase. This view is supported by the shared temporal structure of three phenomena: grokking, double descent and the information bottleneck, all of which exhibit a delayed onset of generalization well after training error reaches zero. We empirically show that the associated timescales align in two rather different settings. Mutual information between hidden layers and input data emerges as a natural progress measure, complementing circuit-based metrics such as local complexity and the linear mapping number. We argue that the second phase is not actively optimized by standard training algorithms and may be unnecessarily prolonged. Drawing on an analogy with the renormalization group, we suggest that this compression phase reflects a principled form of forgetting, critical for generalization.
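The abstract singles out the mutual information between hidden activations and the input as a natural progress measure for the compression phase. As a reading aid, and not the authors' code, a minimal binning estimator in the spirit of the information-bottleneck literature might look as follows; the toy network, bin count, and synthetic data are assumptions made for illustration.

```python
import numpy as np

def discretize(a, n_bins):
    """Map each column of `a` to equal-width integer bin indices in [0, n_bins)."""
    lo, hi = a.min(axis=0), a.max(axis=0)
    scaled = (a - lo) / np.where(hi > lo, hi - lo, 1.0)
    return np.minimum((scaled * n_bins).astype(int), n_bins - 1)

def entropy_bits(rows):
    """Shannon entropy (in bits) of the empirical distribution over rows."""
    _, counts = np.unique(rows, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(x, t, n_bins=8):
    """Estimate I(X;T) from binned samples via I = H(X) + H(T) - H(X,T)."""
    xs, ts = discretize(x, n_bins), discretize(t, n_bins)
    return entropy_bits(xs) + entropy_bits(ts) - entropy_bits(np.hstack([xs, ts]))

# Toy example: hidden activations of a random (untrained) one-layer tanh network.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))              # low-dimensional synthetic inputs
W = rng.normal(size=(3, 8)) / np.sqrt(3)
T = np.tanh(X @ W)                          # hidden-layer activations
# With fine bins the estimate saturates near log2(n_samples); keep the bins coarse.
print(f"I(X;T) ~= {mutual_information(X, T):.2f} bits")
```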
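The circuit-based metrics mentioned alongside it, local complexity and the linear mapping number, roughly count how many distinct linear pieces a ReLU network realises. A crude empirical proxy, again an illustrative assumption rather than the metric used in the paper, is to count distinct hidden-unit activation patterns over sampled inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# A small random one-hidden-layer ReLU network (placeholder weights, not trained).
d_in, d_h = 2, 16
W1 = rng.normal(size=(d_in, d_h))
b1 = rng.normal(size=d_h)

def activation_pattern(x):
    """Binary on/off pattern of the hidden ReLU units for a single input x."""
    return tuple((x @ W1 + b1 > 0).astype(int))

# Each distinct pattern corresponds to one linear piece of the network, so the
# number of patterns hit by a dense sample of a 2-D input region serves as a
# rough empirical proxy for a linear-mapping-number / local-complexity metric.
grid = rng.uniform(-3, 3, size=(20000, d_in))
patterns = {activation_pattern(x) for x in grid}
print("distinct activation patterns in the sampled region:", len(patterns))
```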
Related papers
- Learning in PINNs: Phase transition, total diffusion, and generalization [1.8802875123957965]
We investigate the learning dynamics of fully-connected neural networks through the lens of the gradient signal-to-noise ratio (SNR); a brief illustrative sketch of this quantity appears after this list.
We identify a third phase termed "total diffusion".
We explore the information-induced compression phenomenon, pinpointing a significant compression of activations at the total diffusion phase.
arXiv Detail & Related papers (2024-03-27T12:10:30Z) - A U-turn on Double Descent: Rethinking Parameter Counting in Statistical
Learning [68.76846801719095]
We re-examine when and where double descent appears, and show that its location is not inherently tied to the interpolation threshold p=n.
This resolves tensions between double descent and statistical intuition (a minimal double-descent demonstration appears after this list).
arXiv Detail & Related papers (2023-10-29T12:05:39Z) - Gradient-Based Feature Learning under Structured Data [57.76552698981579]
In the anisotropic setting, the commonly used spherical gradient dynamics may fail to recover the true direction.
We show that appropriate weight normalization that is reminiscent of batch normalization can alleviate this issue.
In particular, under the spiked model with a suitably large spike, the sample complexity of gradient-based training can be made independent of the information exponent.
arXiv Detail & Related papers (2023-09-07T16:55:50Z) - Intensity Profile Projection: A Framework for Continuous-Time
Representation Learning for Dynamic Networks [50.2033914945157]
We present a representation learning framework, Intensity Profile Projection, for continuous-time dynamic network data.
The framework consists of three stages, including estimating pairwise intensity functions and learning a projection that minimises a notion of intensity reconstruction error.
Moreover, we develop estimation theory providing tight control on the error of any estimated trajectory, indicating that the representations could even be used in quite noise-sensitive follow-on analyses.
arXiv Detail & Related papers (2023-06-09T15:38:25Z) - Deep Double Descent via Smooth Interpolation [2.141079906482723]
We quantify the sharpness of fit of the training data by studying the loss landscape with respect to the input variable, locally around each training point.
Our findings show that loss sharpness in the input space follows both model- and epoch-wise double descent, with worse peaks observed around noisy targets.
While small interpolating models sharply fit both clean and noisy data, large interpolating models express a smooth loss landscape, in contrast to existing intuition.
arXiv Detail & Related papers (2022-09-21T02:46:13Z) - Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z) - Double Descent Optimization Pattern and Aliasing: Caveats of Noisy
Labels [1.4424394176890545]
This work confirms that double descent occurs with small datasets and noisy labels.
We show that increasing the learning rate can create an aliasing effect that masks the double descent pattern without suppressing it.
We show that these findings translate to a real-world application: the forecast of events in epileptic patients from continuous electroencephalographic recordings.
arXiv Detail & Related papers (2021-06-03T19:41:40Z) - Critical Parameters for Scalable Distributed Learning with Large Batches
and Asynchronous Updates [67.19481956584465]
It has been experimentally observed that the efficiency of distributed training with stochastic gradient descent (SGD) depends decisively on the batch size and, in asynchronous implementations, on the staleness.
We show that our results are tight and illustrate key findings in numerical experiments.
arXiv Detail & Related papers (2021-03-03T12:08:23Z) - A Convergence Theory Towards Practical Over-parameterized Deep Neural
Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks whose width is quadratic in the sample size and linear in their depth, at a time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z) - A Random Matrix Theory Approach to Damping in Deep Learning [0.7614628596146599]
We conjecture that the inherent difference in generalisation between adaptive and non-adaptive gradient methods in deep learning stems from the increased estimation noise.
We develop a novel random matrix theory based damping learner for second-order optimisers, inspired by linear shrinkage estimation.
arXiv Detail & Related papers (2020-11-15T18:19:42Z)
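As referenced in the PINNs entry above, the gradient signal-to-noise ratio can be illustrated with a minimal sketch. The toy least-squares model, batch size, and the definition of SNR as the magnitude of the mean mini-batch gradient divided by its standard deviation across batches are assumptions made for illustration, not the quantity as defined in that paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch = 4096, 20, 64

# Synthetic regression data with label noise.
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n)

w = np.zeros(d)  # current parameters (e.g. early in training)

# Collect the mini-batch gradient of the squared loss over many batches.
grads = []
for _ in range(200):
    idx = rng.choice(n, size=batch, replace=False)
    Xb, yb = X[idx], y[idx]
    grads.append(2.0 * Xb.T @ (Xb @ w - yb) / batch)
grads = np.array(grads)

# Per-parameter SNR: |mean gradient| / std of the gradient across batches.
snr = np.abs(grads.mean(axis=0)) / (grads.std(axis=0) + 1e-12)
print("mean gradient SNR across parameters:", snr.mean().round(3))
```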
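As referenced in the double descent entry above, model-wise double descent admits a minimal, standard demonstration with random ReLU features and minimum-norm least squares (not taken from any of the papers listed): test error typically peaks near the interpolation threshold, where the number of features equals the number of training points, and falls again as the feature count grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 10

def make_data(n):
    X = rng.normal(size=(n, d))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)   # noisy 1-D signal
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

def random_relu_features(X, W):
    return np.maximum(X @ W, 0.0)

for p in [10, 50, 90, 100, 110, 200, 500, 2000]:      # number of random features
    W = rng.normal(size=(d, p)) / np.sqrt(d)
    F_tr = random_relu_features(X_tr, W)
    F_te = random_relu_features(X_te, W)
    # lstsq returns the minimum-norm solution when p > n_train (interpolation).
    beta, *_ = np.linalg.lstsq(F_tr, y_tr, rcond=None)
    test_mse = np.mean((F_te @ beta - y_te) ** 2)
    print(f"p = {p:5d}   test MSE = {test_mse:.3f}")
```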