Phase Transitions between Accuracy Regimes in L2 regularized Deep Neural Networks
- URL: http://arxiv.org/abs/2505.06597v2
- Date: Wed, 27 Aug 2025 21:11:13 GMT
- Title: Phase Transitions between Accuracy Regimes in L2 regularized Deep Neural Networks
- Authors: Ibrahim Talha Ersoy, Karoline Wiesner,
- Abstract summary: Increasing the L2 regularization of Deep Neural Networks (DNNs) causes a first-order phase transition into the under-parametrized phase.<n>We predict new transition points as the data complexity is increased and, in accordance with the theory of phase transitions, the existence of effects.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Increasing the L2 regularization of Deep Neural Networks (DNNs) causes a first-order phase transition into the under-parametrized phase -- the so-called onset-of learning. We explain this transition via the scalar (Ricci) curvature of the error landscape. We predict new transition points as the data complexity is increased and, in accordance with the theory of phase transitions, the existence of hysteresis effects. We confirm both predictions numerically. Our results provide a natural explanation of the recently discovered phenomenon of '\emph{grokking}' as DNN models getting stuck in a local minimum of the error surface, corresponding to a lower accuracy phase. Our work paves the way for new probing methods of the intrinsic structure of DNNs in and beyond the L2 context.
Related papers
- Phase transitions reveal hierarchical structure in deep neural networks [0.0]
We show that phase transitions in Deep Neural Networks are governed by saddle points in the loss landscape.<n>We introduce a simple, fast, and easy to implement algorithm that uses the L2 regularizer as a tool to probe the geometry of error landscapes.
arXiv Detail & Related papers (2025-12-05T15:14:09Z) - Neural Collapse under Gradient Flow on Shallow ReLU Networks for Orthogonally Separable Data [52.737775129027575]
We show that gradient flow on a two-layer ReLU network for classifying orthogonally separable data provably exhibits Neural Collapse (NC)<n>We reveal the role of the implicit bias of the training dynamics in facilitating the emergence of NC.
arXiv Detail & Related papers (2025-10-24T01:36:19Z) - Generalized Linear Mode Connectivity for Transformers [87.32299363530996]
A striking phenomenon is linear mode connectivity (LMC), where independently trained models can be connected by low- or zero-loss paths.<n>Prior work has predominantly focused on neuron re-ordering through permutations, but such approaches are limited in scope.<n>We introduce a unified framework that captures four symmetry classes: permutations, semi-permutations, transformations, and general invertible maps.<n>This generalization enables, for the first time, the discovery of low- and zero-barrier linear paths between independently trained Vision Transformers and GPT-2 models.
arXiv Detail & Related papers (2025-06-28T01:46:36Z) - New Evidence of the Two-Phase Learning Dynamics of Neural Networks [59.55028392232715]
We introduce an interval-wise perspective that compares network states across a time window.<n>We show that the response of the network to a perturbation exhibits a transition from chaotic to stable.<n>We also find that after this transition point the model's functional trajectory is confined to a narrow cone-shaped subset.
arXiv Detail & Related papers (2025-05-20T04:03:52Z) - ProPINN: Demystifying Propagation Failures in Physics-Informed Neural Networks [71.02216400133858]
Physics-informed neural networks (PINNs) have earned high expectations in solving partial differential equations (PDEs)<n>Previous research observed the propagation failure phenomenon of PINNs.<n>This paper provides a formal and in-depth study of propagation failure and its root cause.
arXiv Detail & Related papers (2025-02-02T13:56:38Z) - Phase Transitions in Large Language Models and the $O(N)$ Model [0.0]
We reformulated the Transformer architecture as an $O(N)$ model to investigate phase transitions in large language models.<n>Our study reveals two distinct phase transitions corresponding to the temperature used in text generation.<n>As an application, the energy of the $O(N)$ model can be used to evaluate whether an LLM's parameters are sufficient to learn the training data.
arXiv Detail & Related papers (2025-01-27T17:36:06Z) - Large Stepsize Gradient Descent for Non-Homogeneous Two-Layer Networks: Margin Improvement and Fast Optimization [41.20978920228298]
We show that the second phase begins once the empirical risk falls below a certain threshold, dependent on the stepsize.
We also show that the normalized margin grows nearly monotonically in the second phase, demonstrating an implicit bias of GD in training non-homogeneous predictors.
Our analysis applies to networks of any width, beyond the well-known neural tangent kernel and mean-field regimes.
arXiv Detail & Related papers (2024-06-12T21:33:22Z) - Learning in PINNs: Phase transition, total diffusion, and generalization [1.8802875123957965]
We investigate the learning dynamics of fully-connected neural networks through the lens of gradient signal-to-noise ratio (SNR)
We identify a third phase termed total diffusion"
We explore the information-induced compression phenomenon, pinpointing a significant compression of activations at the total diffusion phase.
arXiv Detail & Related papers (2024-03-27T12:10:30Z) - On the Emergence of Cross-Task Linearity in the Pretraining-Finetuning Paradigm [47.55215041326702]
We discover an intriguing linear phenomenon in models that are from a common pretrained checkpoint and finetuned on different tasks, termed as Cross-Task Linearity (CTL)
We show that if we linearly interpolate the weights of two finetuned models, the features in the weight-interpolated model are often approximately equal to the linearities of features in two finetuned models at each layer.
We conjecture that in the pretraining-finetuning paradigm, neural networks approximately function as linear maps, mapping from the parameter space to the feature space.
arXiv Detail & Related papers (2024-02-06T03:28:36Z) - Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z) - Reparameterization through Spatial Gradient Scaling [69.27487006953852]
Reparameterization aims to improve the generalization of deep neural networks by transforming convolutional layers into equivalent multi-branched structures during training.
We present a novel spatial gradient scaling method to redistribute learning focus among weights in convolutional networks.
arXiv Detail & Related papers (2023-03-05T17:57:33Z) - An Adaptive and Stability-Promoting Layerwise Training Approach for Sparse Deep Neural Network Architecture [0.0]
This work presents a two-stage adaptive framework for developing deep neural network (DNN) architectures that generalize well for a given training data set.
In the first stage, a layerwise training approach is adopted where a new layer is added each time and trained independently by freezing parameters in the previous layers.
We introduce a epsilon-delta stability-promoting concept as a desirable property for a learning algorithm and show that employing manifold regularization yields a epsilon-delta stability-promoting algorithm.
arXiv Detail & Related papers (2022-11-13T09:51:16Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that leakyally, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Mixed Graph Contrastive Network for Semi-Supervised Node Classification [63.924129159538076]
We propose a novel graph contrastive learning method, termed Mixed Graph Contrastive Network (MGCN)<n>In our method, we improve the discriminative capability of the latent embeddings by an unperturbed augmentation strategy and a correlation reduction mechanism.<n>By combining the two settings, we extract rich supervision information from both the abundant nodes and the rare yet valuable labeled nodes for discriminative representation learning.
arXiv Detail & Related papers (2022-06-06T14:26:34Z) - Exact Phase Transitions in Deep Learning [5.33024001730262]
We prove that the competition between prediction error and model complexity in the training loss leads to the second-order phase transition for nets with one hidden layer and the first-order phase transition for nets with more than one hidden layer.
The proposed theory is directly relevant to the optimization of neural networks and points to an origin of the posterior collapse problem in Bayesian deep learning.
arXiv Detail & Related papers (2022-05-25T06:00:34Z) - Phase diagram of Stochastic Gradient Descent in high-dimensional
two-layer neural networks [22.823904789355495]
We investigate the connection between the mean-fieldhydrodynamic regime and the seminal approach of Saad & Solla.
Our work builds on a deterministic description of rates in high-dimensionals from statistical physics.
arXiv Detail & Related papers (2022-02-01T09:45:07Z) - Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z) - Dual-constrained Deep Semi-Supervised Coupled Factorization Network with
Enriched Prior [80.5637175255349]
We propose a new enriched prior based Dual-constrained Deep Semi-Supervised Coupled Factorization Network, called DS2CF-Net.
To ex-tract hidden deep features, DS2CF-Net is modeled as a deep-structure and geometrical structure-constrained neural network.
Our network can obtain state-of-the-art performance for representation learning and clustering.
arXiv Detail & Related papers (2020-09-08T13:10:21Z) - Phase diagram for two-layer ReLU neural networks at infinite-width limit [6.380166265263755]
We draw the phase diagram for the two-layer ReLU neural network at the infinite-width limit.
We identify three regimes in the phase diagram, i.e., linear regime, critical regime and condensed regime.
In the linear regime, NN training dynamics is approximately linear similar to a random feature model with an exponential loss decay.
In the condensed regime, we demonstrate through experiments that active neurons are condensed at several discrete orientations.
arXiv Detail & Related papers (2020-07-15T06:04:35Z) - Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems with one variable under the sparsity constraint.
arXiv Detail & Related papers (2020-06-16T13:41:54Z) - Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.