Dichotomy of Feature Learning and Unlearning: Fast-Slow Analysis on Neural Networks with Stochastic Gradient Descent
- URL: http://arxiv.org/abs/2602.07378v1
- Date: Sat, 07 Feb 2026 05:50:44 GMT
- Title: Dichotomy of Feature Learning and Unlearning: Fast-Slow Analysis on Neural Networks with Stochastic Gradient Descent
- Authors: Shota Imai, Sota Nishiyama, Masaaki Imaizumi
- Abstract summary: In this study, we consider the infinite-width limit of a two-layer neural network updated with a large-batch gradient, then derive differential equations with different time scales. The direction of the flow on a critical manifold, determined by the slow dynamics, decides whether feature unlearning occurs. Our results yield the following insights: (i) the strength of the primary nonlinear term in the data induces feature unlearning, and (ii) the initial scale of the second-layer weights mitigates feature unlearning.
- Score: 5.52
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The dynamics of gradient-based training in neural networks often exhibit nontrivial structures; hence, understanding them remains a central challenge in theoretical machine learning. In particular, the concept of feature unlearning, in which a neural network progressively loses previously learned features over long training, has gained attention. In this study, we consider the infinite-width limit of a two-layer neural network updated with a large-batch stochastic gradient, and then derive differential equations with different time scales, revealing the mechanism and conditions under which feature unlearning occurs. Specifically, we exploit the fast-slow dynamics: the alignment of the first-layer weights develops rapidly, while the second-layer weights develop slowly. The direction of the flow on a critical manifold, determined by the slow dynamics, decides whether feature unlearning occurs. We validate the result numerically, and derive a theoretical grounding and scaling laws for feature unlearning. Our results yield the following insights: (i) the strength of the primary nonlinear term in the data induces feature unlearning, and (ii) the initial scale of the second-layer weights mitigates feature unlearning. Technically, our analysis utilizes Tensor Programs and singular perturbation theory.
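To fix ideas, the fast-slow structure can be written in the standard singularly perturbed form from the singular perturbation literature; the symbols below (w for the fast first-layer alignment, a for the slow second-layer weights, and the timescale ratio epsilon) are illustrative stand-ins, not the paper's exact equations:

```latex
% Standard fast-slow (singularly perturbed) form; w models the fast
% first-layer alignment, a the slow second-layer weights, and
% 0 < \epsilon \ll 1 the timescale ratio (all symbols illustrative).
\begin{align}
  \epsilon\,\dot{w} &= f(w, a) && \text{(fast dynamics)} \\
  \dot{a}           &= g(w, a) && \text{(slow dynamics)}
\end{align}
% As \epsilon \to 0, w relaxes onto the critical manifold
% \mathcal{M} = \{ (w, a) : f(w, a) = 0 \}, and the long-time behavior
% is governed by the reduced slow flow on \mathcal{M}:
\begin{equation}
  \dot{a} = g\bigl(w^{*}(a), a\bigr),
  \qquad f\bigl(w^{*}(a), a\bigr) = 0 .
\end{equation}
```

In this reading, whether the reduced slow flow carries a toward or away from the learned-feature configuration is what separates feature learning from feature unlearning.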
Related papers
- Explaining Grokking and Information Bottleneck through Neural Collapse Emergence [33.22494588674352]
We present a unified explanation of grokking and the information bottleneck principle through the lens of neural collapse. We show that the contraction of the population within-class variance is a key factor underlying both grokking and the information bottleneck. By analyzing the dynamics of neural collapse, we show that the distinct time scales of fitting the training set and of the progression of neural collapse account for these late-phase phenomena.
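As one concrete way to track this quantity, a standard NC1-style measurement of within-class versus between-class variance can be sketched as follows; the normalization and the function name are illustrative and may differ from the paper's exact metric:

```python
# Hedged sketch: measuring the contraction of within-class variance
# relative to between-class variance (an NC1-style quantity).
import numpy as np

def within_class_variance_ratio(features, labels):
    """Tr(Sigma_W) / Tr(Sigma_B) for features of shape (N, d)."""
    classes = np.unique(labels)
    global_mean = features.mean(axis=0)
    d = features.shape[1]
    sigma_w = np.zeros((d, d))  # within-class covariance
    sigma_b = np.zeros((d, d))  # between-class covariance
    for c in classes:
        fc = features[labels == c]
        mu_c = fc.mean(axis=0)
        centered = fc - mu_c
        sigma_w += centered.T @ centered / len(features)
        diff = (mu_c - global_mean)[:, None]
        sigma_b += (len(fc) / len(features)) * (diff @ diff.T)
    return np.trace(sigma_w) / (np.trace(sigma_b) + 1e-12)

# As training proceeds past interpolation, a decreasing ratio would
# indicate neural collapse progressing on its slower timescale.
```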
arXiv Detail & Related papers (2025-09-25T07:17:41Z)
- A Dynamical Systems Perspective on the Analysis of Neural Networks [0.0]
We utilize dynamical systems theory to analyze several aspects of machine learning algorithms. We demonstrate how a variety of challenges from deep neural networks, (stochastic) gradient descent, and related topics can be re-formulated as dynamical statements.
arXiv Detail & Related papers (2025-07-07T16:18:49Z)
- Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks [10.591718074748895]
We study the learning dynamics of large two-layer neural networks via dynamical mean-field theory. For large network width $m$ and a large number of samples per input dimension $n/d$, the training dynamics exhibit a separation of timescales.
arXiv Detail & Related papers (2025-02-28T17:45:26Z)
- A ghost mechanism: An analytical model of abrupt learning [6.509233267425589]
We show how even a one-dimensional system can exhibit abrupt learning through ghost points rather than bifurcations. Our model reveals a bifurcation-free mechanism for abrupt learning and illustrates the importance of both deliberate uncertainty and redundancy in stabilizing learning dynamics.
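For context, a textbook illustration of such a ghost, not necessarily the paper's model, is the scalar flow left just past a saddle-node bifurcation:

```latex
% Scalar flow just past a saddle-node bifurcation (0 < r \ll 1):
% no fixed point exists, but the "ghost" of the annihilated pair
% near x = 0 traps the trajectory for a long time.
\begin{equation}
  \dot{x} = r + x^{2}
\end{equation}
% Passage time through the bottleneck from x = -\infty to x = +\infty:
\begin{equation}
  T = \int_{-\infty}^{\infty} \frac{dx}{r + x^{2}}
    = \frac{\pi}{\sqrt{r}} \longrightarrow \infty
  \quad \text{as } r \to 0^{+},
\end{equation}
% so the dynamics show a long plateau followed by an abrupt escape,
% mimicking abrupt learning without any bifurcation being crossed.
```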
arXiv Detail & Related papers (2025-01-04T20:49:20Z)
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities for analyzing closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z)
- Connecting NTK and NNGP: A Unified Theoretical Framework for Wide Neural Network Learning Dynamics [6.349503549199403]
We provide a comprehensive framework for the learning process of deep wide neural networks. By characterizing the diffusive phase, our work sheds light on representational drift in the brain.
arXiv Detail & Related papers (2023-09-08T18:00:01Z)
- Critical Learning Periods for Multisensory Integration in Deep Networks [112.40005682521638]
We show that the ability of a neural network to integrate information from diverse sources hinges critically on being exposed to properly correlated signals during the early phases of training.
We show that critical periods arise from the complex and unstable early transient dynamics, which are decisive for the final performance of the trained system and its learned representations.
arXiv Detail & Related papers (2022-10-06T23:50:38Z)
- Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z)
- Dynamic Neural Diversification: Path to Computationally Sustainable Neural Networks [68.8204255655161]
Small neural networks with a constrained number of trainable parameters can be suitable resource-efficient candidates for many simple tasks.
We explore the diversity of the neurons within the hidden layer during the learning process.
We analyze how the diversity of the neurons affects predictions of the model.
arXiv Detail & Related papers (2021-09-20T15:12:16Z)
- Gradient Starvation: A Learning Proclivity in Neural Networks [97.02382916372594]
Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task.
This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks.
arXiv Detail & Related papers (2020-11-18T18:52:08Z)
- The large learning rate phase of deep learning: the catapult mechanism [50.23041928811575]
We present a class of neural networks with solvable training dynamics.
We find good agreement between our model's predictions and training dynamics in realistic deep learning settings.
We believe our results shed light on characteristics of models trained at different learning rates.
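As a minimal sketch of why the learning rate threshold matters, consider one-dimensional gradient descent on a quadratic (a standard linearized picture, not the paper's full solvable model):

```latex
% Gradient descent on a one-dimensional quadratic with curvature \lambda:
\begin{equation}
  \theta_{t+1} = (1 - \eta\lambda)\,\theta_t ,
\end{equation}
% which is stable only for \eta < 2/\lambda. In the catapult phase,
% \eta exceeds this linear-stability threshold: the loss first grows,
% but the growth drives the effective curvature \lambda_t down until
% \eta < 2/\lambda_t, after which training re-converges in a flatter
% region of the loss landscape.
```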
arXiv Detail & Related papers (2020-03-04T17:52:48Z)
- Understanding and mitigating gradient pathologies in physics-informed neural networks [2.1485350418225244]
This work focuses on the effectiveness of physics-informed neural networks in predicting outcomes of physical systems and discovering hidden physics from noisy data.
We present a learning rate annealing algorithm that utilizes gradient statistics during model training to balance the interplay between different terms in composite loss functions (see the sketch after this entry).
We also propose a novel neural network architecture that is more resilient to such gradient pathologies.
arXiv Detail & Related papers (2020-01-13T21:23:49Z)
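A minimal sketch of gradient-statistics-based loss balancing in this spirit; the update rule, the helper names, and the smoothing constant `alpha` are illustrative assumptions rather than the paper's exact algorithm:

```python
# Hedged sketch: rebalance a composite loss
#   L = L_residual + lam * L_boundary
# using gradient statistics, so the boundary term's gradients are not
# dwarfed by the residual term's gradients.
import torch

def grad_abs(loss, params):
    """Absolute values of dloss/dtheta, flattened across all parameters."""
    grads = torch.autograd.grad(loss, params, retain_graph=True,
                                allow_unused=True)
    return torch.cat([g.reshape(-1).abs() for g in grads if g is not None])

def update_weight(residual_loss, boundary_loss, params, lam, alpha=0.9):
    """One annealing step: exponential moving average toward the
    instantaneous gradient-ratio estimate of the balancing weight."""
    g_res = grad_abs(residual_loss, params)
    g_bnd = grad_abs(boundary_loss, params)
    lam_hat = g_res.max() / (g_bnd.mean() + 1e-8)  # instantaneous estimate
    return alpha * lam + (1.0 - alpha) * lam_hat.item()

# Usage inside a training loop (model, residual_loss, boundary_loss assumed):
#   lam = update_weight(residual_loss, boundary_loss,
#                       list(model.parameters()), lam)
#   loss = residual_loss + lam * boundary_loss
#   loss.backward()
```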