$\mathbf{Li_2}$: A Framework on Dynamics of Feature Emergence and Delayed Generalization
- URL: http://arxiv.org/abs/2509.21519v2
- Date: Mon, 29 Sep 2025 17:29:44 GMT
- Title: $\mathbf{Li_2}$: A Framework on Dynamics of Feature Emergence and Delayed Generalization
- Authors: Yuandong Tian
- Abstract summary: We propose a novel framework to characterize what kind of features will emerge, and how and under which conditions this happens during training, for complex structured inputs. We study whether these local-optima-induced features are generalizable, their representation power, and how they change with sample size, in group arithmetic tasks.
- Score: 44.614763110719274
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While the phenomenon of grokking, i.e., delayed generalization, has been studied extensively, it remains an open problem whether there is a mathematical framework that characterizes what kind of features will emerge, and how and under which conditions this happens during training, for complex structured inputs. We propose a novel framework, named $\mathbf{Li_2}$, that captures three key stages of the grokking behavior of 2-layer nonlinear networks: (I) \underline{\textbf{L}}azy learning, (II) \underline{\textbf{i}}ndependent feature learning and (III) \underline{\textbf{i}}nteractive feature learning. In the lazy learning stage, the top layer overfits to the random hidden representation and the model appears to memorize. Thanks to lazy learning and weight decay, the \emph{backpropagated gradient} $G_F$ from the top layer then carries information about the target label, with a specific structure that enables each hidden node to learn its representation \emph{independently}. Interestingly, these independent dynamics follow exactly the \emph{gradient ascent} of an energy function $E$, whose local maxima are precisely the emerging features. We study whether these local-optima-induced features are generalizable, what their representation power is, and how they change with sample size, in group arithmetic tasks. When hidden nodes start to interact in the later stage of learning, we provably show how $G_F$ changes to focus on the missing features that remain to be learned. Our study sheds light on the roles played by key hyperparameters such as weight decay, learning rate, and sample size in grokking, leads to provable scaling laws of memorization and generalization, and reveals, from the first principles of gradient dynamics, the underlying reason why recent optimizers such as Muon can be effective. Our analysis can be extended to multi-layer architectures.
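To make the setup concrete, here is a minimal sketch of the kind of grokking experiment the paper analyzes: a 2-layer network on modular addition (a group arithmetic task), trained with weight decay so that test accuracy jumps long after training accuracy saturates. All hyperparameters (modulus, width, learning rate, weight-decay strength, train fraction, step count) are illustrative assumptions, not the paper's exact setup.

```python
# Minimal grokking sketch (illustrative, not the paper's exact setup):
# a 2-layer MLP learns z = (x + y) mod p; with weight decay, test accuracy
# typically rises long after training accuracy saturates.
import torch
import torch.nn as nn

torch.manual_seed(0)
p, hidden, frac_train = 97, 512, 0.4

# All (x, y) pairs, one-hot encoding the two operands side by side.
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
X = torch.cat([nn.functional.one_hot(pairs[:, 0], p),
               nn.functional.one_hot(pairs[:, 1], p)], dim=1).float()
y = (pairs[:, 0] + pairs[:, 1]) % p

perm = torch.randperm(len(X))
n_train = int(frac_train * len(X))
tr, te = perm[:n_train], perm[n_train:]

model = nn.Sequential(nn.Linear(2 * p, hidden), nn.ReLU(), nn.Linear(hidden, p))
# Weight decay is the key knob: per the abstract, it helps move the dynamics
# out of the lazy (memorization) regime into feature learning.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(20001):
    opt.zero_grad()
    loss_fn(model(X[tr]), y[tr]).backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            acc_tr = (model(X[tr]).argmax(-1) == y[tr]).float().mean()
            acc_te = (model(X[te]).argmax(-1) == y[te]).float().mean()
        print(f"step {step:6d}  train acc {acc_tr:.3f}  test acc {acc_te:.3f}")
```

Varying `weight_decay`, the learning rate, or `frac_train` in this sketch is a quick way to probe the hyperparameter roles the abstract refers to.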
Related papers
- CloDS: Visual-Only Unsupervised Cloth Dynamics Learning in Unknown Conditions [36.41201675940166]
We introduce Cloth Dynamics Grounding (CDG), a novel scenario for unsupervised learning of cloth dynamics from multi-view visual observations. We propose Cloth Dynamics Splatting (CloDS), an unsupervised dynamic learning framework designed for CDG. CloDS adopts a three-stage pipeline that first performs video-to-geometry grounding and then trains a dynamics model on the grounded meshes.
arXiv Detail & Related papers (2026-02-02T09:16:16Z) - Neural Networks Learn Generic Multi-Index Models Near Information-Theoretic Limit [66.20349460098275]
We study the gradient descent learning of a general Gaussian multi-index model $f(\boldsymbol{x}) = g(\boldsymbol{U}\boldsymbol{x})$ with hidden subspace $\boldsymbol{U} \in \mathbb{R}^{r \times d}$. We prove that under generic non-degeneracy assumptions on the link function, a standard two-layer neural network trained via layer-wise gradient descent can agnostically learn the target with $o_d(1)$ test error.
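As a concrete illustration of this setting (a sketch under assumed choices, not the paper's protocol), the snippet below samples data from a Gaussian multi-index model $f(\boldsymbol{x}) = g(\boldsymbol{U}\boldsymbol{x})$ and trains a two-layer network layer-wise; the link $g$, the dimensions, and the schedule are all invented for illustration.

```python
# Sketch of the multi-index setting: y = g(Ux) with Gaussian inputs, learned
# by a two-layer net trained layer-wise (first layer, then the head).
# g, r, d, and the training schedule are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, r, n, width = 64, 2, 8192, 256
U = torch.linalg.qr(torch.randn(d, r))[0].T          # hidden subspace, r x d

def g(z):                                            # an arbitrary smooth link
    return torch.tanh(z[:, 0]) * torch.cos(z[:, 1])

X = torch.randn(n, d)
y = g(X @ U.T)

net = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 1))
loss_fn = nn.MSELoss()

# Layer-wise: phase 1 updates only the first layer, phase 2 only the head.
for params, steps in [(net[0].parameters(), 2000), (net[2].parameters(), 2000)]:
    opt = torch.optim.Adam(params, lr=1e-3)
    for _ in range(steps):
        net.zero_grad()
        loss = loss_fn(net(X).squeeze(-1), y)
        loss.backward()
        opt.step()
print(f"final train MSE: {loss.item():.4f}")
```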
arXiv Detail & Related papers (2025-11-19T04:46:47Z) - Beyond Softmax: A Natural Parameterization for Categorical Random Variables [61.709831225296305]
We introduce the $\textit{catnat}$ function, a function composed of a sequence of hierarchical binary splits. A rich set of experiments shows that the proposed function improves learning efficiency and yields models with consistently higher test performance.
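The paper's exact construction is not reproduced here, but the general idea of replacing a flat softmax with a sequence of hierarchical binary splits can be sketched as follows: each class probability is the product of sigmoid branch probabilities along a root-to-leaf path in a binary tree (shown for $K$ a power of two; `hierarchical_probs` is a hypothetical helper name).

```python
# Sketch: parameterize a categorical over K = 2^L classes by K - 1 binary
# split logits instead of a flat softmax. Each class probability is a product
# of sigmoid branch probabilities along its root-to-leaf path.
import torch

def hierarchical_probs(logits: torch.Tensor) -> torch.Tensor:
    """logits: (..., K-1) split logits of a complete binary tree over K leaves."""
    K = logits.shape[-1] + 1
    probs = torch.ones(*logits.shape[:-1], 1)
    node = 0
    while probs.shape[-1] < K:                 # expand the tree level by level
        n = probs.shape[-1]
        s = torch.sigmoid(logits[..., node:node + n])   # P(go right) per node
        probs = torch.stack([probs * (1 - s), probs * s], dim=-1).flatten(-2)
        node += n
    return probs

p = hierarchical_probs(torch.randn(7))         # K = 8 classes, 7 split logits
print(p, p.sum())                              # sums to 1 by construction
```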
arXiv Detail & Related papers (2025-09-29T12:55:50Z) - H$^3$DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning [25.65324419553667]
We introduce the $\textbf{Triply-Hierarchical Diffusion Policy}$ ($\textbf{H}^3\textbf{DP}$), a novel visuomotor learning framework that explicitly incorporates hierarchical structures to strengthen the integration between visual features and action generation. Extensive experiments demonstrate that H$^3$DP yields a $+27.5\%$ average relative improvement over baselines across $44$ simulation tasks and achieves superior performance in $4$ challenging bimanual real-world manipulation tasks.
arXiv Detail & Related papers (2025-05-12T17:59:43Z) - Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $μ$P Parametrization [66.03821840425539]
In this paper, we investigate the training dynamics of $L$-layer neural networks trained by stochastic gradient descent (SGD), using the tensor program framework. We show that SGD enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum.
arXiv Detail & Related papers (2025-03-12T17:33:13Z) - The Optimization Landscape of SGD Across the Feature Learning Strength [102.1353410293931]
We study the effect of scaling $\gamma$ across a variety of models and datasets in the online training setting. We find that optimal online performance is often found at large $\gamma$. Our findings indicate that analytical study of the large-$\gamma$ limit may yield useful insights into the dynamics of representation learning in performant models.
arXiv Detail & Related papers (2024-10-06T22:30:14Z) - Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z) - Repetita Iuvant: Data Repetition Allows SGD to Learn High-Dimensional Multi-Index Functions [20.036783417617652]
We investigate the training dynamics of two-layer shallow neural networks trained with gradient-based algorithms. We show that a simple modification of the idealized single-pass gradient descent training scenario drastically improves its computational efficiency. Our results highlight the ability of networks to learn relevant structures from data alone without any pre-processing.
arXiv Detail & Related papers (2024-05-24T11:34:31Z) - How Graph Neural Networks Learn: Lessons from Training Dynamics [80.41778059014393]
We study the training dynamics in function space of graph neural networks (GNNs).
We find that the gradient descent optimization of GNNs implicitly leverages the graph structure to update the learned function.
This finding offers new interpretable insights into when and why the learned GNN functions generalize.
arXiv Detail & Related papers (2023-10-08T10:19:56Z) - On Single Index Models beyond Gaussian Data [45.875461749455994]
Sparse high-dimensional functions have arisen as a rich framework to study the behavior of gradient-descent methods.
In this work, we explore extensions of this picture beyond the Gaussian setting, where stability or symmetry might be violated.
Our main results establish that Gradient Descent can efficiently recover the unknown direction $\theta^*$ in the high-dimensional regime.
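A toy version of this recovery result can be sketched as follows; note that Gaussian data is used purely for simplicity (the paper's point is precisely about relaxing it), and the link function, step size, and dimensions are illustrative assumptions.

```python
# Sketch: online SGD on a single neuron recovering the hidden direction
# theta* of a single-index model y = g(<theta*, x>). Gaussian inputs and
# g = tanh are illustrative choices, not the paper's setting.
import torch

torch.manual_seed(0)
d, lr, steps = 128, 0.05, 20000
theta_star = torch.randn(d)
theta_star /= theta_star.norm()

theta = (0.1 * torch.randn(d)).requires_grad_()
for _ in range(steps):
    x = torch.randn(d)                       # fresh sample each step (one-pass)
    y = torch.tanh(theta_star @ x)
    loss = (torch.tanh(theta @ x) - y) ** 2
    loss.backward()
    with torch.no_grad():
        theta -= lr * theta.grad
        theta.grad.zero_()

with torch.no_grad():
    overlap = (theta / theta.norm()) @ theta_star
print(f"overlap with theta*: {overlap.abs().item():.3f}")  # typically near 1
```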
arXiv Detail & Related papers (2023-07-28T20:52:22Z) - Hierarchical Learning in Euclidean Neural Networks [0.0]
We study the role of higher order (non-scalar) features in Euclidean Neural Networks (\texttt{e3nn}).
We find a natural hierarchy of features by $l$, reminiscent of a multipole expansion.
arXiv Detail & Related papers (2022-10-10T15:26:00Z) - Neural networks behave as hash encoders: An empirical study [79.38436088982283]
The input space of a neural network with ReLU-like activations is partitioned into multiple linear regions.
We demonstrate that this partition exhibits the following encoding properties across a variety of deep learning models.
Simple algorithms, such as $K$-Means, $K$-NN, and logistic regression, can achieve fairly good performance on both training and test data.
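A rough illustration of this "hash encoder" reading follows; it uses a random (untrained) network and synthetic data, so the numbers will differ from the paper's trained-model study, and all dataset and width choices here are assumptions.

```python
# Sketch: read off each sample's binary ReLU activation pattern ("hash code")
# from a random 2-hidden-layer net, then fit logistic regression on the codes.
import torch
import torch.nn as nn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

torch.manual_seed(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

lin1, lin2 = nn.Linear(20, 64), nn.Linear(64, 64)
with torch.no_grad():
    h1 = lin1(torch.tensor(X, dtype=torch.float32))
    h2 = lin2(torch.relu(h1))
codes = torch.cat([h1 > 0, h2 > 0], dim=1).numpy()   # 0/1 activation pattern

Xtr, Xte, ytr, yte = train_test_split(codes, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
print(f"test accuracy on activation codes: {clf.score(Xte, yte):.3f}")
```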
arXiv Detail & Related papers (2021-01-14T07:50:40Z) - Backward Feature Correction: How Deep Learning Performs Deep (Hierarchical) Learning [66.05472746340142]
This paper analyzes how multi-layer neural networks can perform hierarchical learning _efficiently_ and _automatically_ by SGD on the training objective.
We establish a new principle called "backward feature correction", where the errors in the lower-level features can be automatically corrected when training together with the higher-level layers.
arXiv Detail & Related papers (2020-01-13T17:28:29Z)