Explaining Grokking and Information Bottleneck through Neural Collapse Emergence
- URL: http://arxiv.org/abs/2509.20829v1
- Date: Thu, 25 Sep 2025 07:17:41 GMT
- Title: Explaining Grokking and Information Bottleneck through Neural Collapse Emergence
- Authors: Keitaro Sakamoto, Issei Sato
- Abstract summary: We present a unified explanation of grokking and the information bottleneck principle through the lens of neural collapse. We show that the contraction of population within-class variance is a key factor underlying both grokking and the information bottleneck, and relate this measure to the neural collapse measure defined on the training set. By analyzing the dynamics of neural collapse, we show that distinct time scales between fitting the training set and the progression of neural collapse account for these late-phase phenomena.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The training dynamics of deep neural networks often defy expectations, even as these models form the foundation of modern machine learning. Two prominent examples are grokking, where test performance improves abruptly long after the training loss has plateaued, and the information bottleneck principle, where models progressively discard input information irrelevant to the prediction task as training proceeds. However, the mechanisms underlying these phenomena and their relations remain poorly understood. In this work, we present a unified explanation of such late-phase phenomena through the lens of neural collapse, which characterizes the geometry of learned representations. We show that the contraction of population within-class variance is a key factor underlying both grokking and information bottleneck, and relate this measure to the neural collapse measure defined on the training set. By analyzing the dynamics of neural collapse, we show that distinct time scales between fitting the training set and the progression of neural collapse account for the behavior of the late-phase phenomena. Finally, we validate our theoretical findings on multiple datasets and architectures.
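To make the central quantity concrete, the following is a minimal sketch (not the authors' implementation) of the standard neural collapse within-class variability statistic, commonly written as Tr(Σ_W Σ_B^†)/K, computed from penultimate-layer features. The function name and the NumPy-array interface are assumptions for illustration; the statistic itself contracts toward zero as neural collapse progresses.

```python
import numpy as np

def within_class_collapse(features: np.ndarray, labels: np.ndarray) -> float:
    """NC1-style statistic Tr(Sigma_W @ pinv(Sigma_B)) / K.

    features: (N, D) array of penultimate-layer activations
    labels:   (N,) array of integer class labels
    The returned value shrinks toward 0 as within-class variance collapses.
    """
    classes = np.unique(labels)
    n_total, dim = features.shape
    global_mean = features.mean(axis=0)

    sigma_w = np.zeros((dim, dim))  # within-class covariance
    sigma_b = np.zeros((dim, dim))  # between-class covariance
    for c in classes:
        class_feats = features[labels == c]
        mu_c = class_feats.mean(axis=0)
        centered = class_feats - mu_c
        sigma_w += centered.T @ centered / n_total
        diff = (mu_c - global_mean)[:, None]
        sigma_b += (len(class_feats) / n_total) * (diff @ diff.T)

    # Pseudo-inverse handles the rank-deficient between-class covariance.
    return float(np.trace(sigma_w @ np.linalg.pinv(sigma_b)) / len(classes))
```

Tracking such a statistic on held-out data alongside the training loss would surface the two time scales described above: the loss plateaus early, while within-class variance keeps contracting long afterward.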
Related papers
- Dichotomy of Feature Learning and Unlearning: Fast-Slow Analysis on Neural Networks with Stochastic Gradient Descent
In this study, we consider the infinite-width limit of a two-layer neural network updated with a large-batch gradient and derive differential equations with different time scales. The direction of the flow on a critical manifold, determined by the slow dynamics, decides whether feature unlearning occurs. Our results yield the following insights: (i) the strength of the primary nonlinear term in the data induces feature unlearning, and (ii) the initial scale of the second-layer weights mitigates feature unlearning.
arXiv Detail & Related papers (2026-02-07T05:50:44Z) - Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks [59.552873049024775]
We show that compute-optimally trained models exhibit a remarkably precise universality. With learning rate decay, the collapse becomes so tight that differences in the normalized curves across models fall below the noise floor. We explain these phenomena by connecting collapse to the power-law structure in typical neural scaling laws.
arXiv Detail & Related papers (2025-07-02T20:03:34Z) - The emergence of sparse attention: impact of data distribution and benefits of repetition [14.652502263025882]
We study the emergence over training of sparse attention, a critical and frequently observed attention pattern in Transformers. By combining theoretical analysis of a toy model with empirical observations on small Transformers trained on a linear regression variant, we uncover the mechanics of sparse attention emergence. Our findings provide a simple, theoretically grounded framework for understanding how data distributions and model design influence the learning dynamics behind one form of emergence.
arXiv Detail & Related papers (2025-05-23T13:14:02Z) - Do Mice Grok? Glimpses of Hidden Progress During Overtraining in Sensory Cortex [32.79706360108185]
We find evidence for such learning in mice following continued training on a task, long after behavior saturates at near-ceiling performance ("overtraining"). We demonstrate that class representations in cortex continue to separate during overtraining, so that examples that were incorrectly classified at the beginning of overtraining can abruptly be correctly classified later on, despite no changes in behavior during that time. We conclude by showing how this model of late-time feature learning implies an explanation for the empirical puzzle of overtraining reversal in animal learning.
arXiv Detail & Related papers (2024-11-05T22:42:49Z) - Deep Learning Through A Telescoping Lens: A Simple Model Provides Empirical Insights On Grokking, Gradient Boosting & Beyond [61.18736646013446]
In pursuit of a deeper understanding of deep learning's surprising behaviors, we investigate the utility of a simple yet accurate model of a trained neural network.
Across three case studies, we illustrate how it can be applied to derive new empirical insights on a diverse range of prominent phenomena.
arXiv Detail & Related papers (2024-10-31T22:54:34Z) - Data-driven emergence of convolutional structure in neural networks [83.4920717252233]
We show how fully-connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs.
By carefully designing data models, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs.
arXiv Detail & Related papers (2022-02-01T17:11:13Z) - Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z) - Gradient Starvation: A Learning Proclivity in Neural Networks [97.02382916372594]
Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task.
This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks.
arXiv Detail & Related papers (2020-11-18T18:52:08Z) - The large learning rate phase of deep learning: the catapult mechanism [50.23041928811575]
We present a class of neural networks with solvable training dynamics.
We find good agreement between our model's predictions and training dynamics in realistic deep learning settings.
We believe our results shed light on characteristics of models trained at different learning rates.
arXiv Detail & Related papers (2020-03-04T17:52:48Z)