Is Grokking a Computational Glass Relaxation?
- URL: http://arxiv.org/abs/2505.11411v1
- Date: Fri, 16 May 2025 16:20:02 GMT
- Title: Is Grokking a Computational Glass Relaxation?
- Authors: Xiaotian Zhang, Yue Shang, Entao Yang, Ge Zhang
- Abstract summary: We study the phenomenon of grokking, where neural networks abruptly generalize long after training performance reaches a near-perfect level. We find NO entropy barrier in the memorization-to-generalization transition of grokking, challenging previous theory that defines grokking as a first-order phase transition. Inspired by grokking's far-from-equilibrium nature, we develop a toy optimizer WanD based on Wang-Landau molecular dynamics, which can eliminate grokking without any constraints and find high-norm generalizing solutions.
- Score: 7.193952396909214
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Understanding the generalizability of neural networks (NNs) remains a central question in deep learning research. The special phenomenon of grokking, where NNs abruptly generalize long after training performance reaches a near-perfect level, offers a unique window into the underlying mechanisms of NN generalizability. Here we propose an interpretation of grokking by framing it as a computational glass relaxation: viewing an NN as a physical system whose parameters are the degrees of freedom and whose train loss is the system energy, we find that the memorization process resembles a rapid cooling of a liquid into a non-equilibrium glassy state at low temperature, and the later generalization is like a slow relaxation towards a more stable configuration. This mapping enables us to sample the NN's Boltzmann entropy (density of states) landscape as a function of training loss and test accuracy. Our experiments with transformers on arithmetic tasks suggest that there is NO entropy barrier in the memorization-to-generalization transition of grokking, challenging previous theory that defines grokking as a first-order phase transition. We identify a high-entropy advantage under grokking, an extension of prior work linking entropy to generalizability but much more significant. Inspired by grokking's far-from-equilibrium nature, we develop a toy optimizer WanD based on Wang-Landau molecular dynamics, which can eliminate grokking without any constraints and find high-norm generalizing solutions. This provides strictly-defined counterexamples to theory attributing grokking solely to weight-norm evolution towards the Goldilocks zone and also suggests new potential directions for optimizer design.
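To make the glass-relaxation framing concrete, the sketch below is a minimal, hypothetical Wang-Landau random-walk sampler over a discretized training-loss ("energy") axis for a tiny NumPy MLP. It only illustrates the flat-histogram idea of estimating a density of states over the train loss; the paper's actual WanD optimizer is based on Wang-Landau molecular dynamics and is applied to transformers on arithmetic tasks, and every name, architecture, and hyperparameter here is a placeholder.

```python
# Illustrative sketch (NOT the paper's WanD optimizer): Wang-Landau sampling of
# an estimated density of states over a discretized training-loss "energy" axis.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: XOR, so even a 2-4-1 MLP has a nontrivial loss landscape.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

def unpack(theta):
    """Split a flat parameter vector into 2-4-1 MLP weights."""
    W1 = theta[:8].reshape(2, 4); b1 = theta[8:12]
    W2 = theta[12:16].reshape(4, 1); b2 = theta[16]
    return W1, b1, W2, b2

def energy(theta):
    """'System energy' = training MSE loss, as in the glass analogy."""
    W1, b1, W2, b2 = unpack(theta)
    h = np.tanh(X @ W1 + b1)
    pred = (h @ W2).ravel() + b2
    return np.mean((pred - y) ** 2)

# Discretize the loss axis into bins; losses above E_MAX fall into the top bin.
E_MIN, E_MAX, N_BINS = 0.0, 1.0, 50
def bin_of(E):
    return int(np.clip((E - E_MIN) / (E_MAX - E_MIN) * N_BINS, 0, N_BINS - 1))

log_g = np.zeros(N_BINS)   # running estimate of ln(density of states)
hist = np.zeros(N_BINS)    # visit histogram for the flatness check
ln_f = 1.0                 # Wang-Landau modification factor

theta = rng.normal(0, 0.5, size=17)
E = energy(theta)

for sweep in range(20000):
    # Random-walk proposal in parameter space (the degrees of freedom).
    prop = theta + rng.normal(0, 0.05, size=theta.shape)
    E_new = energy(prop)
    b_old, b_new = bin_of(E), bin_of(E_new)
    # Flat-histogram rule: accept with probability g(E_old) / g(E_new).
    if np.log(rng.random()) < log_g[b_old] - log_g[b_new]:
        theta, E = prop, E_new
    b = bin_of(E)
    log_g[b] += ln_f
    hist[b] += 1
    # Once the histogram over visited bins is roughly flat, refine ln_f.
    visited = hist > 0
    if sweep % 2000 == 1999 and hist[visited].min() > 0.8 * hist[visited].mean():
        ln_f *= 0.5
        hist[:] = 0

print("estimated ln g(E) over visited loss bins:", np.round(log_g[log_g > 0], 1))
```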
Related papers
- NeuralGrok: Accelerate Grokking by Neural Gradient Transformation [54.65707216563953]
We propose NeuralGrok, a gradient-based approach that learns an optimal gradient transformation to accelerate the generalization of transformers in arithmetic tasks. Our experiments demonstrate that NeuralGrok significantly accelerates generalization, particularly in challenging arithmetic tasks. We also show that NeuralGrok promotes a more stable training paradigm, constantly reducing the model's complexity.
arXiv Detail & Related papers (2025-04-24T04:41:35Z) - Beyond Progress Measures: Theoretical Insights into the Mechanism of Grokking [50.465604300990904]
Grokking refers to the abrupt improvement in test accuracy after extended overfitting. In this work, we investigate the grokking mechanism underlying the Transformer in the task of prime number operations.
arXiv Detail & Related papers (2025-04-04T04:42:38Z) - The Spectral Bias of Shallow Neural Network Learning is Shaped by the Choice of Non-linearity [0.7499722271664144]
We study how non-linear activation functions contribute to shaping neural networks' implicit bias. We show that local dynamical attractors facilitate the formation of clusters of hyperplanes where the input to a neuron's activation function is zero.
arXiv Detail & Related papers (2025-03-13T17:36:46Z) - Exact, Tractable Gauss-Newton Optimization in Deep Reversible Architectures Reveal Poor Generalization [52.16435732772263]
Second-order optimization has been shown to accelerate the training of deep neural networks in many applications.
However, generalization properties of second-order methods are still being debated.
We show for the first time that exact Gauss-Newton (GN) updates take on a tractable form in a class of deep architectures.
arXiv Detail & Related papers (2024-11-12T17:58:40Z) - Generalization of Scaled Deep ResNets in the Mean-Field Regime [55.77054255101667]
We investigate scaled ResNets in the limit of infinitely deep and wide neural networks.
Our results offer new insights into the generalization ability of deep ResNet beyond the lazy training regime.
arXiv Detail & Related papers (2024-03-14T21:48:00Z) - Towards Demystifying the Generalization Behaviors When Neural Collapse
Emerges [132.62934175555145]
Neural Collapse (NC) is a well-known phenomenon of deep neural networks in the terminal phase of training (TPT).
We propose a theoretical explanation for why continued training can still improve accuracy on the test set, even after the train accuracy has reached 100%.
We refer to this newly discovered property as "non-conservative generalization".
arXiv Detail & Related papers (2023-10-12T14:29:02Z) - Stochastic Collapse: How Gradient Noise Attracts SGD Dynamics Towards Simpler Subnetworks [28.87871359825978]
We reveal a strong implicit bias of stochastic gradient descent (SGD) that drives overly expressive networks to much simpler subnetworks.
We focus on two classes of invariant sets that correspond to simpler (sparse or low-rank) subnetworks and commonly appear in modern architectures.
We observe empirically the existence of attractive invariant sets in trained deep neural networks, implying that SGD dynamics often collapse onto simple subnetworks with either vanishing or redundant neurons.
arXiv Detail & Related papers (2023-06-07T08:44:51Z) - Machine learning in and out of equilibrium [58.88325379746631]
Our study uses a Fokker-Planck approach, adapted from statistical physics, to explore the parallels between machine learning and physical systems in and out of equilibrium.
We focus in particular on the stationary state of the system in the long-time limit, which in conventional SGD is out of equilibrium.
We propose a new variation of stochastic gradient Langevin dynamics (SGLD) that harnesses without-replacement minibatching (a generic SGLD step is sketched after this list).
arXiv Detail & Related papers (2023-06-06T09:12:49Z) - On generalization bounds for deep networks based on loss surface
implicit regularization [5.68558935178946]
That modern deep neural networks generalize well despite having a large number of parameters contradicts classical statistical learning theory.
arXiv Detail & Related papers (2022-01-12T16:41:34Z) - Partial local entropy and anisotropy in deep weight spaces [0.0]
We refine a recently-proposed class of local entropic loss functions by restricting the smoothening regularization to only a subset of weights.
The new loss functions, referred to as partial local entropies, can adapt to weight-space anisotropy and thus outperform their isotropic counterparts.
arXiv Detail & Related papers (2020-07-17T16:16:18Z)
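As a companion to the "Machine learning in and out of equilibrium" entry above, the sketch below shows a minimal, generic stochastic gradient Langevin dynamics (SGLD) update with without-replacement minibatching. It illustrates only the standard update rule under a placeholder least-squares problem; it is not that paper's exact algorithm, and the model, data, and hyperparameters are assumptions for illustration.

```python
# Generic SGLD sketch with without-replacement minibatching (placeholder problem).
import numpy as np

rng = np.random.default_rng(0)

# Placeholder linear-regression problem: loss_i(theta) = 0.5 * (x_i . theta - y_i)^2
X = rng.normal(size=(256, 10))
theta_true = rng.normal(size=10)
y = X @ theta_true + 0.1 * rng.normal(size=256)

def grad_minibatch(theta, idx):
    """Mean gradient of the squared loss over the minibatch given by idx."""
    residual = X[idx] @ theta - y[idx]
    return X[idx].T @ residual / len(idx)

theta = np.zeros(10)
lr, temperature, batch_size = 1e-2, 1e-4, 32

for epoch in range(50):
    # Without-replacement minibatching: shuffle once per epoch, then walk
    # through the permutation so every sample is used exactly once per epoch.
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        g = grad_minibatch(theta, idx)
        noise = np.sqrt(2.0 * lr * temperature) * rng.normal(size=theta.shape)
        theta = theta - lr * g + noise   # SGLD: gradient step plus injected noise

print("parameter error:", np.linalg.norm(theta - theta_true))
```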