Related papers: Let Me Grok for You: Accelerating Grokking via Embedding Transfer from a Weaker Model

URL: http://arxiv.org/abs/2504.13292v1
Date: Thu, 17 Apr 2025 19:08:40 GMT
Title: Let Me Grok for You: Accelerating Grokking via Embedding Transfer from a Weaker Model
Authors: Zhiwei Xu, Zhiyu Ni, Yixin Wang, Wei Hu
Abstract summary: "Grokking" is a phenomenon where a neural network first memorizes training data and generalizes poorly, but then suddenly transitions to near-perfect generalization after prolonged training. This paper proposes GrokTransfer, a simple and principled method for accelerating grokking in training neural networks. We rigorously prove that, on a synthetic XOR task where delayed generalization always occurs in normal training, GrokTransfer enables the target model to generalize directly without delay.
Score: 26.536857505794092
License: http://creativecommons.org/licenses/by/4.0/
Abstract: "Grokking" is a phenomenon where a neural network first memorizes training data and generalizes poorly, but then suddenly transitions to near-perfect generalization after prolonged training. While intriguing, this delayed generalization phenomenon compromises predictability and efficiency. Ideally, models should generalize directly without delay. To this end, this paper proposes GrokTransfer, a simple and principled method for accelerating grokking in training neural networks, based on the key observation that data embedding plays a crucial role in determining whether generalization is delayed. GrokTransfer first trains a smaller, weaker model to reach a nontrivial (but far from optimal) test performance. Then, the learned input embedding from this weaker model is extracted and used to initialize the embedding in the target, stronger model. We rigorously prove that, on a synthetic XOR task where delayed generalization always occurs in normal training, GrokTransfer enables the target model to generalize directly without delay. Moreover, we demonstrate that, across empirical studies of different tasks, GrokTransfer effectively reshapes the training dynamics and eliminates delayed generalization, for both fully-connected neural networks and Transformers.
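
The embedding-transfer step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the weak model's embedding is matched to the target model's embedding width, so the fixed random linear projection below is an assumption made for illustration, not the authors' method. The function name `init_target_embedding`, the shapes, and the random stand-in for the weak model's learned embedding are all hypothetical.

```python
import numpy as np


def init_target_embedding(weak_embedding, target_dim, seed=0):
    """Build an initial embedding table for the target model.

    weak_embedding: array of shape (vocab_size, weak_dim), taken from the
        smaller model after it reaches nontrivial test performance.
    target_dim: embedding width of the larger target model.

    ASSUMPTION: when the widths differ, the weak embedding is mapped to the
    target width with a fixed random linear projection. The abstract does
    not specify this detail.
    """
    vocab_size, weak_dim = weak_embedding.shape
    if target_dim == weak_dim:
        # Same width: copy the learned table directly.
        return weak_embedding.copy()
    rng = np.random.default_rng(seed)
    # Scale by 1/sqrt(weak_dim) so projected rows keep a comparable norm.
    projection = rng.normal(0.0, 1.0 / np.sqrt(weak_dim), size=(weak_dim, target_dim))
    return weak_embedding @ projection


# Hypothetical stand-in for an embedding learned by the weak model.
weak_embedding = np.random.default_rng(1).normal(size=(97, 4))
target_embedding = init_target_embedding(weak_embedding, target_dim=128)
print(target_embedding.shape)
```

In a full pipeline the returned array would be assigned to the target model's embedding layer before training starts, while the remaining layers keep their usual random initialization.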

Related papers

Bigger Isn't Always Memorizing: Early Stopping Overparameterized Diffusion Models [51.03144354630136]
Generalization in natural data domains is progressively achieved during training before the onset of memorization. Generalization vs. memorization is then best understood as a competition between time scales. We show that this phenomenology is recovered in diffusion models learning a simple probabilistic context-free grammar with random rules.
arXiv Detail & Related papers (2025-05-22T17:40:08Z)
Exact, Tractable Gauss-Newton Optimization in Deep Reversible Architectures Reveal Poor Generalization [52.16435732772263]
Second-order optimization has been shown to accelerate the training of deep neural networks in many applications. However, generalization properties of second-order methods are still being debated. We show for the first time that exact Gauss-Newton (GN) updates take on a tractable form in a class of deep architectures.
arXiv Detail & Related papers (2024-11-12T17:58:40Z)
Understanding Grokking Through A Robustness Viewpoint [3.23379981095083]
We show that the popular $l_2$ norm (metric) of the neural network is actually a sufficient condition for grokking. We propose new metrics based on robustness and information theory and find that our new metrics correlate well with the grokking phenomenon and may be used to predict grokking.
arXiv Detail & Related papers (2023-11-11T15:45:44Z)
Grokking in Linear Estimators -- A Solvable Model that Groks without Understanding [1.1510009152620668]
Grokking is where a model learns to generalize long after it has fit the training data. We show analytically and numerically that grokking can surprisingly occur in linear networks performing linear tasks.
arXiv Detail & Related papers (2023-10-25T08:08:44Z)
Grokking as the Transition from Lazy to Rich Training Dynamics [35.186196991224286]
Grokking occurs when the train loss of a neural network decreases much earlier than its test loss. Key determinants of grokking are the rate of feature learning and the alignment of the initial features with the target function.
arXiv Detail & Related papers (2023-10-09T19:33:21Z)
Explaining grokking through circuit efficiency [4.686548060335767]
Grokking is the phenomenon in which a network with perfect training accuracy but poor generalisation transitions to perfect generalisation. We make and confirm four novel predictions about grokking, providing significant evidence in favour of our explanation. We demonstrate two novel and surprising behaviours: ungrokking, in which a network regresses from perfect to low test accuracy, and semi-grokking, in which a network shows delayed generalisation to partial rather than perfect test accuracy.
arXiv Detail & Related papers (2023-09-05T17:00:24Z)
Decouple Graph Neural Networks: Train Multiple Simple GNNs Simultaneously Instead of One [60.5818387068983]
Graph neural networks (GNN) suffer from severe inefficiency. We propose to decouple a multi-layer GNN as multiple simple modules for more efficient training. We show that the proposed framework is highly efficient with reasonable performance.
arXiv Detail & Related papers (2023-04-20T07:21:32Z)
TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization [89.54947228958494]
This paper focuses on the fine-tuning of an adversarially pre-trained model in various classification tasks. We propose a novel statistics-based approach, Two-WIng NormliSation (TWINS) fine-tuning framework. TWINS is shown to be effective on a wide range of image classification datasets in terms of both generalization and robustness.
arXiv Detail & Related papers (2023-03-20T14:12:55Z)
Theoretical Characterization of How Neural Network Pruning Affects its Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization. It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero. More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
arXiv Detail & Related papers (2023-01-01T03:10:45Z)
Neural networks trained with SGD learn distributions of increasing complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics. They then exploit higher-order statistics only later during training. We discuss the relation of this distributional simplicity bias (DSB) to other simplicity biases and consider its implications for the principle of universality in learning.
arXiv Detail & Related papers (2022-11-21T15:27:22Z)
Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks. We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space. We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z)
Learning Non-Vacuous Generalization Bounds from Optimization [8.294831479902658]
We present a simple yet non-vacuous generalization bound from the optimization perspective. We achieve this goal by leveraging that the hypothesis set accessed by gradient algorithms is essentially fractal-like. Numerical studies demonstrate that our approach is able to yield plausible generalization guarantees for modern neural networks.
arXiv Detail & Related papers (2022-06-09T08:59:46Z)

This list is automatically generated from the titles and abstracts of the papers on this site.