Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test
- URL: http://arxiv.org/abs/2506.21551v2
- Date: Thu, 03 Jul 2025 01:00:21 GMT
- Title: Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test
- Authors: Ziyue Li, Chenrui Fan, Tianyi Zhou
- Abstract summary: We conduct the first study of grokking on checkpoints during one-pass pretraining of a 7B large language model (LLM), i.e., OLMoE. Our study, for the first time, verifies that grokking still happens in the pretraining of large-scale foundation models. We develop two novel metrics to quantify pathway distance and the complexity of a single pathway.
- Score: 19.213961869113188
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Grokking, i.e., test performance that keeps improving long after the training loss has converged, has recently been observed in neural network training, making the mechanisms behind generalization and other emerging capabilities, such as reasoning, mysterious. While prior studies usually train small models on a few toy or highly specific tasks for thousands of epochs, we conduct the first study of grokking on checkpoints from the one-pass pretraining of a 7B large language model (LLM), OLMoE. We compute the training loss and evaluate generalization on diverse benchmark tasks, including math reasoning, code generation, and commonsense/domain-specific knowledge retrieval. Our study verifies, for the first time, that grokking still happens in the pretraining of large-scale foundation models, though different data may enter grokking stages asynchronously. We further demystify grokking's "emergence of generalization" by investigating LLM internal dynamics. Specifically, we find that training samples' pathways (i.e., expert choices across layers) evolve from random and instance-specific to more structured and shareable between samples, and that the complexity of a sample's pathway decreases despite the converged loss. These dynamics indicate a memorization-to-generalization "knowledge digestion", providing a mechanistic explanation of delayed generalization. In the study, we develop two novel metrics that quantify pathway distance and the complexity of a single pathway, and we show that they predict generalization improvement on diverse downstream tasks. They are efficient, simple to compute, and depend solely on training data; hence, they have practical value for pretraining, enabling us to monitor generalization performance without finetuning or testing. Theoretically, we show that more structured pathways reduce model complexity and improve the generalization bound.
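The abstract names the two pathway metrics but does not define them. Below is a minimal sketch of how such metrics could be computed, assuming a sample's pathway is the sequence of top-1 expert indices chosen at each MoE layer of OLMoE; the normalized Hamming distance and the entropy of the expert-usage histogram are illustrative stand-ins, not the paper's exact definitions.

```python
import numpy as np

# A sample's "pathway" is modeled here as the sequence of expert indices chosen
# at each MoE layer (top-1 routing): pathway[l] = expert id at layer l.
# Both metrics below are illustrative stand-ins for the paper's definitions.

def pathway_distance(path_a: np.ndarray, path_b: np.ndarray) -> float:
    """Fraction of layers where two samples are routed to different experts
    (normalized Hamming distance). Lower = more shared, structured routing."""
    assert path_a.shape == path_b.shape
    return float(np.mean(path_a != path_b))

def pathway_complexity(pathway: np.ndarray, n_experts: int) -> float:
    """Shannon entropy of the expert-usage histogram along one sample's pathway.
    A pathway that reuses few experts has low entropy (lower 'complexity')."""
    counts = np.bincount(pathway, minlength=n_experts).astype(float)
    probs = counts / counts.sum()
    probs = probs[probs > 0]
    return float(-(probs * np.log2(probs)).sum())

# Tracking the mean pairwise distance over training samples' pathways across
# pretraining checkpoints would mirror the abstract's claim that pathways
# become more shareable even after the training loss converges.
rng = np.random.default_rng(0)
paths = rng.integers(0, 64, size=(8, 16))  # 8 samples, 16 MoE layers, 64 experts
dists = [pathway_distance(paths[i], paths[j])
         for i in range(len(paths)) for j in range(i + 1, len(paths))]
print(f"mean pairwise pathway distance: {np.mean(dists):.3f}")
print(f"complexity of sample 0: {pathway_complexity(paths[0], 64):.3f} bits")
```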
Related papers
- Grokking Explained: A Statistical Phenomenon [4.113597666007784]
Grokking, or delayed generalization, is an intriguing learning phenomenon where test set loss decreases sharply only after a model's training set loss has converged. This paper formalizes and investigates grokking, highlighting that a key factor in its emergence is a distribution shift between training and test data.
arXiv Detail & Related papers (2025-02-03T19:28:11Z)
- Exact, Tractable Gauss-Newton Optimization in Deep Reversible Architectures Reveal Poor Generalization [52.16435732772263]
Second-order optimization has been shown to accelerate the training of deep neural networks in many applications.
However, generalization properties of second-order methods are still being debated.
We show for the first time that exact Gauss-Newton (GN) updates take on a tractable form in a class of deep architectures.
arXiv Detail & Related papers (2024-11-12T17:58:40Z)
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies (a rough sketch of this metric appears below).
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
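The entry above does not define its metric; the sketch below gives one plausible reading, assuming per-checkpoint records of whether each training sample was answered correctly and whether the model reproduced the reference verbatim (taken here as "memorization"). Both the data layout and the memorization criterion are assumptions for illustration, not the cited paper's exact procedure.

```python
from typing import Sequence

def pre_memorization_train_accuracy(
    correct: Sequence[Sequence[bool]],      # correct[i][t]: sample i answered correctly at checkpoint t
    exact_match: Sequence[Sequence[bool]],  # exact_match[i][t]: reference reproduced verbatim at checkpoint t
) -> float:
    """Mean per-sample accuracy over checkpoints before the sample is first
    'memorized' (reproduced verbatim). Illustrative reading of the metric only."""
    per_sample = []
    for corr, mem in zip(correct, exact_match):
        # Keep only checkpoints strictly before the first verbatim reproduction.
        cutoff = mem.index(True) if True in mem else len(mem)
        window = corr[:cutoff]
        per_sample.append(sum(window) / len(window) if window else 0.0)
    return sum(per_sample) / len(per_sample)

# Toy example: 2 training samples tracked across 4 checkpoints.
correct = [[False, True, True, True], [False, False, True, True]]
exact   = [[False, False, True, True], [False, False, False, False]]
print(pre_memorization_train_accuracy(correct, exact))  # (1/2 + 2/4) / 2 = 0.5
```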
- The Right Time Matters: Data Arrangement Affects Zero-Shot Generalization in Instruction Tuning [86.19804569376333]
We show that zero-shot generalization happens very early during instruction tuning. We propose a more grounded training data arrangement framework, Test-centric Multi-turn Arrangement.
arXiv Detail & Related papers (2024-06-17T16:40:21Z)
- Theoretical Characterization of the Generalization Performance of Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features.
We find new and interesting properties that do not exist in single-task linear regression.
Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z)
- Learning Curves for Sequential Training of Neural Networks: Self-Knowledge Transfer and Forgetting [9.734033555407406]
We consider neural networks in the neural tangent kernel regime that continually learn target functions from task to task.
We investigate a variant of continual learning where the model learns the same target function in multiple tasks.
Even for the same target, the trained model shows some transfer and forgetting depending on the sample size of each task.
arXiv Detail & Related papers (2021-12-03T00:25:01Z)
- What training reveals about neural network complexity [80.87515604428346]
This work explores the hypothesis that the complexity of the function a deep neural network (NN) is learning can be deduced by how fast its weights change during training.
Our results support the hypothesis that good training behavior can be a useful bias towards good generalization (a minimal sketch of this weight-change signal appears below).
arXiv Detail & Related papers (2021-06-08T08:58:00Z)
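As a rough illustration of the signal the entry above points at, the sketch below tracks the per-step weight displacement during training as a proxy for how fast the network's function is changing. The tiny model, synthetic data, and optimizer settings are placeholder assumptions, not the cited paper's estimator.

```python
import torch
import torch.nn as nn

# Illustrative only: measure how fast weights move during training, a quantity
# the cited work relates to the complexity of the learned function.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
x, y = torch.randn(256, 10), torch.randn(256, 1)

def flat_params(m: nn.Module) -> torch.Tensor:
    """Concatenate all parameters into one flat vector (detached)."""
    return torch.cat([p.detach().flatten() for p in m.parameters()])

prev = flat_params(model)
for step in range(1, 101):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
    cur = flat_params(model)
    speed = (cur - prev).norm().item()  # per-step weight displacement ||theta_{t+1} - theta_t||
    prev = cur
    if step % 25 == 0:
        print(f"step {step:3d}: weight-change speed = {speed:.4f}")
```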
- Double Descent and Other Interpolation Phenomena in GANs [2.7007335372861974]
We study the generalization error as a function of latent space dimension in generative adversarial networks (GANs).
We develop a novel pseudo-supervised learning approach for GANs where the training utilizes pairs of fabricated (noise) inputs in conjunction with real output samples.
While our analysis focuses mostly on linear models, we also show how these insights apply to improving the generalization of nonlinear, multilayer GANs.
arXiv Detail & Related papers (2021-06-07T23:07:57Z)
- Exploring Memorization in Adversarial Training [58.38336773082818]
We investigate the memorization effect in adversarial training (AT) for promoting a deeper understanding of capacity, convergence, generalization, and especially robust overfitting.
We propose a new mitigation algorithm motivated by detailed memorization analyses.
arXiv Detail & Related papers (2021-06-03T05:39:57Z)