When Data Falls Short: Grokking Below the Critical Threshold
- URL: http://arxiv.org/abs/2511.04760v1
- Date: Thu, 06 Nov 2025 19:18:51 GMT
- Title: When Data Falls Short: Grokking Below the Critical Threshold
- Authors: Vaibhav Singh, Eugene Belilovsky, Rahaf Aljundi,
- Abstract summary: We investigate the phenomenon of grokking, where models exhibit delayed generalization following overfitting on training data.<n>We first show that Knowledge Distillation (KD) from a model that has already grokked on a distribution (p1) can induce and accelerate grokking on a different distribution (p2)<n>We then study training on the joint distribution (p1, p2) and demonstrate that while standard supervised training fails when either distribution has insufficient data, distilling from models grokked on the individual distributions enables generalization.
- Score: 21.974289057222133
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we investigate the phenomenon of grokking, where models exhibit delayed generalization following overfitting on training data. We focus on data-scarce regimes where the number of training samples falls below the critical threshold, making grokking unobservable, and on practical scenarios involving distribution shift. We first show that Knowledge Distillation (KD) from a model that has already grokked on a distribution (p1) can induce and accelerate grokking on a different distribution (p2), even when the available data lies below the critical threshold. This highlights the value of KD for deployed models that must adapt to new distributions under limited data. We then study training on the joint distribution (p1, p2) and demonstrate that while standard supervised training fails when either distribution has insufficient data, distilling from models grokked on the individual distributions enables generalization. Finally, we examine a continual pretraining setup, where a grokked model transitions from p1 to p2, and find that KD both accelerates generalization and mitigates catastrophic forgetting, achieving strong performance even with only 10% of the data. Together, our results provide new insights into the mechanics of grokking under knowledge transfer and underscore the central role of KD in enabling generalization in low-data and evolving distribution settings.
Related papers
- Distributional Training Data Attribution: What do Influence Functions Sample? [25.257922996567178]
We introduce distributional training data attribution (d-TDA)<n>The goal of d-TDA is to predict how the distribution of model outputs depends upon the dataset.<n>We find that influence functions (IFs) are'secretly distributional'
arXiv Detail & Related papers (2025-06-15T21:02:36Z) - Grokking Explained: A Statistical Phenomenon [4.113597666007784]
Grokking, or delayed generalization, is an intriguing learning phenomenon where test set loss decreases sharply only after a model's training set loss has converged.<n>This paper formalizes and investigates grokking, highlighting that a key factor in its emergence is a distribution shift between training and test data.
arXiv Detail & Related papers (2025-02-03T19:28:11Z) - Rejection via Learning Density Ratios [50.91522897152437]
Classification with rejection emerges as a learning paradigm which allows models to abstain from making predictions.<n>We propose a different distributional perspective, where we seek to find an idealized data distribution which maximizes a pretrained model's performance.<n>Our framework is tested empirically over clean and noisy datasets.
arXiv Detail & Related papers (2024-05-29T01:32:17Z) - Consistent Diffusion Models: Mitigating Sampling Drift by Learning to be
Consistent [97.64313409741614]
We propose to enforce a emphconsistency property which states that predictions of the model on its own generated data are consistent across time.
We show that our novel training objective yields state-of-the-art results for conditional and unconditional generation in CIFAR-10 and baseline improvements in AFHQ and FFHQ.
arXiv Detail & Related papers (2023-02-17T18:45:04Z) - Proposal Distribution Calibration for Few-Shot Object Detection [65.19808035019031]
In few-shot object detection (FSOD), the two-step training paradigm is widely adopted to mitigate the severe sample imbalance.
Unfortunately, the extreme data scarcity aggravates the proposal distribution bias, hindering the RoI head from evolving toward novel classes.
We introduce a simple yet effective proposal distribution calibration (PDC) approach to neatly enhance the localization and classification abilities of the RoI head.
arXiv Detail & Related papers (2022-12-15T05:09:11Z) - Learnable Distribution Calibration for Few-Shot Class-Incremental
Learning [122.2241120474278]
Few-shot class-incremental learning (FSCIL) faces challenges of memorizing old class distributions and estimating new class distributions given few training samples.
We propose a learnable distribution calibration (LDC) approach, with the aim to systematically solve these two challenges using a unified framework.
arXiv Detail & Related papers (2022-10-01T09:40:26Z) - Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z) - How to Learn when Data Gradually Reacts to Your Model [10.074466859579571]
We propose a new algorithm, Stateful Performative Gradient Descent (Stateful PerfGD), for minimizing the performative loss even in the presence of these effects.
Our experiments confirm that Stateful PerfGD substantially outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2021-12-13T22:05:26Z) - Robust Generalization despite Distribution Shift via Minimum
Discriminating Information [46.164498176119665]
We introduce a modeling framework where, in addition to training data, we have partial structural knowledge of the shifted test distribution.
We employ the principle of minimum discriminating information to embed the available prior knowledge.
We obtain explicit generalization bounds with respect to the unknown shifted distribution.
arXiv Detail & Related papers (2021-06-08T15:25:35Z) - Mind the Trade-off: Debiasing NLU Models without Degrading the
In-distribution Performance [70.31427277842239]
We introduce a novel debiasing method called confidence regularization.
It discourages models from exploiting biases while enabling them to receive enough incentive to learn from all the training examples.
We evaluate our method on three NLU tasks and show that, in contrast to its predecessors, it improves the performance on out-of-distribution datasets.
arXiv Detail & Related papers (2020-05-01T11:22:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.