Understanding Grokking Through A Robustness Viewpoint
- URL: http://arxiv.org/abs/2311.06597v2
- Date: Fri, 2 Feb 2024 14:03:32 GMT
- Title: Understanding Grokking Through A Robustness Viewpoint
- Authors: Zhiquan Tan, Weiran Huang
- Abstract summary: We show that the popular $l_2$ weight norm (metric) of the neural network is actually a sufficient condition for grokking.
We propose new metrics based on robustness and information theory and find that our new metrics correlate well with the grokking phenomenon and may be used to predict grokking.
- Score: 3.23379981095083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, an interesting phenomenon called grokking has gained much
attention, where generalization occurs long after the models have initially
overfitted the training data. We try to understand this seemingly strange
phenomenon through the robustness of the neural network. From a robustness
perspective, we show that the popular $l_2$ weight norm (metric) of the neural
network is actually a sufficient condition for grokking. Based on the previous
observations, we propose perturbation-based methods to speed up the
generalization process. In addition, we examine the standard training process
on the modulo addition dataset and find that it hardly learns other basic group
operations before grokking, for example, the commutative law. Interestingly,
the speed-up of generalization when using our proposed method can be explained
by learning the commutative law, a necessary condition when the model groks on
the test dataset. We also empirically find that the $l_2$ norm does not correlate
with grokking on the test data in a timely way, so we propose new metrics based on
robustness and information theory and find that our new metrics correlate well
with the grokking phenomenon and may be used to predict grokking.
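As a concrete illustration of the quantities the abstract refers to, below is a minimal sketch (not the authors' code) of the standard grokking setup: a small MLP trained with weight decay on modular addition, while logging the $l_2$ norm of the weights, test accuracy, and a simple commutativity check (whether the model predicts the same class for (a, b) and (b, a)). The architecture, optimizer settings, and train/test split are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): train a small MLP on modular addition
# and log the l2 weight norm, test accuracy, and a commutativity check
# (same prediction for (a, b) and (b, a)). Hyperparameters are assumptions.
import torch
import torch.nn as nn

P = 97                                               # modulus of the addition task
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
train_idx, test_idx = perm[: len(perm) // 2], perm[len(perm) // 2:]

def encode(ab):
    # concatenate one-hot encodings of the two operands
    return torch.cat([nn.functional.one_hot(ab[:, 0], P),
                      nn.functional.one_hot(ab[:, 1], P)], dim=1).float()

model = nn.Sequential(nn.Linear(2 * P, 256), nn.ReLU(), nn.Linear(256, P))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

for step in range(20000):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(encode(pairs[train_idx])),
                                       labels[train_idx])
    loss.backward()
    opt.step()
    if step % 500 == 0:
        with torch.no_grad():
            l2 = torch.sqrt(sum(p.pow(2).sum() for p in model.parameters()))
            test_pred = model(encode(pairs[test_idx])).argmax(1)
            swap_pred = model(encode(pairs[test_idx][:, [1, 0]])).argmax(1)
            test_acc = (test_pred == labels[test_idx]).float().mean()
            commut = (test_pred == swap_pred).float().mean()
        print(f"step {step}: loss {loss.item():.3f}, test acc {test_acc.item():.3f}, "
              f"l2 norm {l2.item():.1f}, commutativity {commut.item():.3f}")
```

Under a setup like this, the training loss typically drops early while test accuracy, and with it the commutativity score, rises much later; that lag is the grokking gap the abstract discusses.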
Related papers
- Grokking at the Edge of Linear Separability [1.024113475677323]
We analyze the long-time dynamics of logistic classification on a random feature model with a constant label.
We find that Grokking is amplified when classification is applied to training sets which are on the verge of linear separability.
arXiv Detail & Related papers (2024-10-06T14:08:42Z) - Anomaly Detection by Context Contrasting [57.695202846009714]
Anomaly detection focuses on identifying samples that deviate from the norm.
Recent advances in self-supervised learning have shown great promise in this regard.
We propose Con$_2$, which learns through context augmentations.
arXiv Detail & Related papers (2024-05-29T07:59:06Z) - Progress Measures for Grokking on Real-world Tasks [0.0]
Grokking is a phenomenon where machine learning models generalize long after overfitting.
This paper explores grokking in real-world datasets using deep neural networks for classification under the cross-entropy loss.
arXiv Detail & Related papers (2024-05-21T13:06:41Z) - Grokking in Linear Estimators -- A Solvable Model that Groks without
Understanding [1.1510009152620668]
Grokking is where a model learns to generalize long after it has fit the training data.
We show analytically and numerically that grokking can surprisingly occur in linear networks performing linear tasks.
arXiv Detail & Related papers (2023-10-25T08:08:44Z) - Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z) - Grokking phase transitions in learning local rules with gradient descent [0.0]
We show that grokking is a phase transition and find exact analytic expressions for the critical exponents, grokking probability, and grokking time distribution.
We numerically analyse the connection between structure formation and grokking.
arXiv Detail & Related papers (2022-10-26T11:07:04Z) - Intersection of Parallels as an Early Stopping Criterion [64.8387564654474]
We propose a method to spot an early stopping point in the training iterations without the need for a validation set.
For a wide range of learning rates, our method, called Cosine-Distance Criterion (CDC), leads to better generalization on average than all the methods that we compare against.
arXiv Detail & Related papers (2022-08-19T19:42:41Z) - Information-Theoretic Generalization Bounds for Iterative
Semi-Supervised Learning [81.1071978288003]
In particular, we seek to understand the behaviour of the generalization error of iterative SSL algorithms using information-theoretic principles.
Our theoretical results suggest that when the class conditional variances are not too large, the upper bound on the generalization error decreases monotonically with the number of iterations, but quickly saturates.
arXiv Detail & Related papers (2021-10-03T05:38:49Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Squared $\ell_2$ Norm as Consistency Loss for Leveraging Augmented Data
to Learn Robust and Invariant Representations [76.85274970052762]
Regularizing distance between embeddings/representations of original samples and augmented counterparts is a popular technique for improving robustness of neural networks.
In this paper, we explore these various regularization choices, seeking to provide a general understanding of how we should regularize the embeddings.
We show that the generic approach we identified (squared $\ell_2$ regularized augmentation) outperforms several recent methods, which are each specially designed for one task; a minimal sketch of this consistency loss appears after this list.
arXiv Detail & Related papers (2020-11-25T22:40:09Z) - Benign overfitting in ridge regression [0.0]
We provide non-asymptotic generalization bounds for overparametrized ridge regression.
We identify when small or negative regularization is sufficient for obtaining small generalization error.
arXiv Detail & Related papers (2020-09-29T20:00:31Z)
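As referenced in the squared $\ell_2$ consistency-loss entry above, here is a minimal sketch (not that paper's implementation) of the general idea: add the squared $\ell_2$ distance between the embeddings of a sample and of its augmented counterpart to the task loss. `encoder`, `classifier`, and `augment` are placeholder callables introduced only for illustration.

```python
# Minimal sketch of a squared l2 consistency loss (illustrative, not the
# paper's code): penalize the squared distance between the embedding of a
# sample and the embedding of its augmented counterpart, alongside the task loss.
import torch.nn as nn

def consistency_loss_step(encoder, classifier, augment, x, y, lam=1.0):
    z_clean = encoder(x)                    # embeddings of the original batch
    z_aug = encoder(augment(x))             # embeddings of the augmented batch
    task_loss = nn.functional.cross_entropy(classifier(z_clean), y)
    # squared l2 distance between the two embeddings, averaged over the batch
    consistency = (z_clean - z_aug).pow(2).sum(dim=1).mean()
    return task_loss + lam * consistency
```

The weight `lam` trades off task accuracy against invariance to the chosen augmentation.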