Understanding Grokking Through A Robustness Viewpoint
- URL: http://arxiv.org/abs/2311.06597v2
- Date: Fri, 2 Feb 2024 14:03:32 GMT
- Title: Understanding Grokking Through A Robustness Viewpoint
- Authors: Zhiquan Tan, Weiran Huang
- Abstract summary: We show that the popular $l_2$ weight norm (metric) of the neural network is actually a sufficient condition for grokking.
We propose new metrics based on robustness and information theory and find that our new metrics correlate well with the grokking phenomenon and may be used to predict grokking.
- Score: 3.23379981095083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, an interesting phenomenon called grokking has gained much
attention, where generalization occurs long after the models have initially
overfitted the training data. We try to understand this seemingly strange
phenomenon through the robustness of the neural network. From a robustness
perspective, we show that the popular $l_2$ weight norm (metric) of the neural
network is actually a sufficient condition for grokking. Based on the previous
observations, we propose perturbation-based methods to speed up the
generalization process. In addition, we examine the standard training process
on the modulo addition dataset and find that it hardly learns other basic group
operations before grokking, for example, the commutative law. Interestingly,
the speed-up of generalization when using our proposed method can be explained
by learning the commutative law, a necessary condition when the model groks on
the test dataset. We also empirically find that the $l_2$ norm does not correlate
with grokking on the test data in a timely way, so we propose new metrics based on
robustness and information theory and find that our new metrics correlate well
with the grokking phenomenon and may be used to predict grokking.
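As a concrete illustration of the quantities the abstract refers to, below is a minimal sketch (not the authors' code) of the standard grokking setup: a small MLP trained with weight decay on modular addition, while logging the $l_2$ norm of the weights, test accuracy, and a simple commutativity check (whether the model predicts the same class for (a, b) and (b, a)). The architecture, optimizer settings, and train/test split are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): train a small MLP on modular addition
# and log the l2 weight norm, test accuracy, and a commutativity check
# (same prediction for (a, b) and (b, a)). Hyperparameters are assumptions.
import torch
import torch.nn as nn

P = 97                                               # modulus of the addition task
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
train_idx, test_idx = perm[: len(perm) // 2], perm[len(perm) // 2:]

def encode(ab):
    # concatenate one-hot encodings of the two operands
    return torch.cat([nn.functional.one_hot(ab[:, 0], P),
                      nn.functional.one_hot(ab[:, 1], P)], dim=1).float()

model = nn.Sequential(nn.Linear(2 * P, 256), nn.ReLU(), nn.Linear(256, P))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

for step in range(20000):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(encode(pairs[train_idx])),
                                       labels[train_idx])
    loss.backward()
    opt.step()
    if step % 500 == 0:
        with torch.no_grad():
            l2 = torch.sqrt(sum(p.pow(2).sum() for p in model.parameters()))
            test_pred = model(encode(pairs[test_idx])).argmax(1)
            swap_pred = model(encode(pairs[test_idx][:, [1, 0]])).argmax(1)
            test_acc = (test_pred == labels[test_idx]).float().mean()
            commut = (test_pred == swap_pred).float().mean()
        print(f"step {step}: loss {loss.item():.3f}, test acc {test_acc.item():.3f}, "
              f"l2 norm {l2.item():.1f}, commutativity {commut.item():.3f}")
```

Under a setup like this, the training loss typically drops early while test accuracy, and with it the commutativity score, rises much later; that lag is the grokking gap the abstract discusses.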
Related papers
- Grokking at the Edge of Linear Separability [1.024113475677323]
We analyze the long-time dynamics of logistic classification on a random feature model with a constant label.
We find that Grokking is amplified when classification is applied to training sets which are on the verge of linear separability.
arXiv Detail & Related papers (2024-10-06T14:08:42Z) - Anomaly Detection by Context Contrasting [57.695202846009714]
Anomaly detection focuses on identifying samples that deviate from the norm.
Recent advances in self-supervised learning have shown great promise in this regard.
We propose Con$_2$, which learns through context augmentations.
arXiv Detail & Related papers (2024-05-29T07:59:06Z) - Progress Measures for Grokking on Real-world Tasks [0.0]
Grokking is a phenomenon where machine learning models generalize long after overfitting.
This paper explores grokking in real-world datasets using deep neural networks for classification under the cross-entropy loss.
arXiv Detail & Related papers (2024-05-21T13:06:41Z) - Grokking in Linear Estimators -- A Solvable Model that Groks without
Understanding [1.1510009152620668]
Grokking is where a model learns to generalize long after it has fit the training data.
We show analytically and numerically that grokking can surprisingly occur in linear networks performing linear tasks.
arXiv Detail & Related papers (2023-10-25T08:08:44Z) - Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z) - Grokking phase transitions in learning local rules with gradient descent [0.0]
We show that grokking is a phase transition and find exact analytic expressions for the critical exponents, grokking probability, and grokking time distribution.
We numerically analyse the connection between structure formation and grokking.
arXiv Detail & Related papers (2022-10-26T11:07:04Z) - Intersection of Parallels as an Early Stopping Criterion [64.8387564654474]
We propose a method to spot an early stopping point in the training iterations without the need for a validation set.
For a wide range of learning rates, our method, called Cosine-Distance Criterion (CDC), leads to better generalization on average than all the methods that we compare against.
arXiv Detail & Related papers (2022-08-19T19:42:41Z) - Information-Theoretic Generalization Bounds for Iterative
Semi-Supervised Learning [81.1071978288003]
In particular, we seek to understand the behaviour of the generalization error of iterative SSL algorithms using information-theoretic principles.
Our theoretical results suggest that when the class conditional variances are not too large, the upper bound on the generalization error decreases monotonically with the number of iterations, but quickly saturates.
arXiv Detail & Related papers (2021-10-03T05:38:49Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Squared $\ell_2$ Norm as Consistency Loss for Leveraging Augmented Data
to Learn Robust and Invariant Representations [76.85274970052762]
Regularizing distance between embeddings/representations of original samples and augmented counterparts is a popular technique for improving robustness of neural networks.
In this paper, we explore these various regularization choices, seeking to provide a general understanding of how we should regularize the embeddings.
We show that the generic approach we identified (squared $\ell_2$ regularized augmentation) outperforms several recent methods, which are each specially designed for one task; a minimal sketch of this consistency loss appears after this list.
arXiv Detail & Related papers (2020-11-25T22:40:09Z) - Benign overfitting in ridge regression [0.0]
We provide non-asymptotic generalization bounds for overparametrized ridge regression.
We identify when small or negative regularization is sufficient for obtaining small generalization error.
arXiv Detail & Related papers (2020-09-29T20:00:31Z)
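As referenced in the squared $\ell_2$ consistency-loss entry above, here is a minimal sketch (not that paper's implementation) of the general idea: add the squared $\ell_2$ distance between the embeddings of a sample and of its augmented counterpart to the task loss. `encoder`, `classifier`, and `augment` are placeholder callables introduced only for illustration.

```python
# Minimal sketch of a squared l2 consistency loss (illustrative, not the
# paper's code): penalize the squared distance between the embedding of a
# sample and the embedding of its augmented counterpart, alongside the task loss.
import torch.nn as nn

def consistency_loss_step(encoder, classifier, augment, x, y, lam=1.0):
    z_clean = encoder(x)                    # embeddings of the original batch
    z_aug = encoder(augment(x))             # embeddings of the augmented batch
    task_loss = nn.functional.cross_entropy(classifier(z_clean), y)
    # squared l2 distance between the two embeddings, averaged over the batch
    consistency = (z_clean - z_aug).pow(2).sum(dim=1).mean()
    return task_loss + lam * consistency
```

The weight `lam` trades off task accuracy against invariance to the chosen augmentation.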