Explaining grokking through circuit efficiency
- URL: http://arxiv.org/abs/2309.02390v1
- Date: Tue, 5 Sep 2023 17:00:24 GMT
- Title: Explaining grokking through circuit efficiency
- Authors: Vikrant Varma, Rohin Shah, Zachary Kenton, János Kramár, Ramana Kumar
- Abstract summary: Grokking is the phenomenon in which a network with perfect training accuracy but poor generalisation will, upon further training, transition to perfect generalisation.
We make and confirm four novel predictions about grokking, providing significant evidence in favour of our explanation.
We demonstrate two novel and surprising behaviours: ungrokking, in which a network regresses from perfect to low test accuracy, and semi-grokking, in which a network shows delayed generalisation to partial rather than perfect test accuracy.
- Score: 4.686548060335767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the most surprising puzzles in neural network generalisation is
grokking: a network with perfect training accuracy but poor generalisation
will, upon further training, transition to perfect generalisation. We propose
that grokking occurs when the task admits a generalising solution and a
memorising solution, where the generalising solution is slower to learn but
more efficient, producing larger logits with the same parameter norm. We
hypothesise that memorising circuits become more inefficient with larger
training datasets while generalising circuits do not, suggesting there is a
critical dataset size at which memorisation and generalisation are equally
efficient. We make and confirm four novel predictions about grokking, providing
significant evidence in favour of our explanation. Most strikingly, we
demonstrate two novel and surprising behaviours: ungrokking, in which a network
regresses from perfect to low test accuracy, and semi-grokking, in which a
network shows delayed generalisation to partial rather than perfect test
accuracy.
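The key quantities in this explanation (logit size per unit of parameter norm, weight decay, and the dataset-size dependence of memorisation efficiency) can be made concrete with a toy numerical sketch. The Python snippet below is an illustration under stated assumptions, not the paper's setup: each circuit is collapsed to a single scale parameter, its efficiency is defined as the correct-class logit it produces per unit of norm, and the memorising circuit's efficiency is assumed to decay with training-set size; the class count, weight-decay strength, and decay law are all invented for the example.

```python
import numpy as np

# Toy sketch (assumptions throughout; not the paper's experimental setup).
# A "circuit" is collapsed to one scale: it produces a correct-class logit of
# efficiency * parameter_norm, so efficiency = logit size per unit of norm.
def regularised_loss(norm, efficiency, num_classes=100, weight_decay=1e-3):
    """Cross-entropy on a single example plus an L2 penalty on the norm."""
    logits = np.zeros(num_classes)
    logits[0] = efficiency * norm                       # class 0 is correct
    log_prob_correct = logits[0] - np.log(np.exp(logits).sum())
    return -log_prob_correct + weight_decay * norm**2

def best_loss(efficiency, norms=np.linspace(0.01, 50, 2000)):
    """Lowest regularised loss a circuit of a given efficiency can reach."""
    return min(regularised_loss(n, efficiency) for n in norms)

# Assumed efficiency curves: constant for the generalising circuit, decaying
# with training-set size for the memorising circuit (it must store more data).
gen_eff = 1.0
for train_size in [100, 300, 1000, 3000]:
    mem_eff = 30.0 / train_size ** 0.5                  # illustrative decay law
    winner = "generalising" if best_loss(gen_eff) < best_loss(mem_eff) else "memorising"
    print(f"train_size={train_size:4d}  mem_eff={mem_eff:.2f}  preferred: {winner}")
```

Under these assumptions the preferred circuit flips from memorising to generalising once the memorising efficiency drops below the generalising one, which is the crossover the abstract refers to as the critical dataset size.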
Related papers
- Exact, Tractable Gauss-Newton Optimization in Deep Reversible Architectures Reveal Poor Generalization [52.16435732772263]
Second-order optimization has been shown to accelerate the training of deep neural networks in many applications.
However, generalization properties of second-order methods are still being debated.
We show for the first time that exact Gauss-Newton (GN) updates take on a tractable form in a class of deep architectures.
arXiv Detail & Related papers (2024-11-12T17:58:40Z) - Bridging Lottery ticket and Grokking: Is Weight Norm Sufficient to Explain Delayed Generalization? [27.020990219204343]
We aim to analyze the mechanism of grokking from the lottery ticket hypothesis.
We refer to these subnetworks as "Grokking tickets".
We show that the lottery tickets drastically accelerate grokking compared to the dense networks.
arXiv Detail & Related papers (2023-10-30T11:58:44Z) - To grok or not to grok: Disentangling generalization and memorization on
corrupted algorithmic datasets [5.854190253899593]
We study an interpretable model where generalizing representations are understood analytically, and are easily distinguishable from the memorizing ones.
We show that (i) it is possible for the network to memorize the corrupted labels and achieve 100% generalization at the same time.
We also show that in the presence of regularization, the training dynamics involves two consecutive stages.
arXiv Detail & Related papers (2023-10-19T18:01:10Z) - The Double-Edged Sword of Implicit Bias: Generalization vs. Robustness
in ReLU Networks [64.12052498909105]
We study the implications of the implicit bias of gradient flow on generalization and adversarial robustness in ReLU networks.
In two-layer ReLU networks gradient flow is biased towards solutions that generalize well, but are highly vulnerable to adversarial examples.
arXiv Detail & Related papers (2023-03-02T18:14:35Z) - Theoretical Characterization of How Neural Network Pruning Affects its
Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
arXiv Detail & Related papers (2023-01-01T03:10:45Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - Towards Better Out-of-Distribution Generalization of Neural Algorithmic
Reasoning Tasks [51.8723187709964]
We study the OOD generalization of neural algorithmic reasoning tasks.
The goal is to learn an algorithm from input-output pairs using deep neural networks.
arXiv Detail & Related papers (2022-11-01T18:33:20Z) - Learning Non-Vacuous Generalization Bounds from Optimization [8.294831479902658]
We present a simple yet non-vacuous generalization bound from the optimization perspective.
We achieve this goal by leveraging the fact that the hypothesis set accessed by gradient algorithms is essentially fractal-like.
Numerical studies demonstrate that our approach is able to yield plausible generalization guarantees for modern neural networks.
arXiv Detail & Related papers (2022-06-09T08:59:46Z) - Grokking: Generalization Beyond Overfitting on Small Algorithmic
Datasets [4.278591555984394]
We study generalization of neural networks on small algorithmically generated datasets.
We show that neural networks learn through a process of "grokking" a pattern in the data.
We argue that these datasets provide a fertile ground for studying a poorly understood aspect of deep learning (a minimal sketch of such a dataset appears at the end of this list).
arXiv Detail & Related papers (2022-01-06T18:43:37Z) - Why Lottery Ticket Wins? A Theoretical Perspective of Sample Complexity
on Pruned Neural Networks [79.74580058178594]
We analyze the performance of training a pruned neural network by analyzing the geometric structure of the objective function.
We show that the convex region near a desirable model with guaranteed generalization enlarges as the neural network model is pruned.
arXiv Detail & Related papers (2021-10-12T01:11:07Z) - A Partial Regularization Method for Network Compression [0.0]
We propose partial regularization, penalizing only a subset of parameters rather than all of them (full regularization), to perform model compression at higher speed.
Experimental results show that, as expected, computational cost is reduced, with shorter running times observed in almost all situations.
Surprisingly, it also improves metrics such as regression fitting quality and classification accuracy in both the training and test phases on multiple datasets.
arXiv Detail & Related papers (2020-09-03T00:38:27Z)
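As a concrete illustration of the small algorithmically generated datasets that the grokking entries above study, the sketch below builds a modular-addition task and a train/test split. The operation, modulus, and train fraction are assumptions chosen for the example, not details taken from any of the papers.

```python
import itertools
import random

# Minimal sketch of a small algorithmic dataset: (a + b) mod p.
# The modulus and the 30% train fraction are illustrative assumptions.
p = 97
pairs = list(itertools.product(range(p), repeat=2))
dataset = [((a, b), (a + b) % p) for a, b in pairs]   # (input pair, label)

random.seed(0)
random.shuffle(dataset)
split = int(0.3 * len(dataset))                       # train on 30% of equations
train_set, test_set = dataset[:split], dataset[split:]

print(f"{len(train_set)} training examples, {len(test_set)} test examples")
# A network fit on train_set can reach perfect training accuracy long before
# its test accuracy improves; the delayed jump in test accuracy is "grokking".
```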