Grokking: Generalization Beyond Overfitting on Small Algorithmic
Datasets
- URL: http://arxiv.org/abs/2201.02177v1
- Date: Thu, 6 Jan 2022 18:43:37 GMT
- Title: Grokking: Generalization Beyond Overfitting on Small Algorithmic
Datasets
- Authors: Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, Vedant
Misra
- Abstract summary: We study generalization of neural networks on small algorithmically generated datasets.
We show that neural networks learn through a process of "grokking" a pattern in the data.
We argue that these datasets provide a fertile ground for studying a poorly understood aspect of deep learning.
- Score: 4.278591555984394
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we propose to study generalization of neural networks on small
algorithmically generated datasets. In this setting, questions about data
efficiency, memorization, generalization, and speed of learning can be studied
in great detail. In some situations we show that neural networks learn through
a process of "grokking" a pattern in the data, improving generalization
performance from random chance level to perfect generalization, and that this
improvement in generalization can happen well past the point of overfitting. We
also study generalization as a function of dataset size and find that smaller
datasets require increasing amounts of optimization for generalization. We
argue that these datasets provide a fertile ground for studying a poorly
understood aspect of deep learning: generalization of overparametrized neural
networks beyond memorization of the finite training dataset.
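To make "small algorithmically generated datasets" concrete, here is a minimal sketch of how one such dataset could be built and split. The abstract does not name a specific operation, so the choice of modular addition, the modulus `p=97`, the 50% training fraction, and the helper name `make_modular_dataset` are all illustrative assumptions rather than the authors' exact setup.

```python
import random


def make_modular_dataset(p=97, train_fraction=0.5, seed=0):
    """Enumerate every equation a + b = c (mod p) and split it at random.

    Hypothetical helper: p, train_fraction, and seed are assumed values
    chosen for illustration only.
    """
    # The full dataset is the complete p x p operation table.
    equations = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]
    rng = random.Random(seed)
    rng.shuffle(equations)
    n_train = int(train_fraction * len(equations))
    # Held-out equations never appear in training, so validation accuracy
    # measures generalization rather than memorization.
    return equations[:n_train], equations[n_train:]


train, val = make_modular_dataset()
print(len(train), len(val))  # 4704 4705
```

Because the whole table has only 97 × 97 = 9409 entries, a network can memorize the training half quickly; the held-out half is what reveals whether it has learned the underlying rule.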
Related papers
- Learning from Limited and Imperfect Data [6.30667368422346]
We develop practical algorithms for Deep Neural Networks that can learn from limited and imperfect data present in the real world.
These works are divided into four segments, each covering a scenario of learning from limited or imperfect data.
arXiv Detail & Related papers (2024-11-11T18:48:31Z)
- Generalizability of Memorization Neural Networks [13.144557876007358]
Memorization is widely believed to have a close relationship with the strong generalizability of deep learning.
We show that, in order for the memorization networks to be generalizable, the width of the network must be at least equal to the dimension of the data.
It is also shown that there exist data distributions such that, to be generalizable for them, the memorization network must have an exponential number of parameters in the data dimension.
arXiv Detail & Related papers (2024-11-01T05:18:46Z)
- Explaining grokking through circuit efficiency [4.686548060335767]
Grokking is the phenomenon in which a network with perfect training accuracy but poor generalisation transitions, upon further training, to perfect generalisation.
We make and confirm four novel predictions about grokking, providing significant evidence in favour of our explanation.
We demonstrate two novel and surprising behaviours: ungrokking, in which a network regresses from perfect to low test accuracy, and semi-grokking, in which a network shows delayed generalisation to partial rather than perfect test accuracy.
arXiv Detail & Related papers (2023-09-05T17:00:24Z)
- Deep networks for system identification: a Survey [56.34005280792013]
System identification learns mathematical descriptions of dynamic systems from input-output data.
The main aim of the identified model is to predict new data from previous observations.
We discuss architectures commonly adopted in the literature, like feedforward, convolutional, and recurrent networks.
arXiv Detail & Related papers (2023-01-30T12:38:31Z)
- Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z)
- A Survey of Learning on Small Data: Generalization, Optimization, and Challenge [101.27154181792567]
Learning on small data that approximates the generalization ability of big data is one of the ultimate purposes of AI.
This survey follows the active sampling theory under a PAC framework to analyze the generalization error and label complexity of learning on small data.
Multiple data applications that may benefit from efficient small data representation are surveyed.
arXiv Detail & Related papers (2022-07-29T02:34:19Z)
- Inducing Gaussian Process Networks [80.40892394020797]
We propose inducing Gaussian process networks (IGN), a simple framework for simultaneously learning the feature space as well as the inducing points.
The inducing points, in particular, are learned directly in the feature space, enabling a seamless representation of complex structured domains.
We report on experimental results for real-world data sets showing that IGNs provide significant advances over state-of-the-art methods.
arXiv Detail & Related papers (2022-04-21T05:27:09Z)
- Exploiting Explainable Metrics for Augmented SGD [43.00691899858408]
There are several unanswered questions about how learning under optimization really works and why certain strategies are better than others.
We propose new explainability metrics that measure the redundant information in a network's layers.
We then exploit these metrics to augment Stochastic Gradient Descent (SGD) by adaptively adjusting the learning rate in each layer to improve generalization performance.
arXiv Detail & Related papers (2022-03-31T00:16:44Z)
- Reasoning-Modulated Representations [85.08205744191078]
We study a common setting where our task is not purely opaque.
Our approach paves the way for a new class of data-efficient representation learning.
arXiv Detail & Related papers (2021-07-19T13:57:13Z)
- A neural anisotropic view of underspecification in deep learning [60.119023683371736]
We show that the way neural networks handle the underspecification of problems is highly dependent on the data representation.
Our results highlight that understanding the architectural inductive bias in deep learning is fundamental to address the fairness, robustness, and generalization of these systems.
arXiv Detail & Related papers (2021-04-29T14:31:09Z)
- Feature space approximation for kernel-based supervised learning [2.653409741248232]
The goal is to reduce the size of the training data, resulting in lower storage consumption and computational complexity.
We demonstrate significant improvements in comparison to the computation of data-driven predictions involving the full training data set.
The method is applied to classification and regression problems from different application areas such as image recognition, system identification, and oceanographic time series analysis.
arXiv Detail & Related papers (2020-11-25T11:23:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.