Grokking phase transitions in learning local rules with gradient descent
- URL: http://arxiv.org/abs/2210.15435v1
- Date: Wed, 26 Oct 2022 11:07:04 GMT
- Title: Grokking phase transitions in learning local rules with gradient descent
- Authors: Bojan Žunkovič, Enej Ilievski
- Abstract summary: We show that grokking is a phase transition and find exact analytic expressions for the critical exponents, grokking probability, and grokking time distribution.
We numerically analyse the connection between structure formation and grokking.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We discuss two solvable grokking (generalisation beyond overfitting) models
in a rule learning scenario. We show that grokking is a phase transition and
find exact analytic expressions for the critical exponents, grokking
probability, and grokking time distribution. Further, we introduce a
tensor-network map that connects the proposed grokking setup with the standard
(perceptron) statistical learning theory and show that grokking is a
consequence of the locality of the teacher model. As an example, we analyse the
cellular automata learning task, numerically determine the critical exponent
and the grokking time distributions and compare them with the prediction of the
proposed grokking model. Finally, we numerically analyse the connection between
structure formation and grokking.
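To make the rule-learning setup concrete, below is a minimal sketch of learning a local teacher rule with gradient descent while recording the gap between fitting the training set and generalising (the grokking time). The majority-vote rule, perceptron student, and all sizes and hyperparameters are illustrative assumptions, not the authors' exact cellular-automata setup.

```python
import numpy as np

rng = np.random.default_rng(0)
WINDOW = 7  # locality of the teacher rule (assumed)

def teacher(windows):
    # Majority vote over a local window of +/-1 spins (an assumed local rule).
    return np.sign(windows.sum(axis=1))

def make_data(n):
    x = rng.choice([-1.0, 1.0], size=(n, WINDOW))
    return x, teacher(x)

x_train, y_train = make_data(30)      # small training set: overfitting regime
x_test, y_test = make_data(4000)

w = rng.normal(scale=0.1, size=WINDOW)   # student perceptron weights
lr, t_fit, t_gen = 0.05, None, None
for step in range(50000):
    margins = y_train * (x_train @ w)
    # gradient of the mean logistic loss log(1 + exp(-margin))
    grad = -(y_train / (1.0 + np.exp(margins))) @ x_train / len(y_train)
    w -= lr * grad
    if t_fit is None and np.all(np.sign(x_train @ w) == y_train):
        t_fit = step
    if t_gen is None and np.mean(np.sign(x_test @ w) == y_test) > 0.99:
        t_gen = step
        break
print(f"fit training set at step {t_fit}, generalised at step {t_gen}")
```

Because the teacher depends only on a small window, a matching local student can close the train-test gap once it aligns with the rule; the recorded step difference is the quantity whose distribution the paper characterises.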
Related papers
- Grokking at the Edge of Linear Separability [1.024113475677323]
We analyze the long-time dynamics of logistic classification on a random feature model with a constant label.
We find that grokking is amplified when classification is applied to training sets on the verge of linear separability.
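As a concrete probe of that regime, the hedged sketch below draws a Gaussian random-feature training set with a single constant label and checks strict linear separability from the origin via a feasibility LP; the dimension and the ratio n/d are illustrative assumptions (for Gaussian data the sharp threshold for this check sits near n/d = 2, by the classical Wendel/Cover counting argument).

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
d = 50
alpha = 2.0                        # n/d ratio, chosen near the separability edge
n = int(alpha * d)
X = rng.normal(size=(n, d)) / np.sqrt(d)

# Feasibility LP: does any w satisfy X @ w >= 1 for all samples?
# (all points share one label, so separability means a common positive margin)
res = linprog(c=np.zeros(d), A_ub=-X, b_ub=-np.ones(n),
              bounds=[(None, None)] * d, method="highs")
print("linearly separable:", res.success)
```

Repeating this over many draws near alpha = 2 yields a mix of separable and inseparable training sets, on which the long-time logistic dynamics differ.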
arXiv Detail & Related papers (2024-10-06T14:08:42Z)
- von Mises Quasi-Processes for Bayesian Circular Regression [57.88921637944379]
We explore a family of expressive and interpretable distributions over circle-valued random functions.
The resulting probability model has connections with continuous spin models in statistical physics.
For posterior inference, we introduce a new Stratonovich-like augmentation that lends itself to fast Markov Chain Monte Carlo sampling.
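For orientation only, here is a tiny illustration of the von Mises building block itself (not the paper's quasi-process construction or its Stratonovich-like augmentation): fitting a von Mises density to synthetic angular data with scipy.

```python
import numpy as np
from scipy.stats import vonmises

rng = np.random.default_rng(2)
# synthetic circle-valued observations around mean angle 0.5 rad
theta = vonmises.rvs(kappa=4.0, loc=0.5, size=500, random_state=rng)

# fit returns (kappa, loc, scale); scale is pinned to 1 for circular data
kappa_hat, loc_hat, _ = vonmises.fit(theta, fscale=1)
print(f"estimated concentration ~ {kappa_hat:.2f}, mean angle ~ {loc_hat:.2f} rad")
```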
arXiv Detail & Related papers (2024-06-19T01:57:21Z)
- Understanding Grokking Through A Robustness Viewpoint [3.23379981095083]
We show that the popular $l_2$ norm (metric) of the neural network is actually a sufficient condition for grokking.
We propose new metrics based on robustness and information theory and find that our new metrics correlate well with the grokking phenomenon and may be used to predict grokking.
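A hedged sketch of what tracking such candidate predictors during training could look like: the global $l_2$ weight norm plus a simple noise-robustness probe. The linear toy model, the noise scale, and the probe itself are placeholder assumptions, not the paper's proposed robustness or information-theoretic metrics.

```python
import numpy as np

rng = np.random.default_rng(3)

def l2_norm(params):
    # global l2 norm over an iterable of parameter arrays
    return np.sqrt(sum(float(np.sum(p ** 2)) for p in params))

def noise_robustness(w, x, y, sigma=0.3, trials=10):
    # accuracy of a linear classifier under Gaussian input perturbations
    accs = [np.mean(np.sign((x + sigma * rng.normal(size=x.shape)) @ w) == y)
            for _ in range(trials)]
    return float(np.mean(accs))

# toy usage with a fixed linear predictor standing in for a checkpoint
x = rng.choice([-1.0, 1.0], size=(500, 16))
w = rng.normal(size=16)
y = np.sign(x @ w)
print("l2 norm:", round(l2_norm([w]), 3),
      "| robust accuracy:", round(noise_robustness(w, x, y), 3))
```

Logging both quantities per checkpoint, then checking which one moves before test accuracy does, is the kind of comparison the paper's claim invites.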
arXiv Detail & Related papers (2023-11-11T15:45:44Z)
- Grokking in Linear Estimators -- A Solvable Model that Groks without Understanding [1.1510009152620668]
Grokking is the phenomenon in which a model learns to generalize long after it has fit the training data.
We show analytically and numerically that grokking can surprisingly occur in linear networks performing linear tasks.
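The following runnable toy (an assumed setup, not the paper's exact model) shows one way the delay can appear in a purely linear estimator: under full-batch gradient descent the training loss weights error directions by their feature variance, while parameter-recovery error weights them equally, so low-variance directions keep the generalisation metric high long after the training loss looks converged, especially from a large initialisation.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 40, 60
# anisotropic features: some directions carry very little training signal
X = rng.normal(size=(n, d)) * np.linspace(1.0, 0.05, d)
w_star = rng.normal(size=d)
y = X @ w_star                      # noiseless linear teacher

w = 5.0 * rng.normal(size=d)        # large initialisation scale
lr = 0.5 / np.linalg.eigvalsh(X.T @ X / n).max()
t_fit = t_gen = None
for t in range(200000):
    w -= lr * (X.T @ (X @ w - y)) / n
    if t_fit is None and np.mean((X @ w - y) ** 2) < 1e-4:
        t_fit = t                   # training loss converged
    if t_gen is None and np.mean((w - w_star) ** 2) < 1e-4:
        t_gen = t                   # parameters (and hence test risk) recovered
        break
print(f"train loss < 1e-4 at step {t_fit}; recovery < 1e-4 at step {t_gen}")
```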
arXiv Detail & Related papers (2023-10-25T08:08:44Z)
- Grokking as a First Order Phase Transition in Two Layer Networks [4.096453902709292]
A key property of deep neural networks (DNNs) is their ability to learn new features during training.
Grokking is also believed to be a feature-learning phenomenon beyond the lazy-learning/Gaussian-process regime.
We show that after Grokking, the state of the DNN is analogous to the mixed phase following a first-order phase transition.
arXiv Detail & Related papers (2023-10-05T18:00:01Z)
- Numerically assisted determination of local models in network scenarios [55.2480439325792]
We develop a numerical tool for finding explicit local models that reproduce a given statistical behaviour.
We provide conjectures for the critical visibilities of the Greenberger-Horne-Zeilinger (GHZ) and W distributions.
The developed codes and documentation are publicly available at github.com/mariofilho/localmodels.
arXiv Detail & Related papers (2023-03-17T13:24:04Z)
- Bayesian Structure Learning with Generative Flow Networks [85.84396514570373]
In Bayesian structure learning, we are interested in inferring a distribution over directed acyclic graphs (DAGs) from data.
Recently, a class of probabilistic models, called Generative Flow Networks (GFlowNets), have been introduced as a general framework for generative modeling.
We show that our approach, called DAG-GFlowNet, provides an accurate approximation of the posterior over DAGs.
arXiv Detail & Related papers (2022-02-28T15:53:10Z)
- Topographic VAEs learn Equivariant Capsules [84.33745072274942]
We introduce the Topographic VAE: a novel method for efficiently training deep generative models with topographically organized latent variables.
We show that such a model indeed learns to organize its activations according to salient characteristics such as digit class, width, and style on MNIST.
We demonstrate approximate equivariance to complex transformations, expanding upon the capabilities of existing group equivariant neural networks.
arXiv Detail & Related papers (2021-09-03T09:25:57Z)
- Why do classifier accuracies show linear trends under distribution shift? [58.40438263312526]
Accuracies of models on one data distribution are approximately linear functions of their accuracies on another distribution.
We assume the probability that two models agree in their predictions is higher than what we can infer from their accuracy levels alone.
We show that a linear trend must occur when evaluating models on two distributions unless the size of the distribution shift is large.
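A small synthetic sketch of that agreement ("dominance") intuition: if each model is correct exactly when its skill exceeds an example's difficulty, any two models agree more often than independence would predict, and a multiplicative difficulty shift produces an exactly linear accuracy-versus-accuracy trend. All quantities below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20000
skill = np.linspace(0.3, 0.9, 15)        # one scalar skill per model
diff_a = rng.uniform(0.0, 1.0, size=n)   # example difficulty, distribution A
diff_b = 1.25 * diff_a                   # distribution B: uniformly harder

# model correct iff skill > difficulty => strong pairwise agreement
acc_a = np.array([(s > diff_a).mean() for s in skill])
acc_b = np.array([(s > diff_b).mean() for s in skill])
slope, intercept = np.polyfit(acc_a, acc_b, deg=1)
print(f"acc_B ~ {slope:.2f} * acc_A + {intercept:+.2f}")
```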
arXiv Detail & Related papers (2020-12-31T07:24:30Z)
- Evaluation of Local Explanation Methods for Multivariate Time Series Forecasting [0.21094707683348418]
Local interpretability is important in determining why a model makes particular predictions.
Despite the recent focus on AI interpretability, there has been a lack of research in local interpretability methods for time series forecasting.
arXiv Detail & Related papers (2020-09-18T21:15:28Z)
- Block-Approximated Exponential Random Graphs [77.4792558024487]
An important challenge in the field of exponential random graphs (ERGs) is the fitting of non-trivial ERGs on large graphs.
We propose an approximation framework for such non-trivial ERGs that results in dyadic-independence (i.e., edge-independent) distributions.
Our methods are scalable to sparse graphs consisting of millions of nodes.
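One reason dyadic independence matters in practice, sketched below: once the fitted model reduces to per-edge probabilities p_ij, sampling a graph is just independent Bernoulli draws over dyads. The rank-one p_ij here is an illustrative stand-in for the paper's block approximation.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
theta = rng.uniform(0.01, 0.2, size=n)         # per-node propensities (assumed)
P = np.clip(np.outer(theta, theta), 0.0, 1.0)  # edge probabilities p_ij

# independent Bernoulli draw per dyad; symmetrise, drop self-loops
upper = np.triu(rng.random((n, n)) < P, k=1)
A = upper | upper.T
print("sampled edges:", int(A.sum()) // 2)
```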
arXiv Detail & Related papers (2020-02-14T11:42:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.