Grokking in Linear Estimators -- A Solvable Model that Groks without Understanding
- URL: http://arxiv.org/abs/2310.16441v1
- Date: Wed, 25 Oct 2023 08:08:44 GMT
- Title: Grokking in Linear Estimators -- A Solvable Model that Groks without Understanding
- Authors: Noam Levi and Alon Beck and Yohai Bar-Sinai
- Abstract summary: Grokking is the phenomenon where a model learns to generalize long after it has fit the training data.
We show analytically and numerically that grokking can surprisingly occur in linear networks performing linear tasks.
- Score: 1.1510009152620668
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Grokking is the intriguing phenomenon where a model learns to generalize long
after it has fit the training data. We show both analytically and numerically
that grokking can surprisingly occur in linear networks performing linear tasks
in a simple teacher-student setup with Gaussian inputs. In this setting, the
full training dynamics is derived in terms of the training and generalization
data covariance matrix. We present exact predictions on how the grokking time
depends on input and output dimensionality, train sample size, regularization,
and network initialization. We demonstrate that the sharp increase in
generalization accuracy may not imply a transition from "memorization" to
"understanding", but can simply be an artifact of the accuracy measure. We
provide empirical verification for our calculations, along with preliminary
results indicating that some predictions also hold for deeper networks, with
non-linear activations.
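To make the setting concrete, here is a minimal numerical sketch (Python/NumPy; not the authors' code, and with purely illustrative hyperparameters) of a linear teacher-student experiment with Gaussian inputs. A thresholded accuracy saturates on the training set long before it does on the test set, while the test loss itself decays smoothly, illustrating how a sharp accuracy jump can arise from the accuracy measure alone.

```python
# Minimal sketch (not the authors' code): a linear student learning a
# noiseless linear teacher from Gaussian inputs by full-batch gradient
# descent, starting from a large random initialization. Thresholded
# "accuracy" on the training set saturates well before test accuracy does,
# and the test MSE decays smoothly even though the test accuracy jumps
# sharply once the error crosses the threshold. All hyperparameters are
# illustrative choices, not values from the paper.
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test = 100, 120, 2000   # input dim, train / test sample sizes
lr, steps, eps = 0.2, 10_001, 0.1     # step size, GD iterations, accuracy threshold

w_teacher = rng.normal(size=d)
w_teacher /= np.linalg.norm(w_teacher)           # unit-norm teacher
X_tr, X_te = rng.normal(size=(n_train, d)), rng.normal(size=(n_test, d))
y_tr, y_te = X_tr @ w_teacher, X_te @ w_teacher  # noiseless labels

w = rng.normal(size=d)   # "large" initialization: |w| ~ sqrt(d) >> |w_teacher|

for t in range(steps):
    grad = X_tr.T @ (X_tr @ w - y_tr) / n_train  # gradient of the train MSE
    w -= lr * grad
    if t % 500 == 0:
        tr_acc = np.mean(np.abs(X_tr @ w - y_tr) < eps)
        te_err = X_te @ w - y_te
        te_acc, te_mse = np.mean(np.abs(te_err) < eps), np.mean(te_err**2)
        print(f"step {t:5d}  train_acc {tr_acc:.3f}  "
              f"test_acc {te_acc:.3f}  test_mse {te_mse:.2e}")
```

Sweeping the train sample size, the initialization scale, or adding weight decay to the update shifts when the test accuracy jumps, loosely echoing the dependencies on sample size, initialization, and regularization that the paper treats exactly.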
Related papers
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
- Grokking at the Edge of Linear Separability [1.024113475677323]
We analyze the long-time dynamics of logistic classification on a random feature model with a constant label.
We find that Grokking is amplified when classification is applied to training sets which are on the verge of linear separability.
arXiv Detail & Related papers (2024-10-06T14:08:42Z)
- Bayes' Power for Explaining In-Context Learning Generalizations [46.17844703369127]
In this paper, we argue that a more useful interpretation of neural network behavior in this era is as an approximation of the true posterior.
We show how models become robust in-context learners by effectively composing knowledge from their training data.
arXiv Detail & Related papers (2024-10-02T14:01:34Z)
- Understanding Grokking Through A Robustness Viewpoint [3.23379981095083]
We show that the popular $l_2$ norm (metric) of the neural network is actually a sufficient condition for grokking.
We propose new metrics based on robustness and information theory and find that our new metrics correlate well with the grokking phenomenon and may be used to predict grokking.
arXiv Detail & Related papers (2023-11-11T15:45:44Z)
- Benign Overfitting and Grokking in ReLU Networks for XOR Cluster Data [42.870635753205185]
Neural networks trained by gradient descent (GD) have exhibited a number of surprising generalization behaviors.
We show that both of these phenomena provably occur in two-layer ReLU networks trained by GD on XOR cluster data.
At a later training step, the network achieves near-optimal test accuracy while still fitting the random labels in the training data, exhibiting a "grokking" phenomenon.
arXiv Detail & Related papers (2023-10-04T02:50:34Z)
- Neural networks trained with SGD learn distributions of increasing complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics, and exploit higher-order statistics only later during training.
We discuss the relation of this distributional simplicity bias (DSB) to other simplicity biases and consider its implications for the principle of universality in learning.
arXiv Detail & Related papers (2022-11-21T15:27:22Z)
- Grokking phase transitions in learning local rules with gradient descent [0.0]
We show that grokking is a phase transition and find exact analytic expressions for the critical exponents, grokking probability, and grokking time distribution.
We numerically analyse the connection between structure formation and grokking.
arXiv Detail & Related papers (2022-10-26T11:07:04Z)
- Reasoning-Modulated Representations [85.08205744191078]
We study a common setting where the task is not purely opaque, in that partial information about the underlying system is available.
Our approach paves the way for a new class of data-efficient representation learning.
arXiv Detail & Related papers (2021-07-19T13:57:13Z)
- Learning Invariances in Neural Networks [51.20867785006147]
We show how to parameterize a distribution over augmentations and optimize the training loss simultaneously with respect to the network parameters and augmentation parameters.
We can recover the correct set and extent of invariances on image classification, regression, segmentation, and molecular property prediction from a large space of augmentations.
arXiv Detail & Related papers (2020-10-22T17:18:48Z)
- Theoretical Analysis of Self-Training with Deep Networks on Unlabeled Data [48.4779912667317]
Self-training algorithms have been very successful for learning with unlabeled data using neural networks.
This work provides a unified theoretical analysis of self-training with deep networks for semi-supervised learning, unsupervised domain adaptation, and unsupervised learning.
arXiv Detail & Related papers (2020-10-07T19:43:55Z)
- Semi-Supervised Learning with Normalizing Flows [54.376602201489995]
FlowGMM is an end-to-end approach to generative semi-supervised learning with normalizing flows.
We show promising results on a wide range of applications, including AG-News and Yahoo Answers text data.
arXiv Detail & Related papers (2019-12-30T17:36:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.