Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking
- URL: http://arxiv.org/abs/2311.18817v2
- Date: Tue, 2 Apr 2024 05:43:18 GMT
- Title: Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking
- Authors: Kaifeng Lyu, Jikai Jin, Zhiyuan Li, Simon S. Du, Jason D. Lee, Wei Hu
- Abstract summary: Recent work by Power et al. highlighted a surprising "grokking" phenomenon in learning arithmetic tasks.
A neural net first "memorizes" the training set, resulting in perfect training accuracy but near-random test accuracy, and after training for sufficiently longer, it suddenly transitions to perfect test accuracy.
This paper studies the grokking phenomenon in theoretical setups and shows that it can be induced by a dichotomy of early and late phase implicit biases.
- Score: 81.57031092474625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work by Power et al. (2022) highlighted a surprising "grokking" phenomenon in learning arithmetic tasks: a neural net first "memorizes" the training set, resulting in perfect training accuracy but near-random test accuracy, and after training for sufficiently longer, it suddenly transitions to perfect test accuracy. This paper studies the grokking phenomenon in theoretical setups and shows that it can be induced by a dichotomy of early and late phase implicit biases. Specifically, when training homogeneous neural nets with large initialization and small weight decay on both classification and regression tasks, we prove that the training process gets trapped at a solution corresponding to a kernel predictor for a long time, and then a very sharp transition to min-norm/max-margin predictors occurs, leading to a dramatic change in test accuracy.
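The training regime named in the abstract (a homogeneous model, large initialization, small weight decay) can be set up in a few lines. The following is a minimal sketch, not the paper's code: it assumes a two-layer diagonal linear net f(x) = <u*u - v*v, x> as one simple homogeneous model, a synthetic sparse regression task, and hypothetical names and hyperparameters (`make_sparse_regression`, `train_diag_linear_net`, `alpha`, `weight_decay`). It only runs gradient descent and records the training loss; it makes no claim to reproduce the paper's results.

```python
import numpy as np

def make_sparse_regression(n=20, d=50, k=3, seed=0):
    """Synthetic data (an assumption for illustration): y = <w_star, x> with k-sparse w_star."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    w_star = np.zeros(d)
    w_star[:k] = 1.0
    return X, X @ w_star, w_star

def train_diag_linear_net(X, y, alpha=2.0, weight_decay=1e-4, lr=1e-3, steps=2000):
    """Gradient descent on 0.5 * MSE + (weight_decay / 2) * (|u|^2 + |v|^2)
    for the degree-2 homogeneous model f(x) = <u*u - v*v, x>.
    alpha is the initialization scale; u == v makes the output zero at init."""
    n, d = X.shape
    u = np.full(d, float(alpha))
    v = np.full(d, float(alpha))
    history = []
    for _ in range(steps):
        w = u * u - v * v                 # effective linear predictor
        resid = X @ w - y
        history.append(0.5 * np.mean(resid ** 2))
        g_w = X.T @ resid / n             # gradient of the data loss w.r.t. w
        g_u = 2.0 * u * g_w + weight_decay * u   # chain rule plus weight decay
        g_v = -2.0 * v * g_w + weight_decay * v
        u = u - lr * g_u
        v = v - lr * g_v
    return u * u - v * v, history

if __name__ == "__main__":
    X, y, w_star = make_sparse_regression()
    w, history = train_diag_linear_net(X, y)
    print("initial train loss:", history[0])
    print("final train loss:  ", history[-1])
```

To look for a delayed transition of the kind the abstract describes, one would also track loss on held-out data and train for far more steps, since a small `weight_decay` acts only very slowly relative to the initial fit.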
Related papers
- Tune without Validation: Searching for Learning Rate and Weight Decay on Training Sets [0.0]
Tune without validation (Twin) is a pipeline for tuning learning rate and weight decay.
We run extensive experiments on 20 image classification datasets and train several families of deep networks.
We demonstrate proper HP selection when training from scratch and fine-tuning, emphasizing small-sample scenarios.
arXiv Detail & Related papers (2024-03-08T18:57:00Z)
- Grokking in Linear Estimators -- A Solvable Model that Groks without Understanding [1.1510009152620668]
Grokking is where a model learns to generalize long after it has fit the training data.
We show analytically and numerically that grokking can surprisingly occur in linear networks performing linear tasks.
arXiv Detail & Related papers (2023-10-25T08:08:44Z)
- Benign Overfitting and Grokking in ReLU Networks for XOR Cluster Data [42.870635753205185]
Neural networks trained by gradient descent (GD) have exhibited a number of surprising generalization behaviors.
We show that both benign overfitting and grokking provably occur in two-layer ReLU networks trained by GD on XOR cluster data.
At a later training step, the network achieves near-optimal test accuracy while still fitting the random labels in the training data, exhibiting a "grokking" phenomenon.
arXiv Detail & Related papers (2023-10-04T02:50:34Z)
- Small-scale proxies for large-scale Transformer training instabilities [69.36381318171338]
We seek ways to reproduce and study training stability and instability at smaller scales.
By measuring the relationship between learning rate and loss across scales, we show that these instabilities also appear in small models when training at high learning rates.
We study methods such as warm-up, weight decay, and the $\mu$Param to train small models that achieve similar losses across orders of magnitude of learning rate variation.
arXiv Detail & Related papers (2023-09-25T17:48:51Z)
- Theoretical Characterization of How Neural Network Pruning Affects its Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
arXiv Detail & Related papers (2023-01-01T03:10:45Z)
- Benign Overfitting in Two-layer Convolutional Neural Networks [90.75603889605043]
We study the benign overfitting phenomenon in training a two-layer convolutional neural network (CNN).
We show that when the signal-to-noise ratio satisfies a certain condition, a two-layer CNN trained by gradient descent can achieve arbitrarily small training and test loss.
On the other hand, when this condition does not hold, overfitting becomes harmful and the obtained CNN can only achieve constant level test loss.
arXiv Detail & Related papers (2022-02-14T07:45:51Z)
- Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z)
- When and how epochwise double descent happens [7.512375012141203]
An "epochwise double descent" effect exists in which the generalization error initially drops, then rises, and finally drops again with increasing training time.
This presents a practical problem in that the amount of time required for training is long, and early stopping based on validation performance may result in suboptimal generalization.
We show that epochwise double descent requires a critical amount of noise to occur, but above a second critical noise level early stopping remains effective.
arXiv Detail & Related papers (2021-08-26T19:19:17Z)
- Regularizing Class-wise Predictions via Self-knowledge Distillation [80.76254453115766]
We propose a new regularization method that penalizes the predictive distribution between similar samples.
This results in regularizing the dark knowledge (i.e., the knowledge on wrong predictions) of a single network.
Our experimental results on various image classification tasks demonstrate that the simple yet powerful method can significantly improve the generalization ability.
arXiv Detail & Related papers (2020-03-31T06:03:51Z)