Progress Measures for Grokking on Real-world Tasks
- URL: http://arxiv.org/abs/2405.12755v2
- Date: Thu, 20 Jun 2024 07:39:05 GMT
- Title: Progress Measures for Grokking on Real-world Tasks
- Authors: Satvik Golechha,
- Abstract summary: Grokking is a phenomenon where machine learning models generalize long after overfitting.
This paper explores grokking in real-world datasets using deep neural networks for classification under the cross-entropy loss.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Grokking, a phenomenon where machine learning models generalize long after overfitting, has been primarily observed and studied in algorithmic tasks. This paper explores grokking in real-world datasets using deep neural networks for classification under the cross-entropy loss. We challenge the prevalent hypothesis that the $L_2$ norm of weights is the primary cause of grokking by demonstrating that grokking can occur outside the expected range of weight norms. To better understand grokking, we introduce three new progress measures: activation sparsity, absolute weight entropy, and approximate local circuit complexity. These measures are conceptually related to generalization and demonstrate a stronger correlation with grokking in real-world datasets compared to weight norms. Our findings suggest that while weight norms might usually correlate with grokking and our progress measures, they are not causative, and our proposed measures provide a better understanding of the dynamics of grokking.
Related papers
- Deep Grokking: Would Deep Neural Networks Generalize Better? [51.24007462968805]
Grokking refers to a sharp rise of the network's generalization accuracy on the test set.
We find that deep neural networks can be more susceptible to grokking than its shallower counterparts.
We also observe an intriguing multi-stage generalization phenomenon when increase the depth of the model.
arXiv Detail & Related papers (2024-05-29T19:05:11Z) - Understanding Grokking Through A Robustness Viewpoint [3.23379981095083]
We show that the popular $l$ norm (metric) of the neural network is actually a sufficient condition for grokking.
We propose new metrics based on robustness and information theory and find that our new metrics correlate well with the grokking phenomenon and may be used to predict grokking.
arXiv Detail & Related papers (2023-11-11T15:45:44Z) - Grokking in Linear Estimators -- A Solvable Model that Groks without
Understanding [1.1510009152620668]
Grokking is where a model learns to generalize long after it has fit the training data.
We show analytically and numerically that grokking can surprisingly occur in linear networks performing linear tasks.
arXiv Detail & Related papers (2023-10-25T08:08:44Z) - Interpolation can hurt robust generalization even when there is no noise [76.3492338989419]
We show that avoiding generalization through ridge regularization can significantly improve generalization even in the absence of noise.
We prove this phenomenon for the robust risk of both linear regression and classification and hence provide the first theoretical result on robust overfitting.
arXiv Detail & Related papers (2021-08-05T23:04:15Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Neural Pruning via Growing Regularization [82.9322109208353]
We extend regularization to tackle two central problems of pruning: pruning schedule and weight importance scoring.
Specifically, we propose an L2 regularization variant with rising penalty factors and show it can bring significant accuracy gains.
The proposed algorithms are easy to implement and scalable to large datasets and networks in both structured and unstructured pruning.
arXiv Detail & Related papers (2020-12-16T20:16:28Z) - Improve Generalization and Robustness of Neural Networks via Weight
Scale Shifting Invariant Regularizations [52.493315075385325]
We show that a family of regularizers, including weight decay, is ineffective at penalizing the intrinsic norms of weights for networks with homogeneous activation functions.
We propose an improved regularizer that is invariant to weight scale shifting and thus effectively constrains the intrinsic norm of a neural network.
arXiv Detail & Related papers (2020-08-07T02:55:28Z) - Understanding Generalization in Deep Learning via Tensor Methods [53.808840694241]
We advance the understanding of the relations between the network's architecture and its generalizability from the compression perspective.
We propose a series of intuitive, data-dependent and easily-measurable properties that tightly characterize the compressibility and generalizability of neural networks.
arXiv Detail & Related papers (2020-01-14T22:26:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.