To grok or not to grok: Disentangling generalization and memorization on
corrupted algorithmic datasets
- URL: http://arxiv.org/abs/2310.13061v2
- Date: Mon, 4 Mar 2024 21:59:58 GMT
- Authors: Darshil Doshi, Aritra Das, Tianyu He, Andrey Gromov
- Abstract summary: We study an interpretable model where generalizing representations are understood analytically, and are easily distinguishable from the memorizing ones.
We show that (i) it is possible for the network to memorize the corrupted labels \emph{and} achieve $100\%$ generalization at the same time.
We also show that in the presence of regularization, the training dynamics involves two consecutive stages.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Robust generalization is a major challenge in deep learning, particularly
when the number of trainable parameters is very large. In general, it is very
difficult to know if the network has memorized a particular set of examples or
understood the underlying rule (or both). Motivated by this challenge, we study
an interpretable model where generalizing representations are understood
analytically, and are easily distinguishable from the memorizing ones. Namely,
we consider multi-layer perceptron (MLP) and Transformer architectures trained
on modular arithmetic tasks, where ($\xi \cdot 100\%$) of labels are corrupted
(\emph{i.e.} some results of the modular operations in the training set are
incorrect). We show that (i) it is possible for the network to memorize the
corrupted labels \emph{and} achieve $100\%$ generalization at the same time;
(ii) the memorizing neurons can be identified and pruned, lowering the accuracy
on corrupted data and improving the accuracy on uncorrupted data; (iii)
regularization methods such as weight decay, dropout and BatchNorm force the
network to ignore the corrupted data during optimization, and achieve $100\%$
accuracy on the uncorrupted dataset; and (iv) the effect of these
regularization methods is (``mechanistically'') interpretable: weight decay and
dropout force all the neurons to learn generalizing representations, while
BatchNorm de-amplifies the output of memorizing neurons and amplifies the
output of the generalizing ones. Finally, we show that in the presence of
regularization, the training dynamics involves two consecutive stages: first,
the network undergoes \emph{grokking} dynamics reaching high train \emph{and}
test accuracy; second, it unlearns the memorizing representations, where the
train accuracy suddenly drops from $100\%$ to $100 (1-\xi)\%$.
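The experimental setup described above can be sketched in a few lines. The sketch below is illustrative only: the modulus $p = 97$, the choice of addition as the modular operation, and the uniform-offset corruption scheme are assumptions for the example, not necessarily the paper's exact configuration.

```python
import numpy as np

def make_corrupted_modular_dataset(p=97, xi=0.2, seed=0):
    """All pairs (a, b) labeled (a + b) mod p, with a fraction xi of the
    labels corrupted. The modulus, the operation, and the uniform-offset
    corruption scheme are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    a, b = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
    X = np.stack([a.ravel(), b.ravel()], axis=1)
    y = (X[:, 0] + X[:, 1]) % p
    corrupt = rng.choice(len(y), size=int(xi * len(y)), replace=False)
    # Add a nonzero offset mod p so every corrupted label is guaranteed wrong.
    y = y.copy()
    y[corrupt] = (y[corrupt] + rng.integers(1, p, size=len(corrupt))) % p
    return X, y, corrupt

X, y, corrupt = make_corrupted_modular_dataset()
# A regularized network that ignores the corrupted examples can reach at most
# 100 * (1 - xi) percent accuracy on these (corrupted) training labels:
ceiling = ((X[:, 0] + X[:, 1]) % 97 == y).mean()
print(f"train-accuracy ceiling: {ceiling:.3f}")  # ~0.800 for xi = 0.2
```

This ceiling is exactly the $100 (1-\xi)\%$ train accuracy that the abstract describes the network settling to after the unlearning stage.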
Related papers
- Decoding Generalization from Memorization in Deep Neural Networks
Deep Neural Networks that generalize well have been key to the dramatic success of Deep Learning in recent years.
Deep networks are known to memorize training data, as evidenced by perfect or near-perfect training accuracy on data whose class labels have been shuffled to varying degrees.
Here, we provide empirical evidence that, even in the face of such memorization, these models retain information in their representations that supports substantially improved generalization.
arXiv Detail & Related papers (2025-01-24T18:01:27Z)
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning?
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
- Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data
We introduce an extended concept of memorization, distributional memorization, which measures the correlation between the output probabilities and the pretraining data frequency.
We show that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks.
arXiv Detail & Related papers (2024-07-20T21:24:40Z)
- Grokking as the Transition from Lazy to Rich Training Dynamics
Grokking occurs when the train loss of a neural network decreases much earlier than its test loss.
Key determinants of grokking are the rate of feature learning and the alignment of the initial features with the target function.
arXiv Detail & Related papers (2023-10-09T19:33:21Z)
- Explaining grokking through circuit efficiency
Grokking is the phenomenon in which a network with perfect training accuracy but poor generalisation transitions to perfect generalisation.
We make and confirm four novel predictions about grokking, providing significant evidence in favour of our explanation.
We demonstrate two novel and surprising behaviours: ungrokking, in which a network regresses from perfect to low test accuracy, and semi-grokking, in which a network shows delayed generalisation to partial rather than perfect test accuracy.
arXiv Detail & Related papers (2023-09-05T17:00:24Z)
- Characterizing Datapoints via Second-Split Forgetting
We propose the second-split forgetting time (SSFT), a complementary metric that tracks the epoch (if any) after which an original training example is forgotten.
We demonstrate that mislabeled examples are forgotten quickly, and seemingly rare examples are forgotten comparatively slowly.
SSFT can (i) help to identify mislabeled samples, the removal of which improves generalization; and (ii) provide insights about failure modes.
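The SSFT idea can be sketched as a small bookkeeping routine. This is a hypothetical illustration of the metric's core logic (the exact definition is in the cited paper): while continuing to train on a second data split, record per-epoch correctness for each first-split example, then find the first epoch after which an example stays wrong for good.

```python
import numpy as np

def second_split_forgetting_time(correct_over_epochs):
    """correct_over_epochs: bool array of shape (epochs, n_examples); entry
    [t, i] says whether first-split example i is still predicted correctly
    after epoch t of training on the *second* split. Returns, per example,
    the first epoch of its final wrong streak, or -1 if never forgotten.
    Hypothetical sketch of the metric's logic, not the paper's code."""
    epochs, n = correct_over_epochs.shape
    ssft = np.full(n, -1)
    for i in range(n):
        wrong_since = -1
        for t in range(epochs):
            if correct_over_epochs[t, i]:
                wrong_since = -1          # recovered: reset the streak
            elif wrong_since == -1:
                wrong_since = t           # start of a (possibly final) wrong streak
        ssft[i] = wrong_since
    return ssft

# Toy record: example 0 is forgotten at epoch 1, example 1 is never forgotten,
# example 2 dips at epoch 1 but recovers, then is forgotten at epoch 3.
record = np.array([
    [True,  True,  True ],
    [False, True,  False],
    [False, True,  True ],
    [False, True,  False],
])
print(second_split_forgetting_time(record))  # example 0 -> 1, example 1 -> -1, example 2 -> 3
```

Under this bookkeeping, quickly-forgotten examples (small SSFT) would flag likely mislabeled samples, matching the paper's proposed use.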
arXiv Detail & Related papers (2022-10-26T21:03:46Z)
- The Curious Case of Benign Memorization
We show that under training protocols that include data augmentation, neural networks learn to memorize entirely random labels in a benign way.
We demonstrate that deep models have the surprising ability to separate noise from signal by distributing the task of memorization and feature learning to different layers.
arXiv Detail & Related papers (2022-10-25T13:41:31Z)
- Understanding Memorization from the Perspective of Optimization via Efficient Influence Estimation
We study the phenomenon of memorization with turn-over dropout, an efficient method for estimating influence and memorization, on data with true labels (real data) and data with random labels (random data).
Our main findings are: (i) for both real and random data, the network optimizes easy examples (e.g., real data) and difficult examples (e.g., random data) simultaneously, with easy ones learned faster; (ii) for real data, a correct difficult example in the training dataset is more informative than an easy one.
arXiv Detail & Related papers (2021-12-16T11:34:23Z)
- What training reveals about neural network complexity
This work explores the hypothesis that the complexity of the function a deep neural network (NN) is learning can be deduced by how fast its weights change during training.
Our results support the hypothesis that good training behavior can be a useful bias towards good generalization.
arXiv Detail & Related papers (2021-06-08T08:58:00Z)
- Exploring Memorization in Adversarial Training
We investigate the memorization effect in adversarial training (AT) for promoting a deeper understanding of capacity, convergence, generalization, and especially robust overfitting.
We propose a new mitigation algorithm motivated by detailed memorization analyses.
arXiv Detail & Related papers (2021-06-03T05:39:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.