A Tale of Two Circuits: Grokking as Competition of Sparse and Dense
Subnetworks
- URL: http://arxiv.org/abs/2303.11873v1
- Date: Tue, 21 Mar 2023 14:17:29 GMT
- Title: A Tale of Two Circuits: Grokking as Competition of Sparse and Dense
Subnetworks
- Authors: William Merrill, Nikolaos Tsilivis, Aman Shukla
- Abstract summary: We study the internal structure of networks undergoing grokking on the sparse parity task.
We find that the grokking phase transition corresponds to the emergence of a sparse subnetwork that dominates model predictions.
- Score: 1.5297569497776375
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Grokking is a phenomenon where a model trained on an algorithmic task first
overfits but, then, after a large amount of additional training, undergoes a
phase transition to generalize perfectly. We empirically study the internal
structure of networks undergoing grokking on the sparse parity task, and find
that the grokking phase transition corresponds to the emergence of a sparse
subnetwork that dominates model predictions. On an optimization level, we find
that this subnetwork arises when a small subset of neurons undergoes rapid norm
growth, whereas the other neurons in the network decay slowly in norm. Thus, we
suggest that the grokking phase transition can be understood to emerge from
competition of two largely distinct subnetworks: a dense one that dominates
before the transition and generalizes poorly, and a sparse one that dominates
afterwards.
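To make the described setup concrete, here is a minimal sketch (not the authors' code) of the sparse parity task and the per-neuron norm tracking discussed in the abstract. The architecture, optimizer, and all hyperparameters below are illustrative assumptions.

```python
# Sparse parity + per-neuron norm tracking: a hedged sketch of the abstract's setup.
import torch
import torch.nn as nn

torch.manual_seed(0)
n_bits, k, width, n_train = 40, 3, 1000, 1000   # illustrative, not the paper's values

def make_data(n):
    bits = torch.randint(0, 2, (n, n_bits))
    y = (bits[:, :k].sum(dim=1) % 2).long()     # label = parity of k fixed bits
    return bits.float() * 2 - 1, y              # encode bits as +/-1 inputs

X, y = make_data(n_train)
X_te, y_te = make_data(2000)

model = nn.Sequential(nn.Linear(n_bits, width), nn.ReLU(), nn.Linear(width, 2))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
loss_fn = nn.CrossEntropyLoss()

def neuron_norms(m):
    # Per-hidden-neuron norm: each neuron's incoming and outgoing weights combined.
    w_in, w_out = m[0].weight, m[2].weight      # (width, n_bits), (2, width)
    return (w_in.norm(dim=1) ** 2 + w_out.norm(dim=0) ** 2).sqrt()

for step in range(20001):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
    if step % 2000 == 0:
        with torch.no_grad():
            acc = (model(X_te).argmax(-1) == y_te).float().mean().item()
            norms = neuron_norms(model)
            # A handful of neurons grow rapidly in norm while the bulk slowly decays.
            print(f"step {step:6d}  train loss {loss.item():.4f}  test acc {acc:.3f}  "
                  f"top-10 norm {norms.topk(10).values.mean().item():.3f}  "
                  f"median norm {norms.median().item():.3f}")
```

Tracking the gap between the top few neuron norms and the median norm is one simple way to watch the sparse subnetwork emerge around the grokking transition.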
Related papers
- Generalization emerges from local optimization in a self-organized learning network [0.0]
We design and analyze a new paradigm for building supervised learning networks, driven only by local optimization rules without relying on a global error function.
Our network stores new knowledge in the nodes accurately and instantaneously, in the form of a lookup table.
We show on numerous examples of classification tasks that the networks generated by our algorithm systematically reach such a state of perfect generalization when the number of learned examples becomes sufficiently large.
We report on the dynamics of the change of state and show that it is abrupt and has the distinctive characteristics of a first-order phase transition, a phenomenon already observed for traditional learning networks and known as grokking.
arXiv Detail & Related papers (2024-10-03T15:32:08Z) - Large Stepsize Gradient Descent for Non-Homogeneous Two-Layer Networks: Margin Improvement and Fast Optimization [41.20978920228298]
We show that the second phase begins once the empirical risk falls below a certain threshold, dependent on the stepsize.
We also show that the normalized margin grows nearly monotonically in the second phase, demonstrating an implicit bias of GD in training non-homogeneous predictors.
Our analysis applies to networks of any width, beyond the well-known neural tangent kernel and mean-field regimes.
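A minimal sketch of monitoring empirical risk and normalized margin under large-stepsize gradient descent is below. Assumptions not in the summary: logistic loss, a two-layer tanh network on a separable toy dataset, and the common normalization min_i y_i f(x_i) / ||theta||^2; the paper's exact setting may differ.

```python
# Hedged sketch: large-stepsize GD, empirical risk, and normalized margin.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(200, 5)
y = torch.sign(X[:, 0] + 0.5 * X[:, 1])             # separable toy labels in {-1, +1}

model = nn.Sequential(nn.Linear(5, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=1.0)   # deliberately large stepsize

def normalized_margin(m):
    with torch.no_grad():
        out = m(X).squeeze(-1)
        theta_sq = sum(p.norm() ** 2 for p in m.parameters())
        return ((y * out).min() / theta_sq).item()

for step in range(5001):
    opt.zero_grad()
    logits = model(X).squeeze(-1)
    risk = F.softplus(-y * logits).mean()           # logistic loss log(1 + exp(-y f(x)))
    risk.backward()
    opt.step()
    if step % 500 == 0:
        # Once the risk falls below a stepsize-dependent threshold, training enters a
        # stable phase and the normalized margin grows nearly monotonically.
        print(f"step {step:5d}  risk {risk.item():.4f}  "
              f"normalized margin {normalized_margin(model):.5f}")
```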
arXiv Detail & Related papers (2024-06-12T21:33:22Z) - Deep Grokking: Would Deep Neural Networks Generalize Better? [51.24007462968805]
Grokking refers to a sharp rise in the network's generalization accuracy on the test set.
We find that deep neural networks can be more susceptible to grokking than their shallower counterparts.
We also observe an intriguing multi-stage generalization phenomenon when increasing the depth of the model.
arXiv Detail & Related papers (2024-05-29T19:05:11Z) - On the training and generalization of deep operator networks [11.159056906971983]
We present a novel training method for deep operator networks (DeepONets).
DeepONets are constructed from two sub-networks.
We establish a width error estimate in terms of the input data.
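The two sub-networks are the standard branch/trunk decomposition of a DeepONet; below is a minimal sketch of that architecture. The layer sizes and the toy usage are illustrative assumptions, and this is not the training method proposed in the paper.

```python
# Hedged sketch of the standard DeepONet branch/trunk architecture.
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    def __init__(self, m_sensors=100, coord_dim=1, p=64):
        super().__init__()
        # Branch sub-network: encodes the input function sampled at fixed sensor points.
        self.branch = nn.Sequential(nn.Linear(m_sensors, 128), nn.ReLU(), nn.Linear(128, p))
        # Trunk sub-network: encodes the coordinate at which the output is evaluated.
        self.trunk = nn.Sequential(nn.Linear(coord_dim, 128), nn.ReLU(), nn.Linear(128, p))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, u_samples, y_coords):
        b = self.branch(u_samples)        # (batch, p)
        t = self.trunk(y_coords)          # (batch, p)
        return (b * t).sum(dim=-1, keepdim=True) + self.bias   # G(u)(y)

# Usage: 8 input functions sampled at 100 sensors, each queried at one coordinate.
net = DeepONet()
out = net(torch.randn(8, 100), torch.rand(8, 1))
print(out.shape)                          # torch.Size([8, 1])
```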
arXiv Detail & Related papers (2023-09-02T21:10:45Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
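For context, here is a minimal sketch (not the paper's construction) of computing the empirical NTK, K(x, x') = <grad_theta f(x), grad_theta f(x')>, for a one-hidden-layer ReLU network whose large negative bias makes the hidden activations sparse. The bias shift and sizes are illustrative assumptions.

```python
# Hedged sketch: empirical NTK of a one-hidden-layer ReLU net with a large bias.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, width, bias_shift = 10, 512, -1.0

model = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 1, bias=False))
with torch.no_grad():
    model[0].bias += bias_shift                      # large bias -> few neurons activate

def param_grad(x):
    # Flattened gradient of the scalar output with respect to all parameters.
    model.zero_grad()
    model(x.unsqueeze(0)).squeeze().backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

X = torch.randn(5, d)
grads = torch.stack([param_grad(x) for x in X])      # (5, n_params)
K = grads @ grads.T                                  # empirical NTK Gram matrix
print(K)

with torch.no_grad():
    active = (model[0](X) > 0).float().mean().item() # fraction of active hidden units
print(f"fraction of active neurons: {active:.3f}")
```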
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - Exact Phase Transitions in Deep Learning [5.33024001730262]
We prove that the competition between prediction error and model complexity in the training loss leads to the second-order phase transition for nets with one hidden layer and the first-order phase transition for nets with more than one hidden layer.
The proposed theory is directly relevant to the optimization of neural networks and points to an origin of the posterior collapse problem in Bayesian deep learning.
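As a purely schematic illustration (not the paper's exact formulation), the competition can be written as a training loss that trades prediction error against a complexity penalty weighted by a parameter lambda:

```latex
% Schematic only: the training loss trades prediction error against model complexity.
% Per the abstract above, minimizing such a loss yields a second-order phase
% transition for one hidden layer and a first-order one for deeper networks.
L_\lambda(\theta)
  = \underbrace{\mathbb{E}_{(x,y)}\,\ell\!\left(f_\theta(x), y\right)}_{\text{prediction error}}
  + \lambda\,\underbrace{C(\theta)}_{\text{model complexity}}
```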
arXiv Detail & Related papers (2022-05-25T06:00:34Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
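Below is a minimal sketch of counting the breakpoints (and hence linear regions) of a shallow univariate ReLU network f(x) = sum_i v_i * relu(w_i x + b_i) + c on an interval. The network here is randomly initialized and the "effective" threshold is an illustrative assumption; the cited paper analyzes networks trained by gradient flow.

```python
# Hedged sketch: breakpoints and effective linear regions of a shallow 1-D ReLU net.
import torch
import torch.nn as nn

torch.manual_seed(0)
width = 50
net = nn.Sequential(nn.Linear(1, width), nn.ReLU(), nn.Linear(width, 1))

w = net[0].weight.detach().squeeze(-1)     # (width,) input weights
b = net[0].bias.detach()                   # (width,) biases
v = net[2].weight.detach().squeeze(0)      # (width,) output weights

kinks = -b / w                             # neuron i bends f at x_i = -b_i / w_i ...
jumps = (v * w).abs()                      # ... with a slope change of |v_i * w_i|
tol = 1e-6                                 # kinks with a smaller jump are treated as inactive
effective = ((kinks > -3) & (kinks < 3) & (jumps > tol)).sum().item()
print(f"hidden neurons: {width}, effective breakpoints in (-3, 3): {effective}, "
      f"effective linear regions: {effective + 1}")
```

Each hidden neuron contributes at most one breakpoint, so the region count on an interval is at most width + 1; the point of the cited paper is that the effective number can be much smaller than this worst case.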
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - The Sample Complexity of One-Hidden-Layer Neural Networks [57.6421258363243]
We study a class of scalar-valued one-hidden-layer networks with inputs bounded in Euclidean norm.
We prove that controlling the spectral norm of the hidden layer weight matrix is insufficient to get uniform convergence guarantees.
We analyze two important settings where a mere spectral norm control turns out to be sufficient.
arXiv Detail & Related papers (2022-02-13T07:12:02Z) - Redundant representations help generalization in wide neural networks [71.38860635025907]
We study the last hidden layer representations of various state-of-the-art convolutional neural networks.
We find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information, and differ from each other only by statistically independent noise.
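Here is a minimal sketch of the kind of redundancy measurement described above: correlate the last-hidden-layer neurons across a batch of inputs and group the nearly identical ones. For illustration this uses an untrained MLP on random data, whereas the paper studies trained state-of-the-art convolutional networks; the width, threshold, and grouping rule are illustrative assumptions.

```python
# Hedged sketch: grouping near-identical neurons in a wide last hidden layer.
import torch
import torch.nn as nn

torch.manual_seed(0)
backbone = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 2048), nn.ReLU())

X = torch.randn(1000, 32)
with torch.no_grad():
    H = backbone(X)                                  # (1000, 2048) last hidden representation

H = (H - H.mean(0)) / (H.std(0) + 1e-8)              # standardize each neuron
corr = (H.T @ H) / H.shape[0]                        # neuron-neuron correlation matrix
corr.fill_diagonal_(1.0)                             # guard against zero-variance neurons

# Greedily group neurons whose activations are almost perfectly correlated.
threshold = 0.99
assigned = torch.zeros(corr.shape[0], dtype=torch.bool)
groups = []
for i in range(corr.shape[0]):
    if assigned[i]:
        continue
    members = torch.where((corr[i] > threshold) & ~assigned)[0]
    assigned[members] = True
    groups.append(members)
print(f"{corr.shape[0]} neurons fall into {len(groups)} groups of (near-)identical neurons")
```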
arXiv Detail & Related papers (2021-06-07T10:18:54Z) - A Convergence Theory Towards Practical Over-parameterized Deep Neural
Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks with quadratic widths in the sample size and linear in their depth at a time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z)