A Tale of Two Circuits: Grokking as Competition of Sparse and Dense
Subnetworks
- URL: http://arxiv.org/abs/2303.11873v1
- Date: Tue, 21 Mar 2023 14:17:29 GMT
- Title: A Tale of Two Circuits: Grokking as Competition of Sparse and Dense
Subnetworks
- Authors: William Merrill, Nikolaos Tsilivis, Aman Shukla
- Abstract summary: We study the internal structure of networks undergoing grokking on the sparse parity task.
We find that the grokking phase transition corresponds to the emergence of a sparse subnetwork that dominates model predictions.
- Score: 1.5297569497776375
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Grokking is a phenomenon where a model trained on an algorithmic task first
overfits but, then, after a large amount of additional training, undergoes a
phase transition to generalize perfectly. We empirically study the internal
structure of networks undergoing grokking on the sparse parity task, and find
that the grokking phase transition corresponds to the emergence of a sparse
subnetwork that dominates model predictions. On an optimization level, we find
that this subnetwork arises when a small subset of neurons undergoes rapid norm
growth, whereas the other neurons in the network decay slowly in norm. Thus, we
suggest that the grokking phase transition can be understood to emerge from
competition of two largely distinct subnetworks: a dense one that dominates
before the transition and generalizes poorly, and a sparse one that dominates
afterwards.
Related papers
- Large Stepsize Gradient Descent for Non-Homogeneous Two-Layer Networks: Margin Improvement and Fast Optimization [41.20978920228298]
We show that the second phase begins once the empirical risk falls below a certain threshold, dependent on the stepsize.
We also show that the normalized margin grows nearly monotonically in the second phase, demonstrating an implicit bias of GD in training non-homogeneous predictors.
Our analysis applies to networks of any width, beyond the well-known neural tangent kernel and mean-field regimes.
arXiv Detail & Related papers (2024-06-12T21:33:22Z) - Deep Grokking: Would Deep Neural Networks Generalize Better? [51.24007462968805]
Grokking refers to a sharp rise of the network's generalization accuracy on the test set.
We find that deep neural networks can be more susceptible to grokking than its shallower counterparts.
We also observe an intriguing multi-stage generalization phenomenon when increase the depth of the model.
arXiv Detail & Related papers (2024-05-29T19:05:11Z) - On the training and generalization of deep operator networks [11.159056906971983]
We present a novel training method for deep operator networks (DeepONets)
DeepONets are constructed by two sub-networks.
We establish the width error estimate in terms of input data.
arXiv Detail & Related papers (2023-09-02T21:10:45Z) - Convergence Guarantees of Overparametrized Wide Deep Inverse Prior [1.5362025549031046]
Inverse Priors is an unsupervised approach to transform a random input into an object whose image under the forward model matches the observation.
We provide overparametrization bounds under which such network trained via continuous-time gradient descent will converge exponentially fast with high probability.
This work is thus a first step towards a theoretical understanding of overparametrized DIP networks, and more broadly it participates to the theoretical understanding of neural networks in inverse problem settings.
arXiv Detail & Related papers (2023-03-20T16:49:40Z) - Exact Phase Transitions in Deep Learning [5.33024001730262]
We prove that the competition between prediction error and model complexity in the training loss leads to the second-order phase transition for nets with one hidden layer and the first-order phase transition for nets with more than one hidden layer.
The proposed theory is directly relevant to the optimization of neural networks and points to an origin of the posterior collapse problem in Bayesian deep learning.
arXiv Detail & Related papers (2022-05-25T06:00:34Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over- parameterization, where the width is $tildemathcalO(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - The Sample Complexity of One-Hidden-Layer Neural Networks [57.6421258363243]
We study a class of scalar-valued one-hidden-layer networks, and inputs bounded in Euclidean norm.
We prove that controlling the spectral norm of the hidden layer weight matrix is insufficient to get uniform convergence guarantees.
We analyze two important settings where a mere spectral norm control turns out to be sufficient.
arXiv Detail & Related papers (2022-02-13T07:12:02Z) - Redundant representations help generalization in wide neural networks [71.38860635025907]
We study the last hidden layer representations of various state-of-the-art convolutional neural networks.
We find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information, and differ from each other only by statistically independent noise.
arXiv Detail & Related papers (2021-06-07T10:18:54Z) - A Convergence Theory Towards Practical Over-parameterized Deep Neural
Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks with quadratic widths in the sample size and linear in their depth at a time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z) - Neural Pruning via Growing Regularization [82.9322109208353]
We extend regularization to tackle two central problems of pruning: pruning schedule and weight importance scoring.
Specifically, we propose an L2 regularization variant with rising penalty factors and show it can bring significant accuracy gains.
The proposed algorithms are easy to implement and scalable to large datasets and networks in both structured and unstructured pruning.
arXiv Detail & Related papers (2020-12-16T20:16:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.