Related papers: The Complexity Dynamics of Grokking

The Complexity Dynamics of Grokking

URL: http://arxiv.org/abs/2412.09810v1
Date: Fri, 13 Dec 2024 02:57:59 GMT
Title: The Complexity Dynamics of Grokking
Authors: Branton DeMoss, Silvia Sapora, Jakob Foerster, Nick Hawes, Ingmar Posner,
Abstract summary: We introduce a new measure of intrinsic complexity for neural networks based on the theory of Kolmogorov complexity.<n> Tracking this metric throughout network training, we find a consistent pattern in training dynamics, consisting of a rise and fall in complexity.<n>Based on insights from rate--distortion theory and the minimum description length principle, we lay out a principled approach to lossy compression of neural networks.
Score: 21.075837465689887
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We investigate the phenomenon of generalization through the lens of compression. In particular, we study the complexity dynamics of neural networks to explain grokking, where networks suddenly transition from memorizing to generalizing solutions long after over-fitting the training data. To this end we introduce a new measure of intrinsic complexity for neural networks based on the theory of Kolmogorov complexity. Tracking this metric throughout network training, we find a consistent pattern in training dynamics, consisting of a rise and fall in complexity. We demonstrate that this corresponds to memorization followed by generalization. Based on insights from rate--distortion theory and the minimum description length principle, we lay out a principled approach to lossy compression of neural networks, and connect our complexity measure to explicit generalization bounds. Based on a careful analysis of information capacity in neural networks, we propose a new regularization method which encourages networks towards low-rank representations by penalizing their spectral entropy, and find that our regularizer outperforms baselines in total compression of the dataset.

Related papers

Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime. We show that the neural networks possess a different limiting kernel which we call textitbias-generalized NTK We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z)
Neural networks trained with SGD learn distributions of increasing complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics. We then exploit higher-order statistics only later during training. We discuss the relation of DSB to other simplicity biases and consider its implications for the principle of universality in learning.
arXiv Detail & Related papers (2022-11-21T15:27:22Z)
Correlation between entropy and generalizability in a neural network [9.223853439465582]
We use Wang-Landau Mote Carlo algorithm to calculate the entropy at a given test accuracy. Our results show that entropical forces help generalizability.
arXiv Detail & Related papers (2022-07-05T12:28:13Z)
How You Start Matters for Generalization [26.74340246715699]
We show that the generalization of neural networks is heavily tied to their initializes. We make a case against the controversial flat-minima conjecture.
arXiv Detail & Related papers (2022-06-17T05:30:56Z)
Learning Theory Can (Sometimes) Explain Generalisation in Graph Neural Networks [13.518582483147325]
We provide a rigorous analysis of the performance of neural networks in the context of transductive inference. We show that transductive Rademacher complexity can explain the generalisation properties of graph convolutional networks for block models.
arXiv Detail & Related papers (2021-12-07T20:06:23Z)
A neural anisotropic view of underspecification in deep learning [60.119023683371736]
We show that the way neural networks handle the underspecification of problems is highly dependent on the data representation. Our results highlight that understanding the architectural inductive bias in deep learning is fundamental to address the fairness, robustness, and generalization of these systems.
arXiv Detail & Related papers (2021-04-29T14:31:09Z)
Formalizing Generalization and Robustness of Neural Networks to Weight Perturbations [58.731070632586594]
We provide the first formal analysis for feed-forward neural networks with non-negative monotone activation functions against weight perturbations. We also design a new theory-driven loss function for training generalizable and robust neural networks against weight perturbations.
arXiv Detail & Related papers (2021-03-03T06:17:03Z)
Compressive Sensing and Neural Networks from a Statistical Learning Perspective [4.561032960211816]
We present a generalization error analysis for a class of neural networks suitable for sparse reconstruction from few linear measurements. Under realistic conditions, the generalization error scales only logarithmically in the number of layers, and at most linear in number of measurements.
arXiv Detail & Related papers (2020-10-29T15:05:43Z)
Generalization bound of globally optimal non-convex neural network training: Transportation map estimation by infinite dimensional Langevin dynamics [50.83356836818667]
We introduce a new theoretical framework to analyze deep learning optimization with connection to its generalization error. Existing frameworks such as mean field theory and neural tangent kernel theory for neural network optimization analysis typically require taking limit of infinite width of the network to show its global convergence.
arXiv Detail & Related papers (2020-07-11T18:19:50Z)
Understanding Generalization in Deep Learning via Tensor Methods [53.808840694241]
We advance the understanding of the relations between the network's architecture and its generalizability from the compression perspective. We propose a series of intuitive, data-dependent and easily-measurable properties that tightly characterize the compressibility and generalizability of neural networks.
arXiv Detail & Related papers (2020-01-14T22:26:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.