GrokAlign: Geometric Characterisation and Acceleration of Grokking
- URL: http://arxiv.org/abs/2506.12284v2
- Date: Thu, 31 Jul 2025 05:15:21 GMT
- Title: GrokAlign: Geometric Characterisation and Acceleration of Grokking
- Authors: Thomas Walker, Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk,
- Abstract summary: We show that aligning a network's Jacobians with the training data ensures grokking under a low-rank Jacobian assumption.<n>Our results provide a strong theoretical motivation for the use of Jacobian regularisation in optimizing deep networks.<n>We introduce centroid alignment as a tractable and interpretable simplification of Jacobian alignment.
- Score: 14.286905882211896
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A key challenge for the machine learning community is to understand and accelerate the training dynamics of deep networks that lead to delayed generalisation and emergent robustness to input perturbations, also known as grokking. Prior work has associated phenomena like delayed generalisation with the transition of a deep network from a linear to a feature learning regime, and emergent robustness with changes to the network's functional geometry, in particular the arrangement of the so-called linear regions in deep networks employing continuous piecewise affine nonlinearities. Here, we explain how grokking is realised in the Jacobian of a deep network and demonstrate that aligning a network's Jacobians with the training data (in the sense of cosine similarity) ensures grokking under a low-rank Jacobian assumption. Our results provide a strong theoretical motivation for the use of Jacobian regularisation in optimizing deep networks -- a method we introduce as GrokAlign -- which we show empirically to induce grokking much sooner than more conventional regularizers like weight decay. Moreover, we introduce centroid alignment as a tractable and interpretable simplification of Jacobian alignment that effectively identifies and tracks the stages of deep network training dynamics. Accompanying webpage (https://thomaswalker1.github.io/blog/grokalign.html) and code (https://github.com/ThomasWalker1/grokalign).
Related papers
- Neural Networks with Orthogonal Jacobian [10.111901389604423]
Very deep neural networks achieve state-of-the-art performance by extracting rich, hierarchical features.<n>Training them via backpropagation is often hindered by vanishing or exploding gradients.<n>We introduce a unified mathematical framework that describes a broad class of nonlinear feedforward and residual networks.
arXiv Detail & Related papers (2025-08-04T20:24:49Z) - Generalization emerges from local optimization in a self-organized learning network [0.0]
We design and analyze a new paradigm for building supervised learning networks, driven only by local optimization rules without relying on a global error function.
Our network stores new knowledge in the nodes accurately and instantaneously, in the form of a lookup table.
We show on numerous examples of classification tasks that the networks generated by our algorithm systematically reach such a state of perfect generalization when the number of learned examples becomes sufficiently large.
We report on the dynamics of the change of state and show that it is abrupt and has the distinctive characteristics of a first order phase transition, a phenomenon already observed for traditional learning networks and known as grokking.
arXiv Detail & Related papers (2024-10-03T15:32:08Z) - Feature Learning and Generalization in Deep Networks with Orthogonal Weights [1.7956122940209063]
Deep neural networks with numerically weights from independent Gaussian distributions can be tuned to criticality.
These networks still exhibit fluctuations that grow linearly with the depth of the network.
We show analytically that rectangular networks with tanh activations and weights from the ensemble of matrices have corresponding preactivation fluctuations.
arXiv Detail & Related papers (2023-10-11T18:00:02Z) - Convergence Analysis and Implicit Regularization of Feedback Alignment
for Deep Linear Networks [27.614609336582568]
We theoretically analyze the Feedback Alignment (FA) algorithm, an efficient alternative to backpropagation for training neural networks.
We provide convergence guarantees with rates for deep linear networks for both continuous and discrete dynamics.
arXiv Detail & Related papers (2021-10-20T22:57:03Z) - Path Regularization: A Convexity and Sparsity Inducing Regularization
for Parallel ReLU Networks [75.33431791218302]
We study the training problem of deep neural networks and introduce an analytic approach to unveil hidden convexity in the optimization landscape.
We consider a deep parallel ReLU network architecture, which also includes standard deep networks and ResNets as its special cases.
arXiv Detail & Related papers (2021-10-18T18:00:36Z) - The Principles of Deep Learning Theory [19.33681537640272]
This book develops an effective theory approach to understanding deep neural networks of practical relevance.
We explain how these effectively-deep networks learn nontrivial representations from training.
We show that the depth-to-width ratio governs the effective model complexity of the ensemble of trained networks.
arXiv Detail & Related papers (2021-06-18T15:00:00Z) - Learning Structures for Deep Neural Networks [99.8331363309895]
We propose to adopt the efficient coding principle, rooted in information theory and developed in computational neuroscience.
We show that sparse coding can effectively maximize the entropy of the output signals.
Our experiments on a public image classification dataset demonstrate that using the structure learned from scratch by our proposed algorithm, one can achieve a classification accuracy comparable to the best expert-designed structure.
arXiv Detail & Related papers (2021-05-27T12:27:24Z) - Rethinking Skip Connection with Layer Normalization in Transformers and
ResNets [49.87919454950763]
Skip connection is a widely-used technique to improve the performance of deep neural networks.
In this work, we investigate how the scale factors in the effectiveness of the skip connection.
arXiv Detail & Related papers (2021-05-15T11:44:49Z) - A Convergence Theory Towards Practical Over-parameterized Deep Neural
Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks with quadratic widths in the sample size and linear in their depth at a time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z) - Learning Connectivity of Neural Networks from a Topological Perspective [80.35103711638548]
We propose a topological perspective to represent a network into a complete graph for analysis.
By assigning learnable parameters to the edges which reflect the magnitude of connections, the learning process can be performed in a differentiable manner.
This learning process is compatible with existing networks and owns adaptability to larger search spaces and different tasks.
arXiv Detail & Related papers (2020-08-19T04:53:31Z) - Improve Generalization and Robustness of Neural Networks via Weight
Scale Shifting Invariant Regularizations [52.493315075385325]
We show that a family of regularizers, including weight decay, is ineffective at penalizing the intrinsic norms of weights for networks with homogeneous activation functions.
We propose an improved regularizer that is invariant to weight scale shifting and thus effectively constrains the intrinsic norm of a neural network.
arXiv Detail & Related papers (2020-08-07T02:55:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.