What Can Grokking Teach Us About Learning Under Nonstationarity?
- URL: http://arxiv.org/abs/2507.20057v1
- Date: Sat, 26 Jul 2025 20:51:24 GMT
- Title: What Can Grokking Teach Us About Learning Under Nonstationarity?
- Authors: Clare Lyle, Ghada Sokar, Razvan Pascanu, Andras Gyorgy
- Abstract summary: In continual learning problems, it is necessary to overwrite components of a neural network's learned representation in response to changes in the data stream. Neural networks often exhibit primacy bias, whereby early training data hinders the network's ability to generalize on later tasks. The emergence of feature-learning dynamics is known to drive the phenomenon of grokking.
- Score: 21.031486400628854
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In continual learning problems, it is often necessary to overwrite components of a neural network's learned representation in response to changes in the data stream; however, neural networks often exhibit primacy bias, whereby early training data hinders the network's ability to generalize on later tasks. While the feature-learning dynamics of nonstationary learning problems are not well studied, the emergence of feature-learning dynamics is known to drive the phenomenon of grokking, wherein neural networks initially memorize their training data and only later exhibit perfect generalization. This work conjectures that the same feature-learning dynamics which facilitate generalization in grokking also underlie the ability to overwrite previously learned features, and that methods which accelerate grokking by facilitating feature-learning dynamics are promising candidates for addressing primacy bias in non-stationary learning problems. We then propose a straightforward method to induce feature-learning dynamics as needed throughout training by increasing the effective learning rate, i.e. the ratio between parameter and update norms. We show that this approach both facilitates feature learning and improves generalization in a variety of settings, including grokking, warm-starting neural network training, and reinforcement learning tasks.
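The quantity being manipulated, the effective learning rate, measures how large each update is relative to the current parameter norm. Below is a minimal sketch of measuring it and of one way to raise it, assuming PyTorch; the helper names (`effective_lr`, `shrink_parameters`) and the parameter-rescaling intervention are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector

def effective_lr(model: nn.Module, lr: float) -> float:
    """Estimate the effective learning rate as ||lr * grad|| / ||theta||.
    Assumes loss.backward() has already populated the .grad fields."""
    params = [p for p in model.parameters() if p.grad is not None]
    theta = parameters_to_vector(params)
    update = lr * parameters_to_vector([p.grad for p in params])
    return (update.norm() / theta.norm()).item()

@torch.no_grad()
def shrink_parameters(model: nn.Module, factor: float = 0.5) -> None:
    """Illustrative intervention: rescaling the weights to a smaller norm
    makes every subsequent update proportionally larger relative to the
    parameters, i.e. it raises the effective learning rate."""
    for p in model.parameters():
        p.mul_(factor)
```

With adaptive optimizers such as Adam, step sizes are roughly independent of the weight scale, so shrinking the parameter norm is one cheap lever for re-triggering feature-learning dynamics on demand.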
Related papers
- Lyapunov Learning at the Onset of Chaos [41.94295877935867]
We propose a novel training algorithm for neural networks called Lyapunov Learning. Our approach demonstrates effective and significant improvements in experiments involving regime shifts in non-stationary systems.
arXiv Detail & Related papers (2025-06-15T10:53:02Z) - A distributional simplicity bias in the learning dynamics of transformers [50.91742043564049]
We show that transformers, trained on natural language data, also display a simplicity bias. Specifically, they sequentially learn many-body interactions among input tokens, reaching a saturation point in the prediction error for low-degree interactions. This approach opens up the possibility of studying how interactions of different orders in the data affect learning, in natural language processing and beyond.
arXiv Detail & Related papers (2024-10-25T15:39:34Z) - Early learning of the optimal constant solution in neural networks and humans [4.016584525313835]
We show that learning of a target function is preceded by an early phase in which networks learn the optimal constant solution (OCS).
We show that learning of the OCS can emerge even in the absence of bias terms and is equivalently driven by generic correlations in the input data.
Our work suggests the OCS as a universal learning principle in supervised, error-corrective learning.
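For concreteness, the OCS is the single output that minimizes the expected loss while ignoring the inputs entirely: under cross-entropy it is the vector of empirical class frequencies (under mean squared error, the target mean). A minimal sketch, assuming a classification setting; the function name is illustrative.

```python
import numpy as np

def optimal_constant_solution(labels: np.ndarray, num_classes: int) -> np.ndarray:
    """The constant predictor minimizing cross-entropy loss: the empirical
    class-frequency vector, computed without looking at any inputs."""
    counts = np.bincount(labels, minlength=num_classes)
    return counts / counts.sum()

# A network in the early OCS phase outputs roughly these probabilities
# for every input, regardless of its content.
labels = np.array([0] * 7 + [1] * 2 + [2] * 1)
print(optimal_constant_solution(labels, num_classes=3))  # [0.7 0.2 0.1]
```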
arXiv Detail & Related papers (2024-06-25T11:12:52Z) - Neural networks trained with SGD learn distributions of increasing complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics.
They exploit higher-order statistics only later during training.
We discuss the relation of this distributional simplicity bias (DSB) to other simplicity biases and consider its implications for the principle of universality in learning.
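One way to probe such a bias, sketched below under assumptions of my own rather than the paper's exact protocol: evaluate a trained classifier on Gaussian "clones" of the data that preserve only the empirical mean and covariance; if early-training accuracy on clones matches accuracy on real data, the network is relying on lower-order statistics.

```python
import numpy as np

def gaussian_clone(x: np.ndarray, seed: int = 0) -> np.ndarray:
    """Resample a dataset from a Gaussian matching only the empirical mean
    and covariance (first- and second-order statistics) of x, discarding
    all higher-order structure."""
    rng = np.random.default_rng(seed)
    mean = x.mean(axis=0)
    cov = np.cov(x, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=len(x))
```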
arXiv Detail & Related papers (2022-11-21T15:27:22Z) - Critical Learning Periods for Multisensory Integration in Deep Networks [112.40005682521638]
We show that the ability of a neural network to integrate information from diverse sources hinges critically on being exposed to properly correlated signals during the early phases of training.
We show that critical periods arise from the complex and unstable early transient dynamics, which are decisive for the final performance of the trained system and its learned representations.
arXiv Detail & Related papers (2022-10-06T23:50:38Z) - Synergistic information supports modality integration and flexible learning in neural networks solving multiple tasks [107.8565143456161]
We investigate the information processing strategies adopted by simple artificial neural networks performing a variety of cognitive tasks.
Results show that synergy increases as neural networks learn multiple diverse tasks.
Randomly turning off neurons during training through dropout increases network redundancy, corresponding to an increase in robustness.
arXiv Detail & Related papers (2022-10-06T15:36:27Z) - Improving Systematic Generalization Through Modularity and Augmentation [1.2183405753834562]
We investigate how two well-known modeling principles -- modularity and data augmentation -- affect systematic generalization of neural networks.
We show that even in the controlled setting of a synthetic benchmark, achieving systematic generalization remains very difficult.
arXiv Detail & Related papers (2022-02-22T09:04:35Z) - Data-driven emergence of convolutional structure in neural networks [83.4920717252233]
We show how fully-connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs.
By carefully designing data models, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs.
arXiv Detail & Related papers (2022-02-01T17:11:13Z) - Being Friends Instead of Adversaries: Deep Networks Learn from Data Simplified by Other Networks [23.886422706697882]
A different idea, named Friendly Training, has recently been proposed; it consists of altering the input data by adding an automatically estimated perturbation.
We revisit and extend this idea inspired by the effectiveness of neural generators in the context of Adversarial Machine Learning.
We propose an auxiliary multi-layer network that is responsible for altering the input data to make it easier for the classifier to handle.
arXiv Detail & Related papers (2021-12-18T16:59:35Z) - Reasoning-Modulated Representations [85.08205744191078]
We study a common setting where our task is not purely opaque.
Our approach paves the way for a new class of data-efficient representation learning.
arXiv Detail & Related papers (2021-07-19T13:57:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.