Progress measures for grokking via mechanistic interpretability
- URL: http://arxiv.org/abs/2301.05217v3
- Date: Thu, 19 Oct 2023 21:25:32 GMT
- Title: Progress measures for grokking via mechanistic interpretability
- Authors: Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt
- Abstract summary: We study the recently-discovered phenomenon of "grokking" exhibited by small transformers trained on modular addition tasks.
Our results show that grokking, rather than being a sudden shift, arises from the gradual amplification of structured mechanisms encoded in the weights.
- Score: 27.35925102247588
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural networks often exhibit emergent behavior, where qualitatively new
capabilities arise from scaling up the amount of parameters, training data, or
training steps. One approach to understanding emergence is to find continuous
progress measures that underlie the seemingly discontinuous
qualitative changes. We argue that progress measures can be found via
mechanistic interpretability: reverse-engineering learned behaviors into their
individual components. As a case study, we investigate the recently-discovered
phenomenon of "grokking" exhibited by small transformers trained on modular
addition tasks. We fully reverse engineer the algorithm learned by these
networks, which uses discrete Fourier transforms and trigonometric identities
to convert addition to rotation about a circle. We confirm the algorithm by
analyzing the activations and weights and by performing ablations in Fourier
space. Based on this understanding, we define progress measures that allow us
to study the dynamics of training and split training into three continuous
phases: memorization, circuit formation, and cleanup. Our results show that
grokking, rather than being a sudden shift, arises from the gradual
amplification of structured mechanisms encoded in the weights, followed by the
later removal of memorizing components.
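The algorithm described in the abstract (addition converted to rotation about a circle via trigonometric identities) can be illustrated with a minimal sketch. This is not the authors' code or the network itself: the function name, the modulus p = 113, and the choice of frequencies are assumptions made for illustration only. It embeds each input as a point on the unit circle at a few frequencies, composes the two rotations with the angle-addition identities, and scores each candidate output c by cos(w(a + b - c)), which peaks where c equals (a + b) mod p.

```python
import math


def modular_add_via_rotation(a, b, p=113, freqs=(14, 35, 41, 42, 52)):
    """Compute (a + b) mod p by composing rotations on the unit circle.

    p and freqs are illustrative assumptions; for prime p, any nonzero
    frequencies below p give the same answer.
    """
    logits = [0.0] * p
    for k in freqs:
        w = 2 * math.pi * k / p
        # Represent each input as a point on the circle at frequency k.
        cos_a, sin_a = math.cos(w * a), math.sin(w * a)
        cos_b, sin_b = math.cos(w * b), math.sin(w * b)
        # Angle-addition identities give cos(w(a+b)) and sin(w(a+b)).
        cos_ab = cos_a * cos_b - sin_a * sin_b
        sin_ab = sin_a * cos_b + cos_a * sin_b
        for c in range(p):
            # cos(w(a+b-c)) = cos(w(a+b))cos(wc) + sin(w(a+b))sin(wc)
            logits[c] += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
    # Every frequency contributes 1 exactly when c == (a + b) mod p,
    # so the correct residue has the largest summed logit.
    return max(range(p), key=logits.__getitem__)


print(modular_add_via_rotation(100, 50))
```

Summing over several frequencies matters because a single frequency separates the correct answer from its nearest competitor only by a small margin; with several, the cosine terms interfere constructively only at the correct residue.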
Related papers
- Transformers for Supervised Online Continual Learning [11.270594318662233]
We propose a method that leverages transformers' in-context learning capabilities for online continual learning.
Our method demonstrates significant improvements over previous state-of-the-art results on CLOC, a challenging large-scale real-world benchmark for image geo-localization.
arXiv Detail & Related papers (2024-03-03T16:12:20Z)
- How Transformers Learn Causal Structure with Gradient Descent [49.808194368781095]
Self-attention allows transformers to encode causal structure.
We introduce an in-context learning task that requires learning latent causal structure.
We show that transformers trained on our in-context learning task are able to recover a wide variety of causal structures.
arXiv Detail & Related papers (2024-02-22T17:47:03Z)
- In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z)
- Latent State Models of Training Dynamics [51.88132043461152]
We train models with different random seeds and compute a variety of metrics throughout training.
We then fit a hidden Markov model (HMM) over the resulting sequences of metrics.
We use the HMM representation to study phase transitions and identify latent "detour" states that slow down convergence.
arXiv Detail & Related papers (2023-08-18T13:20:08Z)
- Unsupervised Learning of Invariance Transformations [105.54048699217668]
We develop an algorithmic framework for finding approximate graph automorphisms.
We discuss how this framework can be used to find approximate automorphisms in weighted graphs in general.
arXiv Detail & Related papers (2023-07-24T17:03:28Z)
- Can Transformers Learn to Solve Problems Recursively? [9.5623664764386]
This paper examines the behavior of neural networks learning algorithms relevant to programs and formal verification.
By reconstructing these algorithms, we are able to correctly predict 91 percent of failure cases for one of the approximated functions.
arXiv Detail & Related papers (2023-05-24T04:08:37Z)
- How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding [56.222097640468306]
We provide a mechanistic understanding of how transformers learn "semantic structure".
We show, through a combination of mathematical analysis and experiments on Wikipedia data, that the embedding layer and the self-attention layer encode the topical structure.
arXiv Detail & Related papers (2023-03-07T21:42:17Z)
- Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., learn models by gradient descent in their forward pass.
arXiv Detail & Related papers (2022-12-15T09:21:21Z)
- Unveiling Transformers with LEGO: a synthetic reasoning task [23.535488809197787]
We study how the transformer architecture learns to follow a chain of reasoning.
In some data regimes, the trained transformer finds "shortcut" solutions to follow the chain of reasoning.
We find that one can prevent such shortcuts with appropriate architecture modifications or careful data preparation.
arXiv Detail & Related papers (2022-06-09T06:30:17Z)
- Thalamus: a brain-inspired algorithm for biologically-plausible continual learning and disentangled representations [0.0]
Animals thrive in a constantly changing environment and leverage the temporal structure to learn causal representations.
We introduce a simple algorithm that uses optimization at inference time to generate internal representations of temporal context.
We show that a network trained on a series of tasks using traditional weight updates can infer tasks dynamically.
We then alternate between the weight updates and the latent updates to arrive at Thalamus, a task-agnostic algorithm capable of discovering disentangled representations in a stream of unlabeled tasks.
arXiv Detail & Related papers (2022-05-24T01:29:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.