Explaining Grokking in Transformers through the Lens of Inductive Bias
- URL: http://arxiv.org/abs/2602.06702v1
- Date: Fri, 06 Feb 2026 13:45:57 GMT
- Title: Explaining Grokking in Transformers through the Lens of Inductive Bias
- Authors: Jaisidh Singh, Diganta Misra, Antonio Orvieto
- Abstract summary: We investigate grokking in transformers through the lens of inductive bias. We first show that architectural choices such as the position of Layer Normalization (LN) strongly modulate grokking speed. We study how different optimization settings modulate grokking, inducing distinct interpretations of previously proposed controls such as readout scale.
- Score: 18.96337447499985
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate grokking in transformers through the lens of inductive bias: dispositions arising from architecture or optimization that let the network prefer one solution over another. We first show that architectural choices such as the position of Layer Normalization (LN) strongly modulate grokking speed. This modulation is explained by isolating how LN on specific pathways shapes shortcut learning and attention entropy. Subsequently, we study how different optimization settings modulate grokking, inducing distinct interpretations of previously proposed controls such as readout scale. In particular, we find that using readout scale as a control for lazy training can be confounded by learning rate and weight decay in our setting. Accordingly, we show that features evolve continuously throughout training, suggesting grokking in transformers can be more nuanced than a lazy-to-rich transition of the learning regime. Finally, we show how generalization predictably emerges with feature compressibility in grokking, across different modulators of inductive bias. Our code is released at https://tinyurl.com/y52u3cad.
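Since the abstract hinges on where LN sits in the block, a minimal sketch may help. The PyTorch block below is an illustration of the standard Pre-LN vs. Post-LN placements, not the authors' released code (see their link above); the module and argument names are my own.

```python
import torch.nn as nn

class Block(nn.Module):
    """One transformer block. pre_ln=True normalizes sublayer inputs;
    pre_ln=False (Post-LN) normalizes after each residual addition."""
    def __init__(self, d_model=64, n_heads=4, pre_ln=True):
        super().__init__()
        self.pre_ln = pre_ln
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                 nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        if self.pre_ln:   # Pre-LN: LN acts only on the sublayer pathway
            h = self.ln1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            x = x + self.mlp(self.ln2(x))
        else:             # Post-LN: LN sits on the residual stream itself
            x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])
            x = self.ln2(x + self.mlp(x))
        return x
```

Per the abstract, which pathway the LN acts on (sublayer input vs. residual stream) is what shapes shortcut learning and attention entropy, and with them grokking speed.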
Related papers
- Softmax $\geq$ Linear: Transformers may learn to classify in-context by kernel gradient descent [17.629377639287775]
We focus on understanding the learning algorithm transformers use to learn from context. We find that transformers still learn to do gradient descent in-context, though on functionals in the kernel feature space. These theoretical findings suggest a greater adaptability to context for softmax attention.
arXiv Detail & Related papers (2025-10-12T03:20:27Z)
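As a toy picture of "gradient descent on functionals in the kernel feature space" from the paper above, the numpy sketch below runs functional gradient descent for in-context regression with an RBF kernel; it is my own construction for illustration, not code from the paper.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """RBF kernel matrix between row-vector sets a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_gd_predict(X_ctx, y_ctx, x_q, lr=0.1, steps=50, gamma=1.0):
    """Functional GD on squared loss; f is parameterized by dual
    coefficients alpha over the context points."""
    K = rbf(X_ctx, X_ctx, gamma)
    alpha = np.zeros(len(X_ctx))           # f_0 = 0
    for _ in range(steps):
        resid = K @ alpha - y_ctx          # f_t(x_i) - y_i
        alpha -= lr * resid                # functional gradient step
    return rbf(x_q, X_ctx, gamma) @ alpha  # f_T at the query points

# toy in-context task: context pairs define the function to regress
rng = np.random.default_rng(0)
X = rng.normal(size=(16, 2)); y = np.sin(X[:, 0])
print(kernel_gd_predict(X, y, X[:3]))      # predictions near y[:3]
```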
- On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z)
- Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation [105.22961467028234]
Skip connections and normalisation layers are ubiquitous in the training of Deep Neural Networks (DNNs).
Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them.
But these approaches are incompatible with the self-attention layers present in transformers.
arXiv Detail & Related papers (2023-02-20T21:26:25Z)
- A Length-Extrapolatable Transformer [98.54835576985664]
We focus on length extrapolation, i.e., training on short texts while evaluating longer sequences.
We introduce a relative position embedding to explicitly maximize attention resolution.
We evaluate different Transformer variants with language modeling.
arXiv Detail & Related papers (2022-12-20T18:56:20Z)
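To make "relative position embedding" from the paper above concrete, here is a generic numpy sketch of attention logits augmented with a learned relative-position bias. The paper's actual extrapolatable embedding is more elaborate; treat this only as the baseline mechanism it builds on, with hypothetical names.

```python
import numpy as np

def attn_with_relative_bias(q, k, v, bias_table):
    """Scaled dot-product attention with a learned bias b[i-j] added to
    the logits; bias_table holds one entry per relative offset."""
    T, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    offsets = np.arange(T)[:, None] - np.arange(T)[None, :]  # i - j
    logits += bias_table[offsets + T - 1]  # shift offsets to be >= 0
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

T, d = 8, 16
rng = np.random.default_rng(1)
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))
bias = rng.normal(size=2 * T - 1)          # learned in practice
out = attn_with_relative_bias(q, k, v, bias)
```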
- Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., they learn models by gradient descent in their forward pass.
arXiv Detail & Related papers (2022-12-15T09:21:21Z)
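The "mesa-optimizer" claim above means the forward pass emulates a computation like least-squares gradient descent on the context. This numpy sketch is the reference computation a trained transformer is shown to match, written here for illustration (compare the kernelized variant sketched earlier):

```python
import numpy as np

def icl_as_gd(X_ctx, y_ctx, x_q, lr=0.5, steps=1):
    """Run GD on the in-context least-squares loss from w = 0, then
    answer the query; one step already gives w = lr * X^T y / n."""
    w = np.zeros(X_ctx.shape[1])
    for _ in range(steps):
        w -= lr * X_ctx.T @ (X_ctx @ w - y_ctx) / len(X_ctx)
    return x_q @ w

rng = np.random.default_rng(3)
X = rng.normal(size=(32, 4)); w_true = rng.normal(size=4)
print(icl_as_gd(X, X @ w_true, X[:2]))  # moves toward X[:2] @ w_true
```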
- Learnable Gabor modulated complex-valued networks for orientation robustness [4.024850952459758]
Learnable Gabor Convolutional Networks (LGCNs) are parameter-efficient and offer increased model complexity.
We investigate the robustness of complex-valued convolutional weights with learned Gabor filters to enable orientation transformations.
arXiv Detail & Related papers (2020-11-23T21:22:27Z)
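A rough numpy sketch of the Gabor-modulation idea from the paper above: a learned kernel is multiplied elementwise by an oriented Gabor filter whose parameters (orientation, frequency, bandwidth) would be learnable in an LGCN. The complex-valued part of the paper is omitted, and the function names are my own.

```python
import numpy as np

def gabor(size, sigma, theta, freq, psi=0.0):
    """Real Gabor filter: an oriented sinusoid under a Gaussian envelope."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    xr = xx * np.cos(theta) + yy * np.sin(theta)   # rotate coordinates
    env = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return env * np.cos(2 * np.pi * freq * xr + psi)

# modulation: elementwise product of a learned kernel and a Gabor filter,
# so orientation is steered by the (learnable) theta parameter
rng = np.random.default_rng(2)
weight = rng.normal(size=(5, 5))   # stand-in for a learned conv kernel
modulated = weight * gabor(5, sigma=2.0, theta=np.pi / 4, freq=0.25)
```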
- Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent [44.44543743806831]
We study the tendency for transformer parameters to grow in magnitude during training.
We prove that, as the parameters grow in magnitude, the network approximates a discretized network with saturated activation functions.
Our results suggest saturation is a new characterization of an inductive bias implicit in GD, of particular interest for NLP.
arXiv Detail & Related papers (2020-10-19T17:40:38Z)
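The saturation effect described above is easy to reproduce in miniature: scaling softmax inputs by a growing norm drives the output toward a one-hot argmax, i.e., a discretized, "saturated" activation. A small numpy check (my own, not from the paper):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])
for c in (1, 10, 100):   # growing parameter norm ~ scaling the logits
    print(c, softmax(c * z).round(3))
# c=1 gives diffuse weights; c=100 is ~one-hot: saturated (hard) attention
```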
- Unsupervised Controllable Generation with Self-Training [90.04287577605723]
Controllable generation with GANs remains a challenging research problem.
We propose an unsupervised framework to learn a distribution of latent codes that control the generator through self-training.
Our framework exhibits better disentanglement compared to other variants such as the variational autoencoder.
arXiv Detail & Related papers (2020-07-17T21:50:35Z)
- On Layer Normalization in the Transformer Architecture [112.40350994368741]
We first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters.
We show in experiments that Pre-LN Transformers without the warm-up stage can reach results comparable to the baselines.
arXiv Detail & Related papers (2020-02-12T00:33:03Z)
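For reference, the learning-rate warm-up this paper analyzes is just a ramp on the schedule; a minimal sketch (placeholder numbers, not the paper's settings):

```python
def warmup_lr(step, base_lr=1e-3, warmup_steps=4000):
    """Linear warm-up to base_lr, then constant. The paper argues
    Post-LN needs this ramp, while Pre-LN can train without it."""
    return base_lr * min(1.0, step / warmup_steps)

print([warmup_lr(s) for s in (0, 2000, 4000, 8000)])
# [0.0, 0.0005, 0.001, 0.001]
```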