Related papers: Is Grokking Worthwhile? Functional Analysis and Transferability of Generalization Circuits in Transformers

URL: http://arxiv.org/abs/2601.09049v1
Date: Wed, 14 Jan 2026 00:40:35 GMT
Title: Is Grokking Worthwhile? Functional Analysis and Transferability of Generalization Circuits in Transformers
Authors: Kaiyu He, Zhang Mian, Peilin Wu, Xinya Du, Zhiyu Chen
Abstract summary: We conduct a study to evaluate the Generalization Circuit's role in knowledge assimilation and transfer. We argue that grokking is the process of integrating memorized atomic facts into a naturally established reasoning path.
Score: 15.965423731432422
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While Large Language Models (LLMs) excel at factual retrieval, they often struggle with the "curse of two-hop reasoning" in compositional tasks. Recent research suggests that parameter-sharing transformers can bridge this gap by forming a "Generalization Circuit" during a prolonged "grokking" phase. A fundamental question arises: Is a grokked model superior to its non-grokked counterparts on downstream tasks? Furthermore, is the extensive computational cost of waiting for the grokking phase worthwhile? In this work, we conduct a mechanistic study to evaluate the Generalization Circuit's role in knowledge assimilation and transfer. We demonstrate that: (i) The inference paths established by non-grokked and grokked models for in-distribution compositional queries are identical. This suggests that the "Generalization Circuit" does not represent the sudden acquisition of a new reasoning paradigm. Instead, we argue that grokking is the process of integrating memorized atomic facts into a naturally established reasoning path. (ii) Achieving high accuracy on unseen cases after prolonged training and the formation of a certain reasoning path are not bound; they can occur independently under specific data regimes. (iii) Even a mature circuit exhibits limited transferability when integrating new knowledge, suggesting that "grokked" Transformers do not achieve a full mastery of compositional logic.
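To make the abstract's terminology concrete, the following is a minimal sketch of what a two-hop compositional query over atomic facts looks like. It is not taken from the paper: the entity names, relation names, and the `two_hop` helper are all hypothetical, and a symbolic lookup like this only illustrates the task that a transformer would have to solve implicitly in its weights.

```python
# Hypothetical toy knowledge base: (head entity, relation) -> tail entity.
# These are the "atomic facts"; none of the names come from the paper.
ATOMIC_FACTS = {
    ("alice", "mother"): "beth",
    ("beth", "employer"): "acme",
    ("carl", "mother"): "dana",
    ("dana", "employer"): "initech",
}


def two_hop(head, r1, r2, facts=ATOMIC_FACTS):
    """Answer the query (head, r1, r2) by chaining two atomic lookups.

    The first hop resolves the bridge entity; the second hop resolves the
    final answer. Returns None if either atomic fact is missing.
    """
    bridge = facts.get((head, r1))
    if bridge is None:
        return None
    return facts.get((bridge, r2))


# "Who is the employer of Alice's mother?"
print(two_hop("alice", "mother", "employer"))
```

In this framing, a compositional query is "unseen" when its two atomic facts were each available during training but the composed query itself was not.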

Related papers

Multi-head Transformers Provably Learn Symbolic Multi-step Reasoning via Gradient Descent [66.78052387054593]
This work investigates how transformers learn to solve symbolic multi-step reasoning problems through chain-of-thought processes. We analyze two intertwined tasks: a backward reasoning task, where the model outputs a path from a goal node to the root, and a more complex forward reasoning task. We show that trained one-layer transformers can provably solve both tasks with generalization guarantees to unseen trees.
arXiv Detail & Related papers (2025-08-11T17:40:47Z)
Provable In-Context Learning of Nonlinear Regression with Transformers [66.99048542127768]
In-context learning (ICL) is the ability to perform unseen tasks using task-specific prompts without updating parameters. Recent research has actively explored the training dynamics behind ICL, with much of the focus on relatively simple tasks. This paper investigates more complex nonlinear regression tasks, aiming to uncover how transformers acquire in-context learning capabilities.
arXiv Detail & Related papers (2025-07-28T00:09:28Z)
How do Transformers Learn Implicit Reasoning? [67.02072851088637]
We study how implicit multi-hop reasoning emerges by training transformers from scratch in a controlled symbolic environment. We find that training with atomic triples is not necessary but accelerates learning, and that second-hop generalization relies on query-level exposure to specific compositional structures.
arXiv Detail & Related papers (2025-05-29T17:02:49Z)
Born a Transformer -- Always a Transformer? On the Effect of Pretraining on Architectural Abilities [58.742178800799614]
We study a family of $\textit{retrieval}$ and $\textit{copying}$ tasks inspired by Liu et al. We observe an $\textit{induction-versus-anti-induction}$ asymmetry, where pretrained models are better at retrieving tokens to the right (induction) than the left (anti-induction) of a query token. Mechanistic analysis reveals that this asymmetry is connected to the differences in the strength of induction versus anti-induction circuits within pretrained transformers.
arXiv Detail & Related papers (2025-05-27T21:36:50Z)
Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers [9.50669909278749]
We extend grokking to real-world factual data and address the challenge of dataset sparsity. Surprisingly, we find that even factually incorrect synthetic data can strengthen emergent reasoning circuits. Our approach achieves up to 95-100% accuracy on multi-hop reasoning benchmarks.
arXiv Detail & Related papers (2025-04-29T13:33:29Z)
Interpreting Affine Recurrence Learning in GPT-style Transformers [54.01174470722201]
In-context learning allows GPT-style transformers to generalize during inference without modifying their weights. This paper focuses specifically on their ability to learn and predict affine recurrences as an ICL task. We analyze the model's internal operations using both empirical and theoretical approaches.
arXiv Detail & Related papers (2024-10-22T21:30:01Z)
Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis [82.51626700527835]
Chain-of-Thought (CoT) is an efficient method that enables the reasoning ability of large language models by augmenting the query using examples with multiple intermediate steps. We show that, in contrast to the theoretical success of CoT, in-context learning (ICL) without intermediate steps may fail to provide an accurate generalization when CoT does.
arXiv Detail & Related papers (2024-10-03T03:12:51Z)
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization [22.033370572209744]
We study whether transformers can learn to implicitly reason over parametric knowledge. We focus on two representative reasoning types, composition and comparison. We find that transformers can learn implicit reasoning, but only through grokking.
arXiv Detail & Related papers (2024-05-23T21:42:19Z)
Towards an Understanding of Stepwise Inference in Transformers: A Synthetic Graph Navigation Model [19.826983068662106]
We propose to study autoregressive Transformer models on a synthetic task that embodies the multi-step nature of problems where stepwise inference is generally most useful. Despite its simplicity, we find we can empirically reproduce and analyze several phenomena observed at scale.
arXiv Detail & Related papers (2024-02-12T16:25:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.