Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
- URL: http://arxiv.org/abs/2405.15071v3
- Date: Wed, 30 Oct 2024 18:47:53 GMT
- Title: Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
- Authors: Boshi Wang, Xiang Yue, Yu Su, Huan Sun
- Abstract summary: We study whether transformers can learn to implicitly reason over parametric knowledge.
We focus on two representative reasoning types, composition and comparison.
We find that transformers can learn implicit reasoning, but only through grokking.
- Abstract: We study whether transformers can learn to implicitly reason over parametric knowledge, a skill that even the most capable language models struggle with. Focusing on two representative reasoning types, composition and comparison, we consistently find that transformers can learn implicit reasoning, but only through grokking, i.e., extended training far beyond overfitting. The levels of generalization also vary across reasoning types: when faced with out-of-distribution examples, transformers fail to systematically generalize for composition but succeed for comparison. We delve into the model's internals throughout training, conducting analytical experiments that reveal: 1) the mechanism behind grokking, such as the formation of the generalizing circuit and its relation to the relative efficiency of generalizing and memorizing circuits, and 2) the connection between systematicity and the configuration of the generalizing circuit. Our findings guide data and training setup to better induce implicit reasoning and suggest potential improvements to the transformer architecture, such as encouraging cross-layer knowledge sharing. Furthermore, we demonstrate that for a challenging reasoning task with a large search space, GPT-4-Turbo and Gemini-1.5-Pro based on non-parametric memory fail badly regardless of prompting styles or retrieval augmentation, while a fully grokked transformer can achieve near-perfect accuracy, showcasing the power of parametric memory for complex reasoning.
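To make the composition setting concrete, here is a minimal sketch of how a two-hop composition dataset of the kind described in the abstract might be built: random atomic facts (head, relation, tail) are generated, inferred two-hop facts are derived by chaining two atomic facts through a bridge entity, and only a fraction of the inferred facts is seen during training. The vocabulary sizes, split ratio, and helper names below are illustrative assumptions, not the paper's exact setup.

```python
import random

# Illustrative sketch (assumed setup, not the paper's exact configuration):
# a two-hop "composition" task over randomly generated atomic facts.
random.seed(0)
entities = [f"e{i}" for i in range(500)]
relations = [f"r{i}" for i in range(20)]

# Atomic facts: every (head, relation) pair maps to a random tail entity.
atomic = {(h, r): random.choice(entities) for h in entities for r in relations}

def compose(h, r1, r2):
    """Ground-truth answer for the two-hop query (h, r1, r2) via a bridge entity."""
    bridge = atomic[(h, r1)]
    return atomic[(bridge, r2)]

# Two-hop queries; the model would be trained on all atomic facts plus a fraction
# of inferred facts, and evaluated on held-out inferred facts to probe whether it
# composes stored (parametric) facts rather than memorizing answers.
queries = [(h, r1, r2) for h in entities for r1 in relations[:3] for r2 in relations[:3]]
random.shuffle(queries)
split = int(0.3 * len(queries))
train_inferred = [(q, compose(*q)) for q in queries[:split]]
test_inferred = [(q, compose(*q)) for q in queries[split:]]
print(len(atomic), len(train_inferred), len(test_inferred))
```

With data of this shape, grokking would show up as accuracy on the held-out inferred facts rising only long after training accuracy has saturated.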
Related papers
- Are Transformers Able to Reason by Connecting Separated Knowledge in Training Data? [55.90575874130038]
Humans exhibit remarkable compositional reasoning by integrating knowledge from various sources.
We introduce a synthetic learning task to validate the potential of Transformers in replicating this skill.
We find that few-shot Chain-of-Thought prompting enables Transformers to perform compositional reasoning on FTCT.
arXiv Detail & Related papers (2025-01-27T08:34:38Z)
- Enhancing Transformers for Generalizable First-Order Logical Entailment [51.04944136538266]
This paper investigates the generalizable first-order logical reasoning ability of transformers with their parameterized knowledge.
The first-order reasoning capability of transformers is assessed through their ability to perform first-order logical entailment.
We propose a more sophisticated, logic-aware architecture, TEGA, to enhance the capability for generalizable first-order logical entailment in transformers.
arXiv Detail & Related papers (2025-01-01T07:05:32Z)
- Adversarial Robustness of In-Context Learning in Transformers for Linear Regression [23.737606860443705]
This work investigates the vulnerability of in-context learning in transformers to hijacking attacks, focusing on the setting of linear regression tasks.
We first prove that single-layer linear transformers, known to implement gradient descent in-context, are non-robust and can be manipulated to output arbitrary predictions.
We then demonstrate that adversarial training enhances transformers' robustness against hijacking attacks, even when just applied during finetuning.
arXiv Detail & Related papers (2024-11-07T21:25:58Z)
- Interpreting Affine Recurrence Learning in GPT-style Transformers [54.01174470722201]
In-context learning allows GPT-style transformers to generalize during inference without modifying their weights.
This paper focuses specifically on their ability to learn and predict affine recurrences as an ICL task.
We analyze the model's internal operations using both empirical and theoretical approaches.
arXiv Detail & Related papers (2024-10-22T21:30:01Z)
- In-Context Learning with Representations: Contextual Generalization of Trained Transformers [66.78052387054593]
In-context learning (ICL) refers to the capability of pretrained large language models to learn a new task from just a few examples provided during inference.
This paper investigates the training dynamics of transformers by gradient descent through the lens of non-linear regression tasks.
arXiv Detail & Related papers (2024-08-19T16:47:46Z)
- When can transformers reason with abstract symbols? [25.63285482210457]
We prove that for any relational reasoning task in a large family of tasks, transformers learn the abstract relations and generalize to the test set.
This is in contrast to classical fully-connected networks, which we prove fail to learn to reason.
arXiv Detail & Related papers (2023-10-15T06:45:38Z)
- Birth of a Transformer: A Memory Viewpoint [25.294093283819443]
Large language models based on transformers have achieved great empirical successes.
As they are deployed more widely, there is a growing need to better understand their internal mechanisms in order to make them more reliable.
We study how transformers balance global (in-weights) knowledge and context-specific (in-context) knowledge by considering a synthetic setup where tokens are generated from either global or context-specific bigrams.
arXiv Detail & Related papers (2023-06-01T15:30:33Z)
- Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., they learn models by gradient descent in their forward pass (a minimal numerical sketch of this correspondence appears after this list).
arXiv Detail & Related papers (2022-12-15T09:21:21Z)
- On the Power of Saturated Transformers: A View from Circuit Complexity [87.20342701232869]
We show that saturated transformers transcend the limitations of hard-attention transformers.
The jump from hard to saturated attention can be understood as increasing the transformer's effective circuit depth by a factor of $O(\log n)$.
arXiv Detail & Related papers (2021-06-30T17:09:47Z)
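To illustrate the gradient-descent-in-the-forward-pass correspondence mentioned in the entry on "Transformers learn in-context by gradient descent", here is a minimal numerical sketch under simplifying assumptions (a zero-initialized linear model, and unnormalized linear attention whose keys and queries are the input points and whose values are the targets): the attention output for a query point coincides with the prediction of one gradient-descent step on in-context least squares. The dimensions, learning rate, and variable names are illustrative, not taken from the paper.

```python
import numpy as np

# Illustrative check (assumed simplified setting, not the paper's exact construction).
rng = np.random.default_rng(0)
d, n, lr = 5, 32, 0.1
X = rng.normal(size=(n, d))            # in-context inputs x_1..x_n
w_true = rng.normal(size=d)
y = X @ w_true                         # in-context targets y_1..y_n
x_q = rng.normal(size=d)               # query input

# One gradient-descent step on L(w) = 1/2 * sum_i (w . x_i - y_i)^2, starting from w = 0:
# the gradient at w = 0 is -sum_i y_i * x_i, so the updated weights are lr * sum_i y_i * x_i.
w_gd = lr * (y @ X)
pred_gd = w_gd @ x_q

# Unnormalized linear attention: sum over examples of value_i * <key_i, query>,
# with values y_i, keys x_i, and query x_q, scaled by the same learning rate.
pred_attn = lr * sum(y_i * (x_i @ x_q) for x_i, y_i in zip(X, y))

print(np.isclose(pred_gd, pred_attn))  # True: the two predictions match
```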