From Indirect Object Identification to Syllogisms: Exploring Binary Mechanisms in Transformer Circuits
- URL: http://arxiv.org/abs/2508.16109v1
- Date: Fri, 22 Aug 2025 05:54:11 GMT
- Title: From Indirect Object Identification to Syllogisms: Exploring Binary Mechanisms in Transformer Circuits
- Authors: Karim Saraipour, Shichang Zhang
- Abstract summary: We investigate the ability of GPT-2 small to handle binary truth values by analyzing its behavior with syllogistic prompts. Through our analysis of several syllogism tasks of varying difficulty, we identify multiple circuits that mechanistically explain GPT-2's logical-reasoning capabilities.
- Score: 5.1877231178075425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based language models (LMs) can perform a wide range of tasks, and mechanistic interpretability (MI) aims to reverse engineer the components responsible for task completion to understand their behavior. Previous MI research has focused on linguistic tasks such as Indirect Object Identification (IOI). In this paper, we investigate the ability of GPT-2 small to handle binary truth values by analyzing its behavior with syllogistic prompts, e.g., "Statement A is true. Statement B matches statement A. Statement B is", which requires more complex logical reasoning compared to IOI. Through our analysis of several syllogism tasks of varying difficulty, we identify multiple circuits that mechanistically explain GPT-2's logical-reasoning capabilities and uncover binary mechanisms that facilitate task completion, including the ability to produce a negated token not present in the input prompt through negative heads. Our evaluation using a faithfulness metric shows that a circuit comprising five attention heads achieves over 90% of the original model's performance. By relating our findings to IOI analysis, we provide new insights into the roles of specific attention heads and MLPs in LMs. These insights contribute to a broader understanding of model reasoning and support future research in mechanistic interpretability.
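To make the faithfulness evaluation concrete, here is a minimal sketch of zero-ablating all attention heads outside a candidate circuit in TransformerLens. The five (layer, head) pairs are hypothetical placeholders, since the abstract does not list the paper's heads, and the paper's ablation scheme and metric may differ (e.g., mean ablation rather than zero ablation).
```python
# Sketch: zero-ablation faithfulness for a small attention-head circuit in
# GPT-2 small, using TransformerLens. The CIRCUIT heads are HYPOTHETICAL
# placeholders; the paper's actual heads and ablation scheme may differ.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small
CIRCUIT = {(9, 6), (9, 9), (10, 0), (10, 7), (11, 10)}  # placeholder (layer, head) pairs

prompt = "Statement A is true. Statement B matches statement A. Statement B is"
tokens = model.to_tokens(prompt)
t_true = model.to_single_token(" true")
t_false = model.to_single_token(" false")

def ablate_non_circuit(z, hook):
    # z: [batch, pos, head, d_head]; zero the output of every head outside
    # the candidate circuit (MLPs are left untouched).
    for head in range(model.cfg.n_heads):
        if (hook.layer(), head) not in CIRCUIT:
            z[:, :, head, :] = 0.0
    return z

hooks = [(utils.get_act_name("z", layer), ablate_non_circuit)
         for layer in range(model.cfg.n_layers)]
with torch.no_grad():
    full_logits = model(tokens)[0, -1]
    circ_logits = model.run_with_hooks(tokens, fwd_hooks=hooks)[0, -1]

def logit_diff(logits):
    # Difference between the correct and incorrect truth-value logits.
    return (logits[t_true] - logits[t_false]).item()

# Faithfulness: fraction of the full model's logit difference retained.
print(f"faithfulness: {logit_diff(circ_logits) / logit_diff(full_logits):.1%}")
```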
Related papers
- Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs [100.02824137397464]
We investigate how Large Language Models adapt their internal representations when encountering inputs of increasing difficulty. We reveal a consistent and quantifiable phenomenon: as task difficulty increases, the last hidden states of LLMs become substantially sparser. This sparsity-difficulty relation is observable across diverse models and domains. A minimal sketch of one way to operationalize this measurement follows.
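The sketch below measures the fraction of near-zero entries in the final hidden state as a sparsity proxy; the threshold and the metric itself are assumptions, not the paper's definition.
```python
# Rough sketch: fraction of near-zero dimensions in the last hidden state,
# as a sparsity proxy. Threshold and metric are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

def last_state_sparsity(text: str, eps: float = 1e-2) -> float:
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    h = out.hidden_states[-1][0, -1]  # final layer, final token position
    return (h.abs() < eps).float().mean().item()  # fraction of near-zero dims

print(last_state_sparsity("Two plus two equals"))
print(last_state_sparsity("The integral of x times exp(x^2) from 0 to 1 equals"))
```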
arXiv Detail & Related papers (2026-03-03T18:48:15Z)
- Emergence of Minimal Circuits for Indirect Object Identification in Attention-Only Transformers [0.10152838128195467]
We train small, attention-only transformers from scratch on a symbolic version of the Indirect Object Identification task. A single-layer model with only two attention heads achieves perfect IOI accuracy, despite lacking residuals and normalization layers. A two-layer, one-head model achieves similar performance by composing information across layers through query-value interactions. A toy sketch of such a model is below.
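The following toy sketch trains a one-layer, two-head, attention-only model (no residual stream, no LayerNorm, no MLP) on a guessed symbolic IOI format; the data format and hyperparameters are assumptions, not the paper's setup.
```python
# Toy sketch: attention-only model on a GUESSED symbolic IOI format
# [nameA, nameB, SEP, nameA] -> nameB. All details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

V, D, T = 20, 32, 4  # vocab size (last id = SEP), embedding dim, seq length

def make_batch(n: int = 64):
    a, b = torch.randint(0, V - 1, (2, n))
    b = torch.where(a == b, (b + 1) % (V - 1), b)  # force distinct names
    sep = torch.full((n,), V - 1)
    return torch.stack([a, b, sep, a], dim=1), b   # answer: the non-repeated name

class AttnOnly(nn.Module):
    def __init__(self, heads: int = 2):
        super().__init__()
        self.emb = nn.Embedding(V, D)
        self.pos = nn.Embedding(T, D)
        self.attn = nn.MultiheadAttention(D, heads, batch_first=True)
        self.unembed = nn.Linear(D, V)

    def forward(self, x):
        h = self.emb(x) + self.pos(torch.arange(x.shape[1]))
        h, _ = self.attn(h, h, h)      # single attention layer, nothing else
        return self.unembed(h[:, -1])  # predict from the final position

model = AttnOnly()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(2000):
    x, y = make_batch()
    loss = F.cross_entropy(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
```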
arXiv Detail & Related papers (2025-10-28T22:25:19Z)
- How do Transformers Learn Implicit Reasoning? [67.02072851088637]
We study how implicit multi-hop reasoning emerges by training transformers from scratch in a controlled symbolic environment. We find that training with atomic triples is not necessary but accelerates learning, and that second-hop generalization relies on query-level exposure to specific compositional structures.
arXiv Detail & Related papers (2025-05-29T17:02:49Z)
- A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems [93.8285345915925]
Reasoning is a fundamental cognitive process that enables logical inference, problem-solving, and decision-making. With the rapid advancement of large language models (LLMs), reasoning has emerged as a key capability that distinguishes advanced AI systems. We categorize existing methods along two dimensions: (1) Regimes, which define the stage at which reasoning is achieved; and (2) Architectures, which determine the components involved in the reasoning process.
arXiv Detail & Related papers (2025-04-12T01:27:49Z)
- Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning [9.795934690403374]
It is still unclear which multi-step reasoning mechanisms are used by language models to solve such tasks. We employ circuit analysis and self-influence functions to evaluate the changing importance of each token throughout the reasoning process. We demonstrate that the underlying circuits reveal a human-interpretable reasoning process used by the model.
arXiv Detail & Related papers (2025-02-13T07:19:05Z)
- Latent Causal Probing: A Formal Perspective on Probing with Causal Models of Data [3.376269351435396]
We develop a formal perspective on probing using structural causal models (SCMs).
We extend a recent study of LMs in the context of a synthetic grid-world navigation task.
Our techniques provide robust empirical evidence for the ability of LMs to induce the latent concepts underlying text.
arXiv Detail & Related papers (2024-07-18T17:59:27Z)
- Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models [107.07851578154242]
Language models (LMs) have strong multi-step (i.e., procedural) reasoning capabilities.
It is unclear whether LMs perform tasks by cheating with answers memorized from the pretraining corpus, or via a multi-step reasoning mechanism.
We show that MechanisticProbe is able to recover the reasoning tree from the model's attentions for most examples.
arXiv Detail & Related papers (2023-10-23T01:47:29Z)
- Circuit Component Reuse Across Tasks in Transformer Language Models [32.2976613483151]
We present evidence that insights can indeed generalize across tasks.
We show that the process underlying both tasks is functionally very similar, with roughly a 78% overlap in in-circuit attention heads.
Our results provide evidence that it may yet be possible to explain large language models' behavior in terms of a relatively small number of interpretable task-general algorithmic building blocks and computational components.
arXiv Detail & Related papers (2023-10-12T22:12:28Z)
- Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models [56.34029644009297]
Large language models (LLMs) have demonstrated the ability to overcome various limitations of formal Knowledge Representation (KR) systems.
LLMs excel most in abductive reasoning, followed by deductive reasoning, while they are least effective at inductive reasoning.
We study single-task training, multi-task training, and "chain-of-thought" knowledge-distillation fine-tuning to assess model performance.
arXiv Detail & Related papers (2023-10-02T01:00:50Z)
- A Closer Look at Reward Decomposition for High-Level Robotic Explanations [18.019811754800767]
We propose an explainable Q-Map learning framework that combines reward decomposition with abstracted action spaces.
We demonstrate the effectiveness of our framework through quantitative and qualitative analysis of two robotic scenarios.
arXiv Detail & Related papers (2023-04-25T16:01:42Z)
- Competence-Based Analysis of Language Models [21.43498764977656]
CALM (Competence-based Analysis of Language Models) is designed to investigate LLM competence in the context of specific tasks. We develop a new approach for performing causal probing interventions using gradient-based adversarial attacks. We carry out a case study of CALM using these interventions to analyze and compare LLM competence across a variety of lexical inference tasks. A hedged sketch of such an intervention follows.
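The sketch below shows what a gradient-based perturbation of input embeddings can look like; the prompt, objective, and step size are illustrative guesses rather than CALM's actual procedure.
```python
# Sketch: one FGSM-style gradient step on input embeddings, loosely in the
# spirit of gradient-based causal probing. Details are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("A robin is a", return_tensors="pt").input_ids
emb = model.get_input_embeddings()(ids).detach().requires_grad_(True)
target = tok(" bird", return_tensors="pt").input_ids[0, 0]

# One signed-gradient step that pushes the next-token prediction toward " bird".
loss = -model(inputs_embeds=emb).logits[0, -1, target]
loss.backward()
adv = emb - 0.05 * emb.grad.sign()

with torch.no_grad():
    before = model(inputs_embeds=emb).logits[0, -1, target].item()
    after = model(inputs_embeds=adv).logits[0, -1, target].item()
print(f"target logit before/after: {before:.2f} / {after:.2f}")
```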
arXiv Detail & Related papers (2023-03-01T08:53:36Z)
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small [68.879023473838]
We present an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI).
To our knowledge, this investigation is the largest end-to-end attempt at reverse-engineering a natural behavior "in the wild" in a language model.
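The paper's headline behavior is easy to check with its logit-difference metric between the indirect object and the subject; a minimal sketch on the standard IOI example prompt follows.
```python
# Minimal IOI logit-difference check on GPT-2 small via TransformerLens,
# using the standard IOI example prompt from the literature.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "When John and Mary went to the store, John gave a drink to"
logits = model(model.to_tokens(prompt))[0, -1]
io = model.to_single_token(" Mary")  # indirect object (the correct answer)
s = model.to_single_token(" John")   # subject (the competing name)
print("logit diff:", (logits[io] - logits[s]).item())  # positive: IOI solved
```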
arXiv Detail & Related papers (2022-11-01T17:08:44Z)
- AR-LSAT: Investigating Analytical Reasoning of Text [57.1542673852013]
We study the challenge of analytical reasoning of text and introduce a new dataset consisting of questions from the Law School Admission Test from 1991 to 2016.
We analyze what knowledge, understanding, and reasoning abilities are required to do well on this task.
arXiv Detail & Related papers (2021-04-14T02:53:32Z)