Finding Highly Interpretable Prompt-Specific Circuits in Language Models
- URL: http://arxiv.org/abs/2602.13483v1
- Date: Fri, 13 Feb 2026 21:41:17 GMT
- Title: Finding Highly Interpretable Prompt-Specific Circuits in Language Models
- Authors: Gabriel Franco, Lucas M. Tassis, Azalea Rohr, Mark Crovella
- Abstract summary: We show that circuits are prompt-specific, even within a fixed task. We introduce ACC++, refinements that extract cleaner, lower-dimensional causal signals inside attention heads from a single forward pass. We develop an automated interpretability pipeline that uses ACC++ signals to surface human-interpretable features.
- Score: 4.768156759829138
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding the internal circuits that language models use to solve tasks remains a central challenge in mechanistic interpretability. Most prior work identifies circuits at the task level by averaging across many prompts, implicitly assuming a single stable mechanism per task. We show that this assumption can obscure a crucial source of structure: circuits are prompt-specific, even within a fixed task. Building on attention causal communication (ACC) (Franco & Crovella, 2025), we introduce ACC++, refinements that extract cleaner, lower-dimensional causal signals inside attention heads from a single forward pass. Like ACC, our approach does not require replacement models (e.g., SAEs) or activation patching; ACC++ further improves circuit precision by reducing attribution noise. Applying ACC++ to indirect object identification (IOI) in GPT-2, Pythia, and Gemma 2, we find there is no single circuit for IOI in any model: different prompt templates induce systematically different mechanisms. Despite this variation, prompts cluster into prompt families with similar circuits, and we propose a representative circuit for each family as a practical unit of analysis. Finally, we develop an automated interpretability pipeline that uses ACC++ signals to surface human-interpretable features and assemble mechanistic explanations for prompt-family behavior. Together, our results recast circuits as a meaningful object of study by shifting the unit of analysis from tasks to prompts, enabling scalable circuit descriptions in the presence of prompt-specific mechanisms.
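The abstract's family-level analysis suggests a simple computational picture: represent each prompt's circuit as a set of edges, compare circuits pairwise, and group similar ones into families. The sketch below is a hypothetical illustration of that idea, not the authors' pipeline; the edge-set representation, the Jaccard distance, and the 0.6 clustering threshold are all assumptions made for the example.

```python
# Hypothetical sketch: cluster prompt-specific circuits into "prompt
# families" by similarity of their edge sets. Not the paper's
# implementation; the representation and threshold are assumptions.
from itertools import combinations

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B| between two circuit edge sets."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def cluster_circuits(circuits: list, threshold: float = 0.6) -> np.ndarray:
    """Group circuits (one edge set per prompt) into families.

    Returns an array of family labels, one per prompt.
    """
    n = len(circuits)
    dist = np.zeros((n, n))
    for i, j in combinations(range(n), 2):
        dist[i, j] = dist[j, i] = jaccard_distance(circuits[i], circuits[j])
    # Average-linkage agglomerative clustering on the condensed distances;
    # prompts closer than `threshold` end up in the same family.
    return fcluster(linkage(squareform(dist), method="average"),
                    t=threshold, criterion="distance")


# Toy usage: the first two prompts share most edges, the third does not.
families = cluster_circuits([
    {("L9.H6", "L10.H0"), ("L5.H5", "L9.H6")},
    {("L9.H6", "L10.H0"), ("L5.H5", "L9.H6"), ("L3.H0", "L5.H5")},
    {("L0.H1", "L2.H2")},
])
print(families)  # e.g., [1 1 2]
```

A representative circuit for each family could then be taken as, say, the edges shared by most members of a cluster, which is the role the per-family representative circuits play in the abstract.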
Related papers
- From Indirect Object Identification to Syllogisms: Exploring Binary Mechanisms in Transformer Circuits [5.1877231178075425]
We investigate the ability of GPT-2 small to handle binary truth values by analyzing its behavior with syllogistic prompts. Through our analysis of several syllogism tasks of varying difficulty, we identify multiple circuits that mechanistically explain GPT-2's logical-reasoning capabilities.
arXiv Detail & Related papers (2025-08-22T05:54:11Z) - Provable In-Context Learning of Nonlinear Regression with Transformers [66.99048542127768]
In-context learning (ICL) is the ability to perform unseen tasks using task-specific prompts without updating parameters. Recent research has actively explored the training dynamics behind ICL, with much of the focus on relatively simple tasks. This paper investigates more complex nonlinear regression tasks, aiming to uncover how transformers acquire in-context learning capabilities.
arXiv Detail & Related papers (2025-07-28T00:09:28Z) - Position-aware Automatic Circuit Discovery [59.64762573617173]
We identify a gap in existing circuit discovery methods, which treat model components as equally relevant across input positions. We propose two improvements to incorporate positionality into circuits, even on tasks containing variable-length examples. Our approach enables fully automated discovery of position-sensitive circuits, yielding better trade-offs between circuit size and faithfulness compared to prior work.
arXiv Detail & Related papers (2025-02-07T00:18:20Z) - Adaptive Circuit Behavior and Generalization in Mechanistic Interpretability [3.138731415322007]
We investigate the generality of the indirect object identification (IOI) circuit in GPT-2 small. Our findings reveal that the circuit generalizes surprisingly well, reusing all of its components and mechanisms while only adding additional input edges.
arXiv Detail & Related papers (2024-11-25T05:32:34Z) - Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models [55.19497659895122]
We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors.
arXiv Detail & Related papers (2024-03-28T17:56:07Z) - Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models [9.56229382432426]
This research aims to reverse engineer transformer models into human-readable representations that implement algorithmic functions.
By applying circuit interpretability analysis, we identify a key sub-circuit in both GPT-2 Small and Llama-2-7B.
We show that this sub-circuit has effects on various math-related prompts, such as intervaled sequences, Spanish number word and month continuation, and natural language word problems.
arXiv Detail & Related papers (2023-11-07T16:58:51Z) - Adaptive Planning Search Algorithm for Analog Circuit Verification [53.97809573610992]
We propose a machine learning (ML) approach that uses fewer simulations.
We show that the proposed approach is able to provide OCCs closer to the specifications for all circuits.
arXiv Detail & Related papers (2023-06-23T12:57:46Z) - What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization [111.55277952086155]
We study In-Context Learning (ICL) by addressing several open questions.
We show that, without updating the neural network parameters, ICL implicitly implements the Bayesian model averaging algorithm (stated concretely in the sketch after this list).
We prove that the error of the pretrained model is bounded by a sum of an approximation error and a generalization error.
arXiv Detail & Related papers (2023-05-30T21:23:47Z) - Inducing Transformer's Compositional Generalization Ability via Auxiliary Sequence Prediction Tasks [86.10875837475783]
Systematic compositionality is an essential mechanism in human language, allowing the recombination of known parts to create novel expressions.
Existing neural models have been shown to lack this basic ability in learning symbolic structures.
We propose two auxiliary sequence prediction tasks that track the progress of function and argument semantics.
arXiv Detail & Related papers (2021-09-30T16:41:19Z)
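For the "What and How does In-Context Learning Learn?" entry above, the Bayesian model averaging claim can be stated concretely. The formula below is the textbook form of BMA, not notation taken from that paper: the predictive distribution marginalizes over candidate models, weighting each by its posterior given the in-context examples.

```latex
% Standard Bayesian model averaging: the posterior predictive
% marginalizes over candidate models m, each weighted by its
% posterior given the in-context demonstrations D = {(x_i, y_i)}.
\[
  p(y \mid x, D) \;=\; \sum_{m} p(y \mid x, m, D)\, p(m \mid D),
  \qquad
  p(m \mid D) \;\propto\; p(D \mid m)\, p(m).
\]
```

The claim in that entry is that a pretrained transformer, conditioning only on the prompt and without any parameter updates, implicitly computes this averaged predictive distribution.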