Query Circuits: Explaining How Language Models Answer User Prompts
- URL: http://arxiv.org/abs/2509.24808v1
- Date: Mon, 29 Sep 2025 13:59:02 GMT
- Title: Query Circuits: Explaining How Language Models Answer User Prompts
- Authors: Tung-Yu Wu, Fazl Barez
- Abstract summary: We introduce query circuits, which trace the information flow inside a model that maps a specific input to the output. NDF is a metric to evaluate how well a discovered circuit recovers the model's decision for a specific input. We find that there exist extremely sparse query circuits within the model that can recover much of its performance on single queries.
- Score: 13.16677655895186
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Explaining why a language model produces a particular output requires local, input-level explanations. Existing methods uncover global capability circuits (e.g., indirect object identification), but not why the model answers a specific input query in a particular way. We introduce query circuits, which directly trace the information flow inside a model that maps a specific input to the output. Unlike surrogate-based approaches (e.g., sparse autoencoders), query circuits are identified within the model itself, resulting in more faithful and computationally accessible explanations. To make query circuits practical, we address two challenges. First, we introduce Normalized Deviation Faithfulness (NDF), a robust metric to evaluate how well a discovered circuit recovers the model's decision for a specific input; it is broadly applicable to circuit discovery beyond our setting. Second, we develop sampling-based methods to efficiently identify circuits that are sparse yet faithfully describe the model's behavior. Across benchmarks (IOI, arithmetic, MMLU, and ARC), we find that there exist extremely sparse query circuits within the model that can recover much of its performance on single queries. For example, a circuit covering only 1.3% of model connections can recover about 60% of performance on an MMLU question. Overall, query circuits provide a step towards faithful, scalable explanations of how language models process individual inputs.
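The abstract does not reproduce the exact NDF formula, but the general idea of a normalized deviation score can be sketched: compare how far the circuit's output deviates from the full model's, normalized against a fully ablated baseline. The function below is only an illustrative single-logit version under that assumption, not the paper's definition.

```python
def normalized_deviation_faithfulness(full_logit, circuit_logit, ablated_logit):
    """Illustrative normalized-deviation score (NOT the paper's exact NDF).

    Returns 1.0 when the circuit exactly reproduces the full model's output,
    and 0.0 when it is no better than the fully ablated baseline.
    """
    denom = abs(full_logit - ablated_logit)
    if denom == 0.0:
        return 1.0  # degenerate case: ablation did not change the output
    score = 1.0 - abs(full_logit - circuit_logit) / denom
    return max(0.0, min(1.0, score))  # clamp to [0, 1]
```

For example, a circuit whose logit lands halfway between the full model and the ablated baseline would score 0.5 under this hypothetical formulation.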
Related papers
- Finding Highly Interpretable Prompt-Specific Circuits in Language Models [4.768156759829138]
We show that circuits are prompt-specific, even within a fixed task. We introduce ACC++, refinements that extract cleaner, lower-dimensional causal signals inside attention heads from a single forward pass. We develop an automated interpretability pipeline that uses ACC++ signals to surface human-interpretable features.
arXiv Detail & Related papers (2026-02-13T21:41:17Z)
- Routing with Generated Data: Annotation-Free LLM Skill Estimation and Expert Selection [70.73201284835498]
We introduce Routing with Generated Data (RGD), a challenging setting in which routers are trained exclusively on generated queries and answers. We evaluate query-answer routers (using both queries and labels) and query-only routers across four diverse benchmarks and 12 models. We propose CASCAL, a novel query-only router that estimates model correctness through consensus voting and identifies model-specific skill niches via hierarchical clustering.
arXiv Detail & Related papers (2026-01-14T18:43:32Z)
- Self-Steering Language Models [113.96916935955842]
DisCIPL is a method for "self-steering" language models (LMs). DisCIPL generates a task-specific inference program that is executed by a population of Follower models. Our work opens up a design space of highly-parallelized Monte Carlo inference strategies.
arXiv Detail & Related papers (2025-04-09T17:54:22Z)
- On Mechanistic Circuits for Extractive Question-Answering [47.167393805165325]
Large language models are increasingly used to process documents and facilitate question-answering on them. In our paper, we extract mechanistic circuits for this real-world language modeling task. We show the potential benefits of circuits towards downstream applications such as data attribution to context information.
arXiv Detail & Related papers (2025-02-12T01:54:21Z)
- Position-aware Automatic Circuit Discovery [59.64762573617173]
We identify a gap in existing circuit discovery methods, which treat model components as equally relevant across input positions. We propose two improvements to incorporate positionality into circuits, even on tasks containing variable-length examples. Our approach enables fully automated discovery of position-sensitive circuits, yielding better trade-offs between circuit size and faithfulness compared to prior work.
arXiv Detail & Related papers (2025-02-07T00:18:20Z)
- Adaptive Circuit Behavior and Generalization in Mechanistic Interpretability [3.138731415322007]
We investigate the generality of the indirect object identification (IOI) circuit in GPT-2 small. Our findings reveal that the circuit generalizes surprisingly well, reusing all of its components and mechanisms while only adding additional input edges.
arXiv Detail & Related papers (2024-11-25T05:32:34Z)
- FLARE: Faithful Logic-Aided Reasoning and Exploration [47.46564769245296]
We introduce a novel approach for traversing the problem space using task decompositions. We use Large Language Models to plan a solution and soft-formalise the query into facts and predicates using logic programming code. Our method allows us to compute the faithfulness of the reasoning process w.r.t. the generated code and analyse the steps of the multi-hop search without relying on external solvers.
arXiv Detail & Related papers (2024-10-14T19:39:11Z)
- Adversarial Circuit Evaluation [1.1893676124374688]
We evaluate three circuits found in the literature (IOI, greater-than, and docstring) in an adversarial manner.
We measure the KL divergence between the full model's output and the circuit's output, calculated through resample ablation, and we analyze the worst-performing inputs.
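The KL-divergence comparison described above can be sketched directly: treat the full model's and the circuit's next-token outputs as discrete distributions and compute KL(P‖Q). The distributions below are hypothetical placeholders, not values from the paper.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) between two discrete next-token distributions.

    A small eps guards against log(0) when a token has zero probability.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical next-token distributions over a 3-token vocabulary:
full_model = [0.7, 0.2, 0.1]  # full model's output
circuit = [0.6, 0.3, 0.1]     # circuit's output under resample ablation
divergence = kl_divergence(full_model, circuit)
```

A divergence near zero indicates the circuit closely matches the full model on that input; adversarial evaluation looks for inputs where this value is largest.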
arXiv Detail & Related papers (2024-07-21T13:43:44Z)
- Finding Transformer Circuits with Edge Pruning [71.12127707678961]
We propose Edge Pruning as an effective and scalable solution to automated circuit discovery. Our method finds circuits in GPT-2 that use less than half the number of edges compared to circuits found by previous methods. Thanks to its efficiency, we scale Edge Pruning to CodeLlama-13B, a model over 100x the scale of those that prior methods operate on.
arXiv Detail & Related papers (2024-06-24T16:40:54Z)
- Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models [55.19497659895122]
We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors.
arXiv Detail & Related papers (2024-03-28T17:56:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.