Circuit Fingerprints: How Answer Tokens Encode Their Geometrical Path
- URL: http://arxiv.org/abs/2602.09784v1
- Date: Tue, 10 Feb 2026 13:43:59 GMT
- Title: Circuit Fingerprints: How Answer Tokens Encode Their Geometrical Path
- Authors: Andres Saurez, Neha Sengar, Dongsoo Har
- Abstract summary: Circuit discovery and activation steering in transformers operate on the same representational space. We show they follow a single geometric principle: answer tokens, processed in isolation, encode the directions that would produce them.
- Score: 5.104181562775778
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Circuit discovery and activation steering in transformers have developed as separate research threads, yet both operate on the same representational space. Are they two views of the same underlying structure? We show they follow a single geometric principle: answer tokens, processed in isolation, encode the directions that would produce them. This Circuit Fingerprint hypothesis enables circuit discovery without gradients or causal intervention -- recovering comparable structure to gradient-based methods through geometric alignment alone. We validate this on standard benchmarks (IOI, SVA, MCQA) across four model families, achieving circuit discovery performance comparable to gradient-based methods. The same directions that identify circuit components also enable controlled steering -- achieving 69.8% emotion classification accuracy versus 53.1% for instruction prompting while preserving factual accuracy. Beyond method development, this read-write duality reveals that transformer circuits are fundamentally geometric structures: interpretability and controllability are two facets of the same object.
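The read-write duality described in the abstract lends itself to a compact illustration. The sketch below (not the authors' code) scores the MLP blocks of a small decoder-only model by their cosine alignment with a candidate fingerprint direction, then reuses that direction as an additive steering vector. The choice of gpt2, the definition of the fingerprint as the answer token's final-layer hidden state when run on its own, and the decision to score only MLP outputs are assumptions made for the sketch.

```python
# Hypothetical sketch of the Circuit Fingerprint idea: the same direction is
# used to *read* (score components by alignment) and to *write* (steer).
# Model, hook points, and the fingerprint definition are assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def fingerprint(answer: str) -> torch.Tensor:
    """Direction of the answer token processed in isolation (final-layer state)."""
    ids = tok(answer, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states[-1]
    vec = hidden[0, -1]
    return vec / vec.norm()

def alignment_scores(prompt: str, direction: torch.Tensor) -> dict:
    """Read side: cosine alignment of each MLP block's last-token output."""
    captured, hooks = {}, []
    for i, block in enumerate(model.transformer.h):
        def save(_, __, out, idx=i):
            captured[idx] = out[0, -1].detach()
        hooks.append(block.mlp.register_forward_hook(save))
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    for h in hooks:
        h.remove()
    return {i: F.cosine_similarity(v, direction, dim=0).item()
            for i, v in captured.items()}

def steer(prompt: str, direction: torch.Tensor, layer: int = 6, alpha: float = 8.0) -> str:
    """Write side: add the fingerprint direction to one block's residual output."""
    def add_dir(_, __, out):
        hs = out[0] if isinstance(out, tuple) else out  # block output format varies by version
        hs = hs + alpha * direction
        return (hs,) + tuple(out[1:]) if isinstance(out, tuple) else hs
    handle = model.transformer.h[layer].register_forward_hook(add_dir)
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=12, do_sample=False)
    handle.remove()
    return tok.decode(out[0])

direction = fingerprint(" Paris")
print(alignment_scores("The capital of France is", direction))  # per-layer alignment
print(steer("The weather today makes me feel", direction))      # steered continuation
```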
Related papers
- Certified Circuits: Stability Guarantees for Mechanistic Circuits [80.30622018787835]
Certified Circuits provides provable stability guarantees for circuit discovery. On ImageNet and OOD datasets, certified circuits achieve up to 91% higher accuracy.
arXiv Detail & Related papers (2026-02-26T13:07:31Z) - Explaining the Explainer: Understanding the Inner Workings of Transformer-based Symbolic Regression Models [3.7957452405531265]
We introduce PATCHES, an evolutionary circuit discovery algorithm that identifies compact and correct circuits for symbolic regression. Using PATCHES, we isolate 28 circuits, providing the first circuit-level characterisation of an SR transformer.
arXiv Detail & Related papers (2026-02-03T13:27:10Z) - Multi-head Transformers Provably Learn Symbolic Multi-step Reasoning via Gradient Descent [66.78052387054593]
This work investigates how transformers learn to solve symbolic multi-step reasoning problems through chain-of-thought processes. We analyze two intertwined tasks: a backward reasoning task, where the model outputs a path from a goal node to the root, and a more complex forward reasoning task. We show that trained one-layer transformers can provably solve both tasks with generalization guarantees to unseen trees.
arXiv Detail & Related papers (2025-08-11T17:40:47Z) - EAP-GP: Mitigating Saturation Effect in Gradient-based Automated Circuit Identification [62.611812892924156]
We propose Edge Patching with GradPath (EAP-GP) to address the saturation effect. EAP-GP introduces an integration path that starts from the input and adaptively follows the direction of the difference between the gradients of corrupted and clean inputs, avoiding the saturated region. We evaluate EAP-GP on 6 datasets using GPT-2 Small, GPT-2 Medium, and GPT-2 XL. (A simplified path-integration sketch appears after this list.)
arXiv Detail & Related papers (2025-02-07T16:04:57Z) - Position-aware Automatic Circuit Discovery [59.64762573617173]
We identify a gap in existing circuit discovery methods, which treat model components as equally relevant across input positions. We propose two improvements that incorporate positionality into circuits, even on tasks containing variable-length examples. Our approach enables fully automated discovery of position-sensitive circuits, yielding better trade-offs between circuit size and faithfulness compared to prior work.
arXiv Detail & Related papers (2025-02-07T00:18:20Z) - Transformer Circuit Faithfulness Metrics are not Robust [0.04260910081285213]
We measure circuit 'faithfulness' by ablating portions of the model's computation.
We conclude that existing circuit faithfulness scores reflect the methodological choices of researchers as well as the actual components of the circuit.
The ultimate goal of mechanistic interpretability work is to understand neural networks, so we emphasize the need for more clarity in the precise claims being made about circuits.
arXiv Detail & Related papers (2024-07-11T17:59:00Z) - Finding Transformer Circuits with Edge Pruning [71.12127707678961]
We propose Edge Pruning as an effective and scalable solution to automated circuit discovery. Our method finds circuits in GPT-2 that use less than half the number of edges compared to circuits found by previous methods. Thanks to its efficiency, we scale Edge Pruning to CodeLlama-13B, a model over 100x the scale of those that prior methods operate on.
arXiv Detail & Related papers (2024-06-24T16:40:54Z) - Automatically Identifying Local and Global Circuits with Linear Computation Graphs [45.760716193942685]
We introduce our circuit discovery pipeline with Sparse Autoencoders (SAEs) and a variant called Transcoders.
Our methods do not require linear approximation to compute the causal effect of each node.
We analyze three kinds of circuits in GPT-2 Small: bracket, induction, and Indirect Object Identification circuits.
arXiv Detail & Related papers (2024-05-22T17:50:04Z) - How Transformers Learn Causal Structure with Gradient Descent [44.31729147722701]
Self-attention allows transformers to encode causal structure.
We introduce an in-context learning task that requires learning latent causal structure.
We show that transformers trained on our in-context learning task are able to recover a wide variety of causal structures.
arXiv Detail & Related papers (2024-02-22T17:47:03Z) - Analyzing Transformer Dynamics as Movement through Embedding Space [0.0]
This paper explores how Transformer-based language models exhibit intelligent behaviors such as understanding natural language.
We propose framing Transformer dynamics as movement through embedding space.
arXiv Detail & Related papers (2023-08-21T17:21:23Z)
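The EAP-GP entry above describes scoring edges by integrating gradients along a path between corrupted and clean activations rather than at a single point. As a rough, self-contained illustration (not the authors' algorithm, and with EAP-GP's adaptive path replaced by a plain straight-line path), the toy example below shows why a single saturated gradient underrates a component while path-integrated gradients do not.

```python
# Rough illustration of the integration-path idea behind gradient-based
# circuit attribution (see the EAP-GP entry above). This is NOT the authors'
# algorithm: EAP-GP adapts the path direction from gradient differences,
# whereas this sketch uses a straight-line path on a toy saturating metric.
import torch

def metric(x: torch.Tensor) -> torch.Tensor:
    """Toy task metric that saturates for large inputs (sigmoid plateau)."""
    return torch.sigmoid(4.0 * x).sum()

def single_point_attribution(clean, corrupted):
    """Plain attribution-patching score: (clean - corrupted) * grad at clean."""
    clean = clean.clone().requires_grad_(True)
    metric(clean).backward()
    return (clean - corrupted).detach() * clean.grad

def path_integrated_attribution(clean, corrupted, steps: int = 32):
    """Accumulate gradients along the path from corrupted to clean activations,
    so the score is not dominated by one saturated gradient."""
    total = torch.zeros_like(clean)
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = (corrupted + alpha * (clean - corrupted)).requires_grad_(True)
        metric(point).backward()
        total += point.grad
    return (clean - corrupted) * total / steps

clean = torch.tensor([3.0, 0.2])   # first coordinate sits on the sigmoid plateau
corrupted = torch.zeros(2)
print(single_point_attribution(clean, corrupted))    # underrates the saturated unit
print(path_integrated_attribution(clean, corrupted)) # recovers its true contribution
```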
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.