Towards Automated Circuit Discovery for Mechanistic Interpretability
- URL: http://arxiv.org/abs/2304.14997v4
- Date: Sat, 28 Oct 2023 20:05:52 GMT
- Title: Towards Automated Circuit Discovery for Mechanistic Interpretability
- Authors: Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan
Heimersheim, Adrià Garriga-Alonso
- Abstract summary: This paper systematizes the mechanistic interpretability process followed by recent reverse-engineering works.
By varying the dataset, metric, and units under investigation, researchers can understand the functionality of each component.
We propose several algorithms and reproduce previous interpretability results to validate them.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Through considerable effort and intuition, several recent works have
reverse-engineered nontrivial behaviors of transformer models. This paper
systematizes the mechanistic interpretability process they followed. First,
researchers choose a metric and dataset that elicit the desired model behavior.
Then, they apply activation patching to find which abstract neural network
units are involved in the behavior. By varying the dataset, metric, and units
under investigation, researchers can understand the functionality of each
component. We automate one of the process' steps: to identify the circuit that
implements the specified behavior in the model's computational graph. We
propose several algorithms and reproduce previous interpretability results to
validate them. For example, the ACDC algorithm rediscovered 5/5 of the
component types in a circuit in GPT-2 Small that computes the Greater-Than
operation. ACDC selected 68 of the 32,000 edges in GPT-2 Small, all of which
were manually found by previous work. Our code is available at
https://github.com/ArthurConmy/Automatic-Circuit-Discovery.
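The activation-patching step described in the abstract can be illustrated with a toy sketch. Everything below (the two-layer network, the inputs, and the `forward` helper) is invented for illustration and is not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network: metric = W2 @ relu(W1 @ x)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(1, 4))

def forward(x, patch=None):
    """Run the toy model; optionally overwrite one hidden unit's
    activation (index, value) before the second layer."""
    h = np.maximum(W1 @ x, 0.0)
    if patch is not None:
        idx, value = patch
        h = h.copy()
        h[idx] = value
    return (W2 @ h).item()

x_clean = rng.normal(size=3)    # input that elicits the behavior
x_corrupt = rng.normal(size=3)  # counterfactual input

h_clean = np.maximum(W1 @ x_clean, 0.0)

# Activation patching: run on the corrupted input, but splice in one
# hidden unit's clean activation; a large shift in the metric toward
# the clean output suggests that unit is involved in the behavior.
baseline = forward(x_corrupt)
for i in range(4):
    patched = forward(x_corrupt, patch=(i, h_clean[i]))
    print(f"unit {i}: effect = {patched - baseline:+.3f}")
```

In the paper's setting the "units" are nodes or edges of a transformer's computational graph rather than single hidden neurons, but the patch-and-measure logic is the same.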
Related papers
- Algorithmic Capabilities of Random Transformers [49.73113518329544]
We investigate what functions can be learned by randomly initialized transformers in which only the embedding layers are optimized.
We find that these random transformers can perform a wide range of meaningful algorithmic tasks.
Our results indicate that some algorithmic capabilities are present in transformers even before these models are trained.
arXiv Detail & Related papers (2024-10-06T06:04:23Z)
- Transformer Circuit Faithfulness Metrics are not Robust [0.04260910081285213]
We measure circuit 'faithfulness' by ablating portions of the model's computation.
We conclude that existing circuit faithfulness scores reflect both researchers' methodological choices and the actual components of the circuit.
The ultimate goal of mechanistic interpretability work is to understand neural networks, so we emphasize the need for more clarity in the precise claims being made about circuits.
arXiv Detail & Related papers (2024-07-11T17:59:00Z)
- Efficient Automated Circuit Discovery in Transformers using Contextual Decomposition [10.13822875330178]
We introduce contextual decomposition for transformers (CD-T) to build interpretable circuits in large language models.
CD-T can produce circuits at an arbitrary level of abstraction, and is the first method able to produce circuits as fine-grained as attention heads.
We show CD-T circuits are able to perfectly replicate original models' behavior using fewer nodes than the baselines for all tasks.
arXiv Detail & Related papers (2024-07-01T01:12:20Z)
- Finding Transformer Circuits with Edge Pruning [71.12127707678961]
We propose Edge Pruning as an effective and scalable solution to automated circuit discovery.
Our method finds circuits in GPT-2 with fewer than half as many edges as circuits found by previous methods.
Thanks to its efficiency, we scale Edge Pruning to CodeLlama-13B, a model over 100x the scale of those that prior methods operate on.
arXiv Detail & Related papers (2024-06-24T16:40:54Z) - Automatically Identifying Local and Global Circuits with Linear Computation Graphs [45.760716193942685]
We introduce our circuit discovery pipeline with Sparse Autoencoders (SAEs) and a variant called Transcoders.
Our methods do not require linear approximation to compute the causal effect of each node.
We analyze three kinds of circuits in GPT-2 Small: bracket, induction, and Indirect Object Identification circuits.
arXiv Detail & Related papers (2024-05-22T17:50:04Z) - GEC-DePenD: Non-Autoregressive Grammatical Error Correction with
Decoupled Permutation and Decoding [52.14832976759585]
Grammatical error correction (GEC) is an important NLP task that is usually solved with autoregressive sequence-to-sequence models.
We propose a novel non-autoregressive approach to GEC that decouples the architecture into a permutation network and a decoding network.
We show that the resulting network improves over previously known non-autoregressive methods for GEC.
arXiv Detail & Related papers (2023-11-14T14:24:36Z) - Attribution Patching Outperforms Automated Circuit Discovery [3.8695554579762814]
We show that a simple method based on attribution patching outperforms all existing methods.
We apply a linear approximation to activation patching to estimate the importance of each edge in the computational subgraph.
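The linear approximation described above can be sketched on a toy model. The weights, dimensions, and `metric` below are invented for illustration; because this toy metric is linear in the activations, the first-order estimate happens to be exact here:

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))
w2 = rng.normal(size=4)  # linear readout: metric = w2 @ relu(W1 @ x)

def hidden(x):
    return np.maximum(W1 @ x, 0.0)

def metric(h):
    return w2 @ h

x_clean, x_corrupt = rng.normal(size=3), rng.normal(size=3)
h_clean, h_corrupt = hidden(x_clean), hidden(x_corrupt)

# Attribution patching: instead of re-running the model once per unit,
# take a first-order Taylor estimate of each unit's patching effect:
#   effect_i ≈ (d metric / d h_i) * (h_clean_i - h_corrupt_i)
# For a linear readout the gradient with respect to h is just w2.
grad = w2
estimate = grad * (h_clean - h_corrupt)

# Ground truth, computed by actually patching each unit in turn.
actual = np.array([
    metric(np.where(np.arange(4) == i, h_clean, h_corrupt)) - metric(h_corrupt)
    for i in range(4)
])
print(np.allclose(estimate, actual))  # → True
```

The appeal of the approximation is cost: one backward pass yields estimates for every unit at once, instead of one forward pass per unit as in plain activation patching.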
arXiv Detail & Related papers (2022-03-29T21:18:47Z)
- Pretraining Graph Neural Networks for few-shot Analog Circuit Modeling and Design [68.1682448368636]
We present a supervised pretraining approach to learn circuit representations that can be adapted to new unseen topologies or unseen prediction tasks.
To cope with the variable topological structure of different circuits, we describe each circuit as a graph and use graph neural networks (GNNs) to learn node embeddings.
We show that pretraining GNNs on prediction of output node voltages can encourage learning representations that can be adapted to new unseen topologies or prediction of new circuit level properties.
arXiv Detail & Related papers (2020-04-03T17:21:57Z)
- The data-driven physical-based equations discovery using evolutionary approach [77.34726150561087]
We describe the algorithm for the mathematical equations discovery from the given observations data.
The algorithm combines genetic programming with sparse regression.
It could be used for governing analytical equation discovery as well as for partial differential equations (PDE) discovery.