Towards Automated Circuit Discovery for Mechanistic Interpretability
- URL: http://arxiv.org/abs/2304.14997v4
- Date: Sat, 28 Oct 2023 20:05:52 GMT
- Title: Towards Automated Circuit Discovery for Mechanistic Interpretability
- Authors: Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan
Heimersheim, Adrià Garriga-Alonso
- Abstract summary: This paper systematizes the mechanistic interpretability process followed by recent reverse-engineering works.
By varying the dataset, metric, and units under investigation, researchers can understand the functionality of each component.
We propose several algorithms and reproduce previous interpretability results to validate them.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Through considerable effort and intuition, several recent works have
reverse-engineered nontrivial behaviors of transformer models. This paper
systematizes the mechanistic interpretability process they followed. First,
researchers choose a metric and dataset that elicit the desired model behavior.
Then, they apply activation patching to find which abstract neural network
units are involved in the behavior. By varying the dataset, metric, and units
under investigation, researchers can understand the functionality of each
component. We automate one of the process' steps: to identify the circuit that
implements the specified behavior in the model's computational graph. We
propose several algorithms and reproduce previous interpretability results to
validate them. For example, the ACDC algorithm rediscovered 5/5 of the
component types in a circuit in GPT-2 Small that computes the Greater-Than
operation. ACDC selected 68 of the 32,000 edges in GPT-2 Small, all of which
were manually found by previous work. Our code is available at
https://github.com/ArthurConmy/Automatic-Circuit-Discovery.
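The activation-patching step described in the abstract can be illustrated with a toy sketch. Everything below (the two-layer network, the inputs, and the `forward` helper) is invented for illustration and is not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network: metric = W2 @ relu(W1 @ x)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(1, 4))

def forward(x, patch=None):
    """Run the toy model; optionally overwrite one hidden unit's
    activation (index, value) before the second layer."""
    h = np.maximum(W1 @ x, 0.0)
    if patch is not None:
        idx, value = patch
        h = h.copy()
        h[idx] = value
    return (W2 @ h).item()

x_clean = rng.normal(size=3)    # input that elicits the behavior
x_corrupt = rng.normal(size=3)  # counterfactual input

h_clean = np.maximum(W1 @ x_clean, 0.0)

# Activation patching: run on the corrupted input, but splice in one
# hidden unit's clean activation; a large shift in the metric toward
# the clean output suggests that unit is involved in the behavior.
baseline = forward(x_corrupt)
for i in range(4):
    patched = forward(x_corrupt, patch=(i, h_clean[i]))
    print(f"unit {i}: effect = {patched - baseline:+.3f}")
```

In the paper's setting the "units" are nodes or edges of a transformer's computational graph rather than single hidden neurons, but the patch-and-measure logic is the same.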
Related papers
- Algorithmic Capabilities of Random Transformers [49.73113518329544]
We investigate what functions can be learned by randomly initialized transformers in which only the embedding layers are optimized.
We find that these random transformers can perform a wide range of meaningful algorithmic tasks.
Our results indicate that some algorithmic capabilities are present in transformers even before these models are trained.
arXiv Detail & Related papers (2024-10-06T06:04:23Z)
- Transformer Circuit Faithfulness Metrics are not Robust [0.04260910081285213]
We measure circuit 'faithfulness' by ablating portions of the model's computation.
We conclude that existing circuit faithfulness scores reflect both researchers' methodological choices and the actual components of the circuit.
The ultimate goal of mechanistic interpretability work is to understand neural networks, so we emphasize the need for more clarity in the precise claims being made about circuits.
arXiv Detail & Related papers (2024-07-11T17:59:00Z)
- Efficient Automated Circuit Discovery in Transformers using Contextual Decomposition [10.13822875330178]
We introduce contextual decomposition for transformers (CD-T) to build interpretable circuits in large language models.
CD-T can produce circuits at an arbitrary level of abstraction, and is the first method able to produce circuits as fine-grained as attention heads.
We show CD-T circuits are able to perfectly replicate original models' behavior using fewer nodes than the baselines for all tasks.
arXiv Detail & Related papers (2024-07-01T01:12:20Z)
- Finding Transformer Circuits with Edge Pruning [71.12127707678961]
We propose Edge Pruning as an effective and scalable solution to automated circuit discovery.
Our method finds circuits in GPT-2 with fewer than half as many edges as circuits found by previous methods.
Thanks to its efficiency, we scale Edge Pruning to CodeLlama-13B, a model over 100x the scale of those that prior methods operate on.
arXiv Detail & Related papers (2024-06-24T16:40:54Z) - Automatically Identifying Local and Global Circuits with Linear Computation Graphs [45.760716193942685]
We introduce our circuit discovery pipeline with Sparse Autoencoders (SAEs) and a variant called Transcoders.
Our methods do not require linear approximation to compute the causal effect of each node.
We analyze three kinds of circuits in GPT-2 Small: bracket, induction, and Indirect Object Identification circuits.
arXiv Detail & Related papers (2024-05-22T17:50:04Z) - GEC-DePenD: Non-Autoregressive Grammatical Error Correction with
Decoupled Permutation and Decoding [52.14832976759585]
Grammatical error correction (GEC) is an important NLP task that is usually solved with autoregressive sequence-to-sequence models.
We propose a novel non-autoregressive approach to GEC that decouples the architecture into a permutation network and a decoding network.
We show that the resulting network improves over previously known non-autoregressive methods for GEC.
arXiv Detail & Related papers (2023-11-14T14:24:36Z) - Attribution Patching Outperforms Automated Circuit Discovery [3.8695554579762814]
We show that a simple method based on attribution patching outperforms all existing methods.
We apply a linear approximation to activation patching to estimate the importance of each edge in the computational subgraph.
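The linear approximation described above can be sketched on a toy model. The weights, dimensions, and `metric` below are invented for illustration; because this toy metric is linear in the activations, the first-order estimate happens to be exact here:

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))
w2 = rng.normal(size=4)  # linear readout: metric = w2 @ relu(W1 @ x)

def hidden(x):
    return np.maximum(W1 @ x, 0.0)

def metric(h):
    return w2 @ h

x_clean, x_corrupt = rng.normal(size=3), rng.normal(size=3)
h_clean, h_corrupt = hidden(x_clean), hidden(x_corrupt)

# Attribution patching: instead of re-running the model once per unit,
# take a first-order Taylor estimate of each unit's patching effect:
#   effect_i ≈ (d metric / d h_i) * (h_clean_i - h_corrupt_i)
# For a linear readout the gradient with respect to h is just w2.
grad = w2
estimate = grad * (h_clean - h_corrupt)

# Ground truth, computed by actually patching each unit in turn.
actual = np.array([
    metric(np.where(np.arange(4) == i, h_clean, h_corrupt)) - metric(h_corrupt)
    for i in range(4)
])
print(np.allclose(estimate, actual))  # → True
```

The appeal of the approximation is cost: one backward pass yields estimates for every unit at once, instead of one forward pass per unit as in plain activation patching.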
arXiv Detail & Related papers (2022-03-29T21:18:47Z)
- Pretraining Graph Neural Networks for few-shot Analog Circuit Modeling and Design [68.1682448368636]
We present a supervised pretraining approach to learn circuit representations that can be adapted to new unseen topologies or unseen prediction tasks.
To cope with the variable topological structure of different circuits, we describe each circuit as a graph and use graph neural networks (GNNs) to learn node embeddings.
We show that pretraining GNNs on prediction of output node voltages can encourage learning representations that can be adapted to new unseen topologies or prediction of new circuit level properties.
arXiv Detail & Related papers (2020-04-03T17:21:57Z)
- The data-driven physical-based equations discovery using evolutionary approach [77.34726150561087]
We describe the algorithm for the mathematical equations discovery from the given observations data.
The algorithm combines genetic programming with sparse regression.
It could be used for governing analytical equation discovery as well as for partial differential equations (PDE) discovery.