Attribution Patching Outperforms Automated Circuit Discovery
- URL: http://arxiv.org/abs/2310.10348v2
- Date: Mon, 20 Nov 2023 11:31:16 GMT
- Title: Attribution Patching Outperforms Automated Circuit Discovery
- Authors: Aaquib Syed, Can Rager, Arthur Conmy
- Abstract summary: We show that a simple method based on attribution patching outperforms all existing methods.
We apply a linear approximation to activation patching to estimate the importance of each edge in the computational subgraph.
- Score: 3.8695554579762814
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models. Existing automated circuit discovery work applies activation patching to identify subnetworks responsible for solving specific tasks (circuits). In this work, we show that a simple method based on attribution patching outperforms all existing methods while requiring just two forward passes and a backward pass. We apply a linear approximation to activation patching to estimate the importance of each edge in the computational subgraph. Using this approximation, we prune the least important edges of the network. We survey the performance and limitations of this method, finding that, averaged over all tasks, our method achieves greater AUC on circuit recovery than other methods.
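The abstract's method can be made concrete in a few lines. Attribution patching replaces one activation-patching forward pass per component with a first-order Taylor approximation: the estimated effect of patching an activation is the gradient of the task metric dotted with the difference between corrupted and clean activations. Below is a minimal PyTorch-style sketch of that idea; `model`, `layer_names`, and `metric` are generic placeholders, and note the paper applies the approximation per edge of the computational graph, whereas this scores whole activations for brevity.

```python
import torch

def attribution_scores(model, layer_names, clean_x, corrupt_x, metric):
    """Attribution patching, roughly as the abstract describes it: estimate
    the metric change from patching each activation via a first-order
    (linear) approximation, using two forward passes and one backward pass
    in total. Illustrative sketch, not the authors' code."""
    acts = {}

    def save(name):
        def hook(_module, _inputs, out):
            acts[name] = out
            if out.requires_grad:
                out.retain_grad()  # keep gradients on non-leaf activations
        return hook

    modules = dict(model.named_modules())
    handles = [modules[n].register_forward_hook(save(n)) for n in layer_names]

    with torch.no_grad():                     # corrupted run: activations only
        model(corrupt_x)
    corrupt_acts = {n: acts[n] for n in layer_names}

    metric(model(clean_x)).backward()         # clean run: activations + grads
    for h in handles:
        h.remove()

    # Linear approximation: patching a_clean -> a_corrupt changes the metric
    # by approximately grad(metric, a) . (a_corrupt - a_clean).
    return {n: (acts[n].grad * (corrupt_acts[n] - acts[n].detach())).sum().item()
            for n in layer_names}
```

Components with the smallest absolute scores are the first candidates for pruning when recovering a circuit, which is where the abstract's AUC comparison comes from.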
Related papers
- Transformer Circuit Faithfulness Metrics are not Robust [0.04260910081285213]
We measure circuit 'faithfulness' by ablating portions of the model's computation.
We conclude that existing circuit faithfulness scores reflect both the methodological choices of researchers and the actual components of the circuit.
The ultimate goal of mechanistic interpretability work is to understand neural networks, so we emphasize the need for more clarity in the precise claims being made about circuits.
arXiv Detail & Related papers (2024-07-11T17:59:00Z)
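As a concrete reading of the ablation-based "faithfulness" measurement this entry describes, a common recipe runs the model with everything outside a candidate circuit ablated and asks how much task performance survives. The sketch below is illustrative only; `ablate` and `metric` are assumed helpers, and the choice of ablation is precisely the methodological degree of freedom the paper highlights.

```python
def faithfulness(model, batch, circuit, all_nodes, metric, ablate):
    """Ablation-based circuit faithfulness (sketch): run the model with
    every component outside the candidate circuit ablated, and ask how
    much of the full model's task performance the circuit recovers.
    `ablate(model, batch, nodes)` is an assumed helper (mean-, zero-, or
    resample-ablation -- exactly the kind of methodological choice the
    paper shows these scores depend on)."""
    full = metric(model(batch))
    circuit_only = metric(ablate(model, batch, set(all_nodes) - set(circuit)))
    baseline = metric(ablate(model, batch, set(all_nodes)))
    # 1.0 means the circuit alone matches the full model; 0.0 means it is
    # no better than ablating everything.
    return (circuit_only - baseline) / (full - baseline)
```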
- Functional Faithfulness in the Wild: Circuit Discovery with Differentiable Computation Graph Pruning [14.639036250438517]
We introduce a comprehensive reformulation of the task known as Circuit Discovery, along with DiscoGP.
DiscoGP is a novel and effective algorithm based on differentiable masking for discovering circuits.
arXiv Detail & Related papers (2024-07-04T09:42:25Z)
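A toy rendering of the differentiable-masking idea behind DiscoGP: give every edge of the computational graph a learnable logit, scale each edge's contribution by a sigmoid of that logit, and train the mask against task loss plus a sparsity penalty. `run_masked` is an assumed helper, and this is a sketch of the general technique rather than DiscoGP's actual algorithm.

```python
import torch

def discover_circuit(run_masked, batches, n_edges, lam=1e-2, lr=0.1, steps=500):
    """Differentiable-masking circuit discovery (illustrative sketch):
    learn a sigmoid mask over the graph's edges so the masked model still
    solves the task while a sparsity penalty switches most edges off.
    `run_masked(batch, mask)` is an assumed helper returning the task loss
    with each edge's contribution scaled by its mask value."""
    logits = torch.zeros(n_edges, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _, batch in zip(range(steps), batches):
        mask = torch.sigmoid(logits)                       # relaxed {0,1} mask
        loss = run_masked(batch, mask) + lam * mask.sum()  # task + sparsity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(logits) > 0.5                     # kept edges = circuit
```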
- Finding Transformer Circuits with Edge Pruning [71.12127707678961]
We propose Edge Pruning as an effective and scalable solution to automated circuit discovery.
Our method finds circuits in GPT-2 that use fewer than half as many edges as circuits found by previous methods.
Thanks to its efficiency, we scale Edge Pruning to CodeLlama-13B, a model over 100x the scale that prior methods operate on.
arXiv Detail & Related papers (2024-06-24T16:40:54Z)
- Towards Automated Circuit Discovery for Mechanistic Interpretability [7.605075513099429]
This paper systematizes the mechanistic interpretability workflow that prior researchers followed.
By varying the dataset, metric, and units under investigation, researchers can understand the functionality of each component.
We propose several algorithms and reproduce previous interpretability results to validate them.
arXiv Detail & Related papers (2023-04-28T17:36:53Z)
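The best-known algorithm from this paper, ACDC, is essentially a greedy sweep: visit the graph's edges in reverse topological order and permanently ablate any edge whose removal leaves the model's output distribution within a threshold (in KL divergence) of the full model. A schematic sketch, with `kl_to_full_model` as an assumed helper:

```python
def acdc(edges, kl_to_full_model, tau):
    """Greedy circuit discovery in the spirit of ACDC (sketch, not the
    authors' code). `edges` is assumed to be in reverse topological order;
    `kl_to_full_model(ablated_edges)` is an assumed helper that runs the
    model with the given edges ablated and returns the KL divergence of
    its outputs from the unablated model's outputs."""
    ablated = set()
    for edge in edges:
        if kl_to_full_model(ablated | {edge}) < tau:
            ablated.add(edge)  # removing this edge barely matters: prune it
    return [e for e in edges if e not in ablated]  # the recovered circuit
```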
- Unsupervised Learning of Initialization in Deep Neural Networks via Maximum Mean Discrepancy [74.34895342081407]
We propose an unsupervised algorithm for finding a good initialization given the input data.
We first notice that each parameter configuration in the parameter space corresponds to one particular downstream task of d-way classification.
We then conjecture that the success of learning is directly related to how diverse downstream tasks are in the vicinity of the initial parameters.
arXiv Detail & Related papers (2023-02-08T23:23:28Z)
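For background, the maximum mean discrepancy in the title compares two samples through kernel mean embeddings: MMD^2(X, Y) = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]. A minimal NumPy sketch with an RBF kernel (illustrative, not the paper's code):

```python
import numpy as np

def mmd2(X, Y, gamma=1.0):
    """Biased estimator of squared maximum mean discrepancy between
    samples X (n, d) and Y (m, d) under the RBF kernel
    k(a, b) = exp(-gamma * ||a - b||^2)."""
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
```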
- Stabilizing Q-learning with Linear Architectures for Provably Efficient Learning [53.17258888552998]
This work proposes an exploration variant of the basic $Q$-learning protocol with linear function approximation.
We show that the performance of the algorithm degrades very gracefully under a novel and more permissive notion of approximation error.
arXiv Detail & Related papers (2022-06-01T23:26:51Z)
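For context, Q-learning with linear function approximation models the action-value function as a dot product Q(s, a) = w . phi(s, a) and improves w by temporal-difference updates; the paper's contribution is an exploration variant and its analysis on top of this basic protocol. A bare-bones sketch of the underlying update, with `phi` as an assumed feature map:

```python
import numpy as np

def td_update(w, phi, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One Q-learning step with linear function approximation:
    Q(s, a) = w . phi(s, a). `phi` maps a state-action pair to a NumPy
    feature vector; illustrative only."""
    q_next = max(w @ phi(s_next, b) for b in actions)  # greedy bootstrap
    td_error = r + gamma * q_next - w @ phi(s, a)
    return w + alpha * td_error * phi(s, a)
```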
- MURAL: Meta-Learning Uncertainty-Aware Rewards for Outcome-Driven Reinforcement Learning [65.52675802289775]
We show that an uncertainty-aware classifier can solve challenging reinforcement learning problems.
We propose a novel method for computing the normalized maximum likelihood (NML) distribution.
We show that the resulting algorithm has a number of intriguing connections to both count-based exploration methods and prior algorithms for learning reward functions.
arXiv Detail & Related papers (2021-07-15T08:19:57Z)
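For reference, the conditional normalized maximum likelihood distribution mentioned here is usually defined by refitting the model as if each candidate label had been observed, then normalizing over labels; this is the standard definition from the NML literature, stated as background rather than as the paper's exact construction:

```latex
% Conditional NML: each candidate label y' is scored by a model re-fit as
% if (x, y') had been in the training set D, then normalized over labels.
p_{\mathrm{NML}}(y \mid x) =
  \frac{p_{\hat{\theta}(\mathcal{D} \cup \{(x, y)\})}(y \mid x)}
       {\sum_{y'} p_{\hat{\theta}(\mathcal{D} \cup \{(x, y')\})}(y' \mid x)},
\qquad
\hat{\theta}(\mathcal{D}) = \arg\max_{\theta}
  \prod_{(x_i, y_i) \in \mathcal{D}} p_{\theta}(y_i \mid x_i).
```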
- Manifold Regularized Dynamic Network Pruning [102.24146031250034]
This paper proposes a new paradigm that dynamically removes redundant filters by embedding the manifold information of all instances into the space of pruned networks.
The effectiveness of the proposed method is verified on several benchmarks, which shows better performance in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2021-03-10T03:59:03Z)
- Data-efficient Weakly-supervised Learning for On-line Object Detection under Domain Shift in Robotics [24.878465999976594]
Several object detection methods have been proposed in the literature, the vast majority based on Deep Convolutional Neural Networks (DCNNs).
These methods have important limitations for robotics: learning solely on off-line data may introduce biases and prevents adaptation to novel tasks.
In this work, we investigate how weakly-supervised learning can cope with these problems.
arXiv Detail & Related papers (2020-12-28T16:36:11Z)
- Multi-task Supervised Learning via Cross-learning [102.64082402388192]
We consider a problem known as multi-task learning, consisting of fitting a set of regression functions intended for solving different tasks.
In our novel formulation, we couple the parameters of these functions so that they learn in their task-specific domains while staying close to each other.
This facilitates cross-fertilization, in which data collected across different domains help improve learning performance on each of the other tasks.
arXiv Detail & Related papers (2020-10-24T21:35:57Z)
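The coupling described in the last entry can be written as one objective: each task keeps its own weights w_i but pays a proximity penalty lambda * ||w_i - w_bar||^2 toward the average of all task weights. A toy joint gradient step under that assumed formulation (illustrative, not the authors' code):

```python
import torch

def cross_learning_step(task_weights, task_losses, lam=0.1, lr=0.01):
    """One joint update of multi-task learning with coupled parameters:
    each task minimizes its own loss plus a proximity penalty that keeps
    its weights near the average of all task weights. `task_weights` are
    tensors with requires_grad=True; `task_losses` are per-task loss
    functions of those weights. Sketch of the coupling idea only."""
    center = torch.stack(task_weights).mean(dim=0).detach()
    updated = []
    for w, loss_fn in zip(task_weights, task_losses):
        loss = loss_fn(w) + lam * ((w - center) ** 2).sum()
        (grad,) = torch.autograd.grad(loss, w)
        updated.append((w - lr * grad).detach().requires_grad_(True))
    return updated
```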
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.