Related papers: Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework

URL: http://arxiv.org/abs/2510.03282v1
Date: Sun, 28 Sep 2025 18:34:43 GMT
Title: Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework
Authors: Hao Gu, Vibhas Nair, Amrithaa Ashok Kumar, Jayvart Sharma, Ryan Lagasse,
Abstract summary: This research proposes a hybrid attribution and pruning framework that uses attribution patching to identify a high-potential subgraph. We show that HAP is 46% faster than baseline algorithms without sacrificing circuit faithfulness.
Score: 4.336808542533343
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Interpreting language models often involves circuit analysis, which aims to identify sparse subnetworks, or circuits, that accomplish specific tasks. Existing circuit discovery algorithms face a fundamental trade-off: attribution patching is fast but unfaithful to the full model, while edge pruning is faithful but computationally expensive. This research proposes a hybrid attribution and pruning (HAP) framework that uses attribution patching to identify a high-potential subgraph, then applies edge pruning to extract a faithful circuit from it. We show that HAP is 46% faster than baseline algorithms without sacrificing circuit faithfulness. Furthermore, we present a case study on the Indirect Object Identification task, showing that our method preserves cooperative circuit components (e.g. S-inhibition heads) that attribution patching methods prune at high sparsity. Our results show that HAP could be an effective approach for improving the scalability of mechanistic interpretability research to larger models. Our code is available at https://anonymous.4open.science/r/HAP-circuit-discovery.
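
Illustrative sketch (not the authors' code): the two-stage idea described in the abstract can be shown on a toy problem. Everything below is an assumption made for illustration: the toy `metric`, the function names, and the parameters `top_k` and `tol` are hypothetical. Stage 2 uses a simple greedy removal loop as a stand-in for Edge Pruning, which in the original work learns edge masks by optimization. An edge that is outside the circuit receives its corrupted activation; an edge inside the circuit keeps its clean activation.

```python
import numpy as np

def metric(acts, weights):
    # Toy stand-in for a task metric such as a logit difference (assumption).
    return float(np.tanh(acts @ weights))

def metric_grad(acts, weights):
    # Analytic gradient of the toy metric with respect to each edge activation.
    return (1.0 - np.tanh(acts @ weights) ** 2) * weights

def run_with_circuit(keep, clean, corrupt, weights):
    # Edges in `keep` carry clean activations; all others are patched
    # with their corrupted activations.
    return metric(np.where(keep, clean, corrupt), weights)

def attribution_scores(clean, corrupt, weights):
    # First-order (linear) estimate of the effect of patching each edge:
    # |(corrupt - clean) * d metric / d activation|, evaluated on the clean run.
    return np.abs((corrupt - clean) * metric_grad(clean, weights))

def hap_sketch(clean, corrupt, weights, top_k, tol):
    n = clean.shape[0]
    full = run_with_circuit(np.ones(n, dtype=bool), clean, corrupt, weights)
    scores = attribution_scores(clean, corrupt, weights)
    order = np.argsort(scores, kind="stable")  # ascending by score

    # Stage 1: cheap attribution scores select a candidate subgraph.
    keep = np.zeros(n, dtype=bool)
    keep[order[-top_k:]] = True

    # Stage 2 (greedy stand-in for Edge Pruning): try to drop each candidate
    # edge, lowest score first, and restore it if the circuit's metric moves
    # more than `tol` away from the full model's metric.
    for e in order:
        if not keep[e]:
            continue
        keep[e] = False
        if abs(run_with_circuit(keep, clean, corrupt, weights) - full) > tol:
            keep[e] = True
    return keep
```

The expensive faithfulness check in stage 2 runs only on the `top_k` candidates rather than on every edge, which is the intuition behind combining a fast scoring pass with a slower, more faithful pruning pass.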

Related papers

Certified Circuits: Stability Guarantees for Mechanistic Circuits [80.30622018787835]
Certified Circuits provides provable stability guarantees for circuit discovery. On ImageNet and OOD datasets, certified circuits achieve up to 91% higher accuracy.
arXiv Detail & Related papers (2026-02-26T13:07:31Z)
PAHQ: Accelerating Automated Circuit Discovery through Mixed-Precision Inference Optimization [17.316927027489506]
Automated Circuit Discovery (ACDC) has emerged as a pivotal methodology in circuit discovery. But its application to large language models is severely limited by computational inefficiency and prohibitively high memory requirements. Our proposed method for accelerating automated circuit discovery, Per Attention Head Quantization (PAHQ), takes a fundamentally different approach by optimizing the efficiency of each individual patching operation.
arXiv Detail & Related papers (2025-10-27T12:24:14Z)
Position-aware Automatic Circuit Discovery [59.64762573617173]
We identify a gap in existing circuit discovery methods, treating model components as equally relevant across input positions. We propose two improvements to incorporate positionality into circuits, even on tasks containing variable-length examples. Our approach enables fully automated discovery of position-sensitive circuits, yielding better trade-offs between circuit size and faithfulness compared to prior work.
arXiv Detail & Related papers (2025-02-07T00:18:20Z)
Transformer Circuit Faithfulness Metrics are not Robust [0.04260910081285213]
We measure circuit 'faithfulness' by ablating portions of the model's computation. We conclude that existing circuit faithfulness scores reflect both the methodological choices of researchers as well as the actual components of the circuit. The ultimate goal of mechanistic interpretability work is to understand neural networks, so we emphasize the need for more clarity in the precise claims being made about circuits.
arXiv Detail & Related papers (2024-07-11T17:59:00Z)
Finding Transformer Circuits with Edge Pruning [71.12127707678961]
We propose Edge Pruning as an effective and scalable solution to automated circuit discovery. Our method finds circuits in GPT-2 that use less than half the number of edges compared to circuits found by previous methods. Thanks to its efficiency, we scale Edge Pruning to CodeLlama-13B, a model over 100x the scale that prior methods operate on.
arXiv Detail & Related papers (2024-06-24T16:40:54Z)
Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms [35.514624827207136]
Edge attribution patching (EAP), a gradient-based approximation to interventions, has emerged as a scalable but imperfect solution to this problem. We introduce a new method, EAP with integrated gradients (EAP-IG), that aims to better maintain a core property of circuits: faithfulness. Our experiments demonstrate that circuits found using EAP are less faithful than those found using EAP-IG, even though both have high node overlap with circuits found previously using causal interventions.
arXiv Detail & Related papers (2024-03-26T15:44:58Z)
Attribution Patching Outperforms Automated Circuit Discovery [3.8695554579762814]
We show that a simple method based on attribution patching outperforms all existing methods. We apply a linear approximation to activation patching to estimate the importance of each edge in the computational subgraph.
arXiv Detail & Related papers (2023-10-16T12:34:43Z)
Adaptive Planning Search Algorithm for Analog Circuit Verification [53.97809573610992]
We propose a machine learning (ML) approach that uses fewer simulations. We show that the proposed approach is able to provide OCCs closer to the specifications for all circuits.
arXiv Detail & Related papers (2023-06-23T12:57:46Z)
CATRO: Channel Pruning via Class-Aware Trace Ratio Optimization [61.71504948770445]
We propose a novel channel pruning method via Class-Aware Trace Ratio Optimization (CATRO) to reduce the computational burden and accelerate the model inference. We show that CATRO achieves higher accuracy with similar cost or lower cost with similar accuracy than other state-of-the-art channel pruning algorithms. Because of its class-aware property, CATRO is suitable to prune efficient networks adaptively for various classification subtasks, enhancing handy deployment and usage of deep networks in real-world applications.
arXiv Detail & Related papers (2021-10-21T06:26:31Z)
DHP: Differentiable Meta Pruning via HyperNetworks [158.69345612783198]
This paper introduces a differentiable pruning method via hypernetworks for automatic network pruning. Latent vectors control the output channels of the convolutional layers in the backbone network and act as a handle for the pruning of the layers. Experiments are conducted on various networks for image classification, single image super-resolution, and denoising.
arXiv Detail & Related papers (2020-03-30T17:59:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.