RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching
- URL: http://arxiv.org/abs/2508.21258v2
- Date: Thu, 30 Oct 2025 16:01:21 GMT
- Title: RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching
- Authors: Farnoush Rezaei Jafari, Oliver Eberle, Ashkan Khakzar, Neel Nanda, et al.
- Abstract summary: We introduce Relevance Patching (RelP), which replaces the local gradients in attribution patching with propagation coefficients. RelP requires only two forward passes and one backward pass, maintaining computational efficiency while improving faithfulness. We validate RelP across a range of models and tasks, showing that it more accurately approximates activation patching than standard attribution patching.
- Score: 16.22015078953355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Activation patching is a standard method in mechanistic interpretability for localizing the components of a model responsible for specific behaviors, but it is computationally expensive to apply at scale. Attribution patching offers a faster, gradient-based approximation, yet suffers from noise and reduced reliability in deep, highly non-linear networks. In this work, we introduce Relevance Patching (RelP), which replaces the local gradients in attribution patching with propagation coefficients derived from Layer-wise Relevance Propagation (LRP). LRP propagates the network's output backward through the layers, redistributing relevance to lower-level components according to local propagation rules that ensure properties such as relevance conservation or improved signal-to-noise ratio. Like attribution patching, RelP requires only two forward passes and one backward pass, maintaining computational efficiency while improving faithfulness. We validate RelP across a range of models and tasks, showing that it more accurately approximates activation patching than standard attribution patching, particularly when analyzing residual stream and MLP outputs in the Indirect Object Identification (IOI) task. For instance, for MLP outputs in GPT-2 Large, attribution patching achieves a Pearson correlation of 0.006, whereas RelP reaches 0.956, highlighting the improvement offered by RelP. Additionally, we compare the faithfulness of sparse feature circuits identified by RelP and Integrated Gradients (IG), showing that RelP achieves comparable faithfulness without the extra computational cost associated with IG.
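The mechanical difference between the two estimators fits in a few lines. The sketch below (PyTorch; `lrp_linear` and `patching_score` are illustrative names, and the single-layer LRP-ε rule stands in for the paper's full propagation scheme) shows attribution patching multiplying the clean-minus-corrupt activation difference by an ordinary gradient, where RelP would substitute an LRP propagation coefficient for that gradient:

```python
import torch
import torch.nn as nn

def lrp_linear(layer: nn.Linear, a_in: torch.Tensor, rel_out: torch.Tensor,
               eps: float = 1e-6) -> torch.Tensor:
    """LRP-epsilon rule for one linear layer: redistribute output relevance
    onto the inputs, conserving it up to the stabilizer eps."""
    z = layer(a_in)                                   # pre-activations, (..., d_out)
    stab = eps * torch.where(z >= 0, torch.ones_like(z), -torch.ones_like(z))
    s = rel_out / (z + stab)                          # stabilized relevance ratio
    coeff = s @ layer.weight                          # propagation coefficients, (..., d_in)
    return a_in * coeff

def patching_score(act_clean: torch.Tensor, act_corrupt: torch.Tensor,
                   backward_coeff: torch.Tensor) -> torch.Tensor:
    """Attribution patching uses backward_coeff = d(metric)/d(activation) from
    autograd; RelP instead uses an LRP-propagated coefficient."""
    return ((act_clean - act_corrupt) * backward_coeff).sum(dim=-1)
```

In both cases the activations come from two forward passes (clean and corrupted runs) and the backward coefficients from a single backward pass, matching the cost quoted in the abstract.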
Related papers
- Alleviating Sparse Rewards by Modeling Step-Wise and Long-Term Sampling Effects in Flow-Based GRPO [20.13873375670213]
TP-GRPO replaces outcome-based rewards with step-level incremental rewards. It identifies turning points, i.e., steps that flip the local reward trend, which are detected solely via sign changes in the incremental rewards.
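Read literally, the detection step is simple; a minimal sketch (hypothetical helper, not the paper's code):

```python
def turning_points(step_rewards):
    """Indices t where the incremental reward flips sign relative to step t-1."""
    return [t for t in range(1, len(step_rewards))
            if step_rewards[t] * step_rewards[t - 1] < 0]

# Increments +0.2, +0.1, -0.3, -0.1, +0.4 flip sign at t=2 and t=4.
print(turning_points([0.2, 0.1, -0.3, -0.1, 0.4]))  # [2, 4]
```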
arXiv Detail & Related papers (2026-02-06T06:37:10Z) - Segmental Advantage Estimation: Enhancing PPO for Long-Context LLM Training [17.530233901658253]
Segmental Advantage Estimation (SAE) mitigates the bias that Generalized Advantage Estimation can incur in Reinforcement Learning with Verifiable Rewards. SAE achieves superior performance, with marked improvements in final scores, stability, and sample efficiency.
arXiv Detail & Related papers (2026-01-12T08:41:47Z) - Revisiting LRP: Positional Attribution as the Missing Ingredient for Transformer Explainability [53.21677928601684]
Layer-wise relevance propagation is one of the most promising approaches to explainability in deep learning. We propose specialized, theoretically grounded LRP rules designed to propagate attributions across various positional encoding methods. Our method significantly outperforms the state of the art in both vision and NLP explainability tasks.
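For additive positional encodings, the basic difficulty is deciding how much relevance belongs to the token content versus the position. A hedged sketch of the standard LRP sum rule for that split (the paper's specialized rules for rotary and learned encodings are more involved):

```python
import torch

def split_relevance_additive(tok_emb, pos_emb, rel_in, eps=1e-9):
    """For h = tok_emb + pos_emb, split incoming relevance proportionally to
    each addend's contribution (LRP sum rule); rel_tok + rel_pos ~ rel_in."""
    total = tok_emb + pos_emb
    stab = eps * torch.where(total >= 0, torch.ones_like(total), -torch.ones_like(total))
    return rel_in * tok_emb / (total + stab), rel_in * pos_emb / (total + stab)
```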
arXiv Detail & Related papers (2025-06-02T18:07:55Z) - EAP-GP: Mitigating Saturation Effect in Gradient-based Automated Circuit Identification [62.611812892924156]
We propose Edge Attribution Patching with GradPath (EAP-GP) to address the saturation effect. EAP-GP introduces an integration path that starts from the input and adaptively follows the direction of the difference between the gradients of corrupted and clean inputs, avoiding the saturated region. We evaluate EAP-GP on 6 datasets using GPT-2 Small, GPT-2 Medium, and GPT-2 XL.
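One plausible reading of that path construction, sketched in PyTorch (the step rule and sign convention are guesses from the summary, not the paper's algorithm):

```python
import torch

def gradpath_points(metric, x_corrupt, x_clean, steps=16, step_size=0.1):
    """Build an adaptive path from the corrupted input by stepping along the
    difference between the gradients at the clean and current points; gradients
    are then integrated over the returned points, as in integrated gradients."""
    xk = x_clean.clone().requires_grad_(True)
    g_clean = torch.autograd.grad(metric(xk), xk)[0]
    x = x_corrupt.clone()
    points = [x.clone()]
    for _ in range(steps):
        xc = x.detach().clone().requires_grad_(True)
        g_cur = torch.autograd.grad(metric(xc), xc)[0]
        x = x + step_size * (g_clean - g_cur)   # steer away from saturated regions
        points.append(x.detach().clone())
    return points
```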
arXiv Detail & Related papers (2025-02-07T16:04:57Z) - ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by the Kronecker product to Aggregate Low Rank Experts. Thanks to this artful design, ALoRE introduces negligible extra parameters and can be effortlessly merged into the frozen backbone.
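The Kronecker construction can be gestured at with a toy module (all shapes, names, and the expert parameterization below are illustrative assumptions, not ALoRE's actual design):

```python
import torch
import torch.nn as nn

class KroneckerLowRankAdapter(nn.Module):
    """Sum over experts of kron(small dense factor, low-rank factor); the
    resulting delta_w matches the frozen weight's shape and can be merged in."""
    def __init__(self, d_out, d_in, n_experts=4, block=4, rank=2):
        super().__init__()
        assert d_out % block == 0 and d_in % block == 0
        self.s = nn.Parameter(torch.randn(n_experts, block, block) * 0.02)
        self.u = nn.Parameter(torch.randn(n_experts, d_out // block, rank) * 0.02)
        self.v = nn.Parameter(torch.zeros(n_experts, rank, d_in // block))

    def delta_w(self):
        return sum(torch.kron(self.s[i], self.u[i] @ self.v[i])
                   for i in range(self.s.shape[0]))
```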
arXiv Detail & Related papers (2024-12-11T12:31:30Z) - LLM-BIP: Structured Pruning for Large Language Models with Block-Wise Forward Importance Propagation [0.0]
We propose a more accurate pruning metric based on block-wise importance score propagation. We evaluate the proposed method using LLaMA-7B, Vicuna-7B, and LLaMA-13B across common zero-shot tasks.
arXiv Detail & Related papers (2024-12-09T11:57:16Z) - REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL. We find that REBEL provides a unified approach to language modeling and image generation, with performance stronger than or similar to PPO and DPO.
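The title suggests the core update: regress the difference in policy log-probability ratios between two responses onto their reward difference. A hedged sketch (the exact parameterization, e.g. the role of eta, may differ from the paper):

```python
import torch

def rebel_loss(logp_new_a, logp_old_a, logp_new_b, logp_old_b,
               reward_a, reward_b, eta=1.0):
    """Squared-regression objective on a pair of responses a and b."""
    ratio_diff = (logp_new_a - logp_old_a) - (logp_new_b - logp_old_b)
    return ((ratio_diff / eta - (reward_a - reward_b)) ** 2).mean()
```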
arXiv Detail & Related papers (2024-04-25T17:20:45Z) - SWAP: Sparse Entropic Wasserstein Regression for Robust Network Pruning [9.60349706518775]
This study addresses the challenge of inaccurate gradients in computing the empirical Fisher Information Matrix during neural network pruning.
We introduce SWAP, a formulation of Entropic Wasserstein regression (EWR) for pruning, capitalizing on the geometric properties of the optimal transport problem.
Our proposed method achieves a 6% improvement in accuracy and an 8% improvement in test loss for MobileNetV1 with less than one-fourth of the network parameters remaining.
arXiv Detail & Related papers (2023-10-07T21:15:32Z) - Efficient and Flexible Neural Network Training through Layer-wise Feedback Propagation [49.44309457870649]
Layer-wise Feedback Propagation (LFP) is a novel training principle for neural-network-like predictors. LFP decomposes a reward to individual neurons based on their respective contributions. Our method then implements a greedy approach, reinforcing helpful parts of the network and weakening harmful ones.
arXiv Detail & Related papers (2023-08-23T10:48:28Z) - Improving Deep Policy Gradients with Value Function Search [21.18135854494779]
This paper focuses on improving value approximation and analyzing the effects on Deep PG primitives.
We introduce a Value Function Search that employs a population of perturbed value networks to search for a better approximation.
Our framework does not require additional environment interactions, gradient computations, or ensembles.
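A minimal sketch of such a population search (names and the Gaussian perturbation scheme are assumptions; only already-collected states and returns are reused, in line with the no-extra-interaction claim):

```python
import copy
import torch

@torch.no_grad()
def value_function_search(value_net, states, returns, population=10, sigma=0.01):
    """Keep the perturbed copy of value_net with the lowest value-estimation
    error on cached data; no environment steps or gradient updates needed."""
    def mse(net):
        return ((net(states).squeeze(-1) - returns) ** 2).mean().item()
    best, best_err = value_net, mse(value_net)
    for _ in range(population):
        cand = copy.deepcopy(value_net)
        for p in cand.parameters():
            p.add_(sigma * torch.randn_like(p))   # Gaussian weight perturbation
        err = mse(cand)
        if err < best_err:
            best, best_err = cand, err
    return best
```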
arXiv Detail & Related papers (2023-02-20T18:23:47Z) - PEP: Parameter Ensembling by Perturbation [13.221295194854642]
Parameter Ensembling by Perturbation (PEP) constructs an ensemble of parameter values as random perturbations of the optimal parameter set from training.
PEP provides a small improvement in performance, and, in some cases, a substantial improvement in empirical calibration.
PEP can be used to probe the level of overfitting that occurred during training.
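The construction is simple enough to sketch directly (sigma would be tuned on validation data; this illustrates the idea, not the authors' code):

```python
import copy
import torch

@torch.no_grad()
def pep_predict(model, x, n_members=8, sigma=0.01):
    """Average the predictive distributions of randomly perturbed copies of
    the trained model, per Parameter Ensembling by Perturbation."""
    probs = []
    for _ in range(n_members):
        member = copy.deepcopy(model)
        for p in member.parameters():
            p.add_(sigma * torch.randn_like(p))
        probs.append(member(x).softmax(dim=-1))
    return torch.stack(probs).mean(dim=0)
```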
arXiv Detail & Related papers (2020-10-24T00:16:03Z) - Scaling Equilibrium Propagation to Deep ConvNets by Drastically Reducing its Gradient Estimator Bias [65.13042449121411]
In practice, training a network with the gradient estimates provided by EP does not scale to visual tasks harder than MNIST.
We show that a bias in the gradient estimate of EP, inherent in the use of finite nudging, is responsible for this phenomenon.
Correcting for this bias allows us to train an architecture with asymmetric forward and backward connections, yielding a 13.2% test error.
arXiv Detail & Related papers (2020-06-06T09:36:07Z) - FPCR-Net: Feature Pyramidal Correlation and Residual Reconstruction for Optical Flow Estimation [72.41370576242116]
We propose a semi-supervised Feature Pyramidal Correlation and Residual Reconstruction Network (FPCR-Net) for optical flow estimation from frame pairs.
It consists of two main modules: pyramid correlation mapping and residual reconstruction.
Experimental results show that the proposed scheme achieves state-of-the-art performance, improving the average end-point error (AEE) by 0.80, 1.15, and 0.10 against competing baseline methods.
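The correlation half of the pipeline rests on a standard cost-volume operation, sketched below for a single pyramid level (FPCR-Net's actual module aggregates such volumes across levels):

```python
import torch
import torch.nn.functional as F

def correlation_volume(f1, f2, max_disp=4):
    """Dot-product correlation between f1 and all (2*max_disp+1)^2 shifts of f2.
    f1, f2: (B, C, H, W) feature maps from the two frames."""
    B, C, H, W = f1.shape
    f2 = F.pad(f2, [max_disp] * 4)   # pad W and H by max_disp on each side
    vols = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = f2[:, :, dy:dy + H, dx:dx + W]
            vols.append((f1 * shifted).sum(dim=1, keepdim=True) / C)
    return torch.cat(vols, dim=1)    # (B, (2*max_disp+1)^2, H, W)
```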
arXiv Detail & Related papers (2020-01-17T07:13:51Z)