BlackboxNLP-2025 MIB Shared Task: Improving Circuit Faithfulness via Better Edge Selection
- URL: http://arxiv.org/abs/2510.25786v1
- Date: Tue, 28 Oct 2025 15:49:34 GMT
- Title: BlackboxNLP-2025 MIB Shared Task: Improving Circuit Faithfulness via Better Edge Selection
- Authors: Yaniv Nikankin, Dana Arad, Itay Itzhak, Anja Reusch, Adi Simhi, Gal Kesten-Pomeranz, Yonatan Belinkov
- Abstract summary: We propose three key improvements to circuit discovery. First, we use bootstrapping to identify edges with consistent attribution scores. Second, we introduce a simple ratio-based selection strategy to prioritize strong positive-scoring edges. Third, we replace the standard greedy selection with an integer linear programming formulation.
- Score: 35.326040728422576
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One of the main challenges in mechanistic interpretability is circuit discovery, determining which parts of a model perform a given task. We build on the Mechanistic Interpretability Benchmark (MIB) and propose three key improvements to circuit discovery. First, we use bootstrapping to identify edges with consistent attribution scores. Second, we introduce a simple ratio-based selection strategy to prioritize strong positive-scoring edges, balancing performance and faithfulness. Third, we replace the standard greedy selection with an integer linear programming formulation. Our methods yield more faithful circuits and outperform prior approaches across multiple MIB tasks and models. Our code is available at: https://github.com/technion-cs-nlp/MIB-Shared-Task.
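The abstract does not spell out how the bootstrapping step works, so the following is only a minimal sketch of one plausible reading: resample the evaluation examples with replacement, recompute each edge's mean attribution score per resample, and keep only edges whose score stays positive in nearly all resamples. The function names, the `min_agreement` threshold, and the toy scores are assumptions made for illustration and are not taken from the paper or its repository. The final step uses plain top-k selection as a stand-in; it is not the ratio-based or integer linear programming selection the authors describe.

```python
import random
import statistics


def bootstrap_edge_means(per_example_scores, n_resamples=200, seed=0):
    """Resample examples with replacement; return each edge's mean score per resample.

    per_example_scores: dict mapping edge name -> list of per-example scores
    (hypothetical input format; all lists must have the same length).
    """
    rng = random.Random(seed)
    n_examples = len(next(iter(per_example_scores.values())))
    means = {edge: [] for edge in per_example_scores}
    for _ in range(n_resamples):
        # One shared set of resampled example indices for all edges.
        idx = [rng.randrange(n_examples) for _ in range(n_examples)]
        for edge, scores in per_example_scores.items():
            means[edge].append(statistics.fmean([scores[i] for i in idx]))
    return means


def consistent_positive_edges(resampled_means, min_agreement=0.95):
    """Keep edges whose resampled mean is positive in >= min_agreement of resamples."""
    kept = {}
    for edge, values in resampled_means.items():
        frac_positive = sum(v > 0 for v in values) / len(values)
        if frac_positive >= min_agreement:
            kept[edge] = statistics.fmean(values)
    return kept


def select_top_k(edge_scores, k):
    """Plain top-k selection by score (a stand-in, not the paper's selection rule)."""
    return sorted(edge_scores, key=edge_scores.get, reverse=True)[:k]


# Toy attribution scores (made up): one stable edge, one noisy, one negative.
toy_scores = {
    "a->b": [0.9, 1.1, 1.0, 0.8],
    "b->c": [1.0, -1.0, 0.9, -0.9],
    "a->c": [-0.5, -0.4, -0.6, -0.5],
}
kept = consistent_positive_edges(bootstrap_edge_means(toy_scores))
circuit = select_top_k(kept, k=2)
```

In this toy run only the stable edge survives the consistency filter, because the noisy edge's resampled mean changes sign across resamples and the negative edge is never positive.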
Related papers
- BlackboxNLP-2025 MIB Shared Task: Exploring Ensemble Strategies for Circuit Localization Methods [64.5040037515574]
We investigate whether ensembling two or more circuit localization methods can improve performance. In parallel ensembling, we combine attribution scores assigned to each edge by different methods. In the sequential ensemble, we use edge attribution scores obtained via EAP-IG as a warm start for a more expensive but more precise circuit identification method.
arXiv Detail & Related papers (2025-10-08T09:39:40Z)
- Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework [4.336808542533343]
This research proposes a hybrid attribution and pruning framework that uses attribution patching to identify a high-potential subgraph. We show that HAP is 46% faster than baseline algorithms without sacrificing circuit faithfulness.
arXiv Detail & Related papers (2025-09-28T18:34:43Z)
- MIB: A Mechanistic Interpretability Benchmark [77.35046700898326]
We propose MIB, a Mechanistic Interpretability Benchmark, with two tracks spanning four tasks and five models. Using MIB, we find that attribution and mask optimization methods perform best on circuit localization. For causal variable localization, we find that the supervised DAS method performs best, while SAE features are not better than neurons.
arXiv Detail & Related papers (2025-04-17T17:55:45Z)
- Towards Reliable AI Model Deployments: Multiple Input Mixup for Out-of-Distribution Detection [4.985768723667418]
We propose a novel and simple method to solve the Out-of-Distribution (OOD) detection problem. Our method can help improve OOD detection performance with only a single epoch of fine-tuning. Our method does not require training the model from scratch and can simply be attached to the classifier.
arXiv Detail & Related papers (2023-12-24T15:31:51Z)
- Lookback for Learning to Branch [77.32867454769936]
Bipartite Graph Neural Networks (GNNs) have been shown to be an important component of deep-learning-based Mixed-Integer Linear Program (MILP) solvers. Recent works have demonstrated the effectiveness of such GNNs in replacing the branching (variable selection) in branch-and-bound (B&B) solvers.
arXiv Detail & Related papers (2022-06-30T02:33:32Z)
- Mutual-Information Based Few-Shot Classification [34.95314059362982]
We introduce Transductive Information Maximization (TIM) for few-shot learning. Our method maximizes the mutual information between the query features and their label predictions for a given few-shot task. We propose a new alternating-direction solver, which speeds up transductive inference over gradient-based optimization.
arXiv Detail & Related papers (2021-06-23T09:17:23Z)
- DORB: Dynamically Optimizing Multiple Rewards with Bandits [101.68525259222164]
Policy-based reinforcement learning has proven to be a promising approach for optimizing non-differentiable evaluation metrics for language generation tasks. We use the Exp3 algorithm for bandits and formulate two approaches for bandit rewards: (1) Single Multi-reward Bandit (SM-Bandit); (2) Hierarchical Multi-reward Bandit (HM-Bandit). We empirically show the effectiveness of our approaches via various automatic metrics and human evaluation on two important NLG tasks.
arXiv Detail & Related papers (2020-11-15T21:57:47Z)
- Stepwise Model Selection for Sequence Prediction via Deep Kernel Learning [100.83444258562263]
We propose a novel Bayesian optimization (BO) algorithm to tackle the challenge of model selection in this setting. In order to solve the resulting multiple black-box function optimization problem jointly and efficiently, we exploit potential correlations among black-box functions. We are the first to formulate the problem of stepwise model selection (SMS) for sequence prediction, and to design and demonstrate an efficient joint-learning algorithm for this purpose.
arXiv Detail & Related papers (2020-01-12T09:42:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.