Related papers: Findings of the BlackboxNLP 2025 Shared Task: Localizing Circuits and Causal Variables in Language Models

URL: http://arxiv.org/abs/2511.18409v1
Date: Sun, 23 Nov 2025 11:33:59 GMT
Title: Findings of the BlackboxNLP 2025 Shared Task: Localizing Circuits and Causal Variables in Language Models
Authors: Dana Arad, Yonatan Belinkov, Hanjie Chen, Najoung Kim, Hosein Mohebbi, Aaron Mueller, Gabriele Sarti, Martin Tutek,
Abstract summary: Mechanistic interpretability (MI) seeks to uncover how language models (LMs) implement specific behaviors. The recently released Mechanistic Interpretability Benchmark (MIB) provides a framework for evaluating circuit and causal variable localization. The BlackboxNLP 2025 Shared Task extends MIB into a community-wide reproducible comparison of MI techniques.
Score: 56.73385658981886
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Mechanistic interpretability (MI) seeks to uncover how language models (LMs) implement specific behaviors, yet measuring progress in MI remains challenging. The recently released Mechanistic Interpretability Benchmark (MIB; Mueller et al., 2025) provides a standardized framework for evaluating circuit and causal variable localization. Building on this foundation, the BlackboxNLP 2025 Shared Task extends MIB into a community-wide reproducible comparison of MI techniques. The shared task features two tracks: circuit localization, which assesses methods that identify causally influential components and interactions driving model behavior, and causal variable localization, which evaluates approaches that map activations into interpretable features. With three teams spanning eight different methods, participants achieved notable gains in circuit localization using ensemble and regularization strategies for circuit discovery. With one team spanning two methods, participants achieved significant gains in causal variable localization using low-dimensional and non-linear projections to featurize activation vectors. The MIB leaderboard remains open; we encourage continued work in this standard evaluation framework to measure progress in MI research going forward.

Related papers

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models [122.58252919699122]
Mechanistic Interpretability (MI) has emerged as a vital approach to demystify the decision-making of Large Language Models (LLMs). We present a practical survey structured around the pipeline: "Awesomeinterventionable-MI-Survey"
arXiv Detail & Related papers (2026-01-20T14:23:23Z)
BlackboxNLP-2025 MIB Shared Task: Exploring Ensemble Strategies for Circuit Localization Methods [64.5040037515574]
We investigate whether ensembling two or more circuit localization methods can improve performance. In parallel ensembling, we combine attribution scores assigned to each edge by different methods. In the sequential ensemble, we use edge attribution scores obtained via EAP-IG as a warm start for a more expensive but more precise circuit identification method.
arXiv Detail & Related papers (2025-10-08T09:39:40Z)
MOL: Joint Estimation of Micro-Expression, Optical Flow, and Landmark via Transformer-Graph-Style Convolution [46.600316142855334]
Facial micro-expression recognition (MER) is a challenging problem, due to transient and subtle micro-expression (ME) actions. We propose an end-to-end micro-action-aware deep learning framework with advantages from transformer, graph convolution, and vanilla convolution. Our framework outperforms the state-of-the-art MER methods on CASME II, SAMM, and SMIC benchmarks.
arXiv Detail & Related papers (2025-06-17T13:35:06Z)
MIB: A Mechanistic Interpretability Benchmark [77.35046700898326]
We propose MIB, a Mechanistic Interpretability Benchmark, with two tracks spanning four tasks and five models. Using MIB, we find that attribution and mask optimization methods perform best on circuit localization. For causal variable localization, we find that the supervised DAS method performs best, while SAE features are not better than neurons.
arXiv Detail & Related papers (2025-04-17T17:55:45Z)
Interpreting Object-level Foundation Models via Visual Precision Search [54.575247537324344]
We propose a Visual Precision Search method that generates accurate attribution maps with fewer regions. We show our approach enhances object-level task interpretability over SOTA for Grounding DINO and Florence-2 across various evaluation metrics. Our method can interpret failures in visual grounding and object detection tasks, surpassing existing methods across multiple evaluation metrics.
arXiv Detail & Related papers (2024-11-25T08:54:54Z)
Interactive incremental learning of generalizable skills with local trajectory modulation [14.416251854298409]
We propose an interactive imitation learning framework that simultaneously leverages local and global modulations of trajectory distributions. Our approach exploits the concept of via-points to incrementally and interactively 1) improve the model accuracy locally, 2) add new objects to the task during execution and 3) extend the skill into regions where demonstrations were not provided.
arXiv Detail & Related papers (2024-09-09T14:22:19Z)
On the Role of Discrete Tokenization in Visual Representation Learning [35.10829554701771]
Masked image modeling (MIM) has gained popularity alongside contrastive learning methods. A notable subset of MIM methods employs discrete tokens as the reconstruction target, but the theoretical underpinnings of this choice remain underexplored. We provide a comprehensive theoretical understanding of how discrete tokenization affects the model's generalization capabilities. We propose a novel metric named TCAS, which is specifically designed to assess the effectiveness of discrete tokens within the MIM framework.
arXiv Detail & Related papers (2024-07-12T08:25:31Z)
USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality. Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
Estimation of Reliable Proposal Quality for Temporal Action Detection [71.5989469643732]
We propose a new method that gives insights into moment and region perspectives simultaneously to align the two tasks by acquiring reliable proposal quality. For the moment perspective, the Boundary Evaluate Module (BEM) is designed to focus on local appearance and motion evolvement to estimate boundary quality. For the region perspective, we introduce the Region Evaluate Module (REM), which uses a new and efficient sampling method for proposal feature representation.
arXiv Detail & Related papers (2022-04-25T14:33:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.