Formal Mechanistic Interpretability: Automated Circuit Discovery with Provable Guarantees
- URL: http://arxiv.org/abs/2602.16823v1
- Date: Wed, 18 Feb 2026 19:41:01 GMT
- Title: Formal Mechanistic Interpretability: Automated Circuit Discovery with Provable Guarantees
- Authors: Itamar Hadad, Guy Katz, Shahaf Bassan
- Abstract summary: We propose a suite of automated algorithms that yield circuits with provable guarantees. We focus on three types of guarantees: *input domain robustness*, *robust patching*, and *minimality*. We uncover a diverse set of novel theoretical connections among these three families of guarantees, with critical implications for the convergence of our algorithms.
- Score: 5.156069978876762
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: *Automated circuit discovery* is a central tool in mechanistic interpretability for identifying the internal components of neural networks responsible for specific behaviors. While prior methods have made significant progress, they typically depend on heuristics or approximations and do not offer provable guarantees over continuous input domains for the resulting circuits. In this work, we leverage recent advances in neural network verification to propose a suite of automated algorithms that yield circuits with *provable guarantees*. We focus on three types of guarantees: (1) *input domain robustness*, ensuring the circuit agrees with the model across a continuous input region; (2) *robust patching*, certifying circuit alignment under continuous patching perturbations; and (3) *minimality*, formalizing and capturing a wide array of notions of succinctness. Interestingly, we uncover a diverse set of novel theoretical connections among these three families of guarantees, with critical implications for the convergence of our algorithms. Finally, we conduct experiments with state-of-the-art verifiers on various vision models, showing that our algorithms yield circuits with substantially stronger robustness guarantees than standard circuit discovery methods, establishing a principled foundation for provable circuit discovery.
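The first guarantee above, input-domain robustness, requires the circuit to agree with the full model on *every* input in a continuous region, not just on sampled points. A minimal sketch of how such a certificate can be phrased, using interval bound propagation (IBP) on a toy two-layer ReLU network; the toy network, the zero-ablation of non-circuit neurons, and all function names are illustrative assumptions, not the paper's actual algorithm:

```python
import numpy as np

def ibp_bounds(W1, b1, W2, b2, lo, hi, mask):
    """Interval bound propagation through linear -> ReLU -> linear,
    with hidden units outside `mask` zero-ablated (the 'circuit')."""
    c, r = (lo + hi) / 2, (hi - lo) / 2       # box center and radius
    mu = W1 @ c + b1                           # pre-activation center
    rad = np.abs(W1) @ r                       # pre-activation radius
    h_lo = np.maximum(mu - rad, 0.0) * mask    # ReLU is monotone, so
    h_hi = np.maximum(mu + rad, 0.0) * mask    # bounds pass through
    c2, r2 = (h_lo + h_hi) / 2, (h_hi - h_lo) / 2
    out_mu = W2 @ c2 + b2
    out_rad = np.abs(W2) @ r2
    return out_mu - out_rad, out_mu + out_rad  # sound output bounds

def certified_agreement(W1, b1, W2, b2, lo, hi, mask, cls):
    """The circuit provably agrees with the full model on class `cls`
    over the box [lo, hi] if, for both the full network and the
    ablated circuit, the lower bound of logit `cls` exceeds the upper
    bound of every other logit."""
    for m in (np.ones_like(mask), mask):       # full model, then circuit
        l, u = ibp_bounds(W1, b1, W2, b2, lo, hi, m)
        if not all(l[cls] > u[j] for j in range(len(l)) if j != cls):
            return False
    return True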
Related papers
- Certified Circuits: Stability Guarantees for Mechanistic Circuits [80.30622018787835]
Certified Circuits provides provable stability guarantees for circuit discovery. On ImageNet and OOD datasets, certified circuits achieve up to 91% higher accuracy.
arXiv Detail & Related papers (2026-02-26T13:07:31Z) - M3S-UPD: Efficient Multi-Stage Self-Supervised Learning for Fine-Grained Encrypted Traffic Classification with Unknown Pattern Discovery [10.590761201003867]
This paper proposes M3S-UPD, a novel Multi-Stage Self-Supervised Unknown-aware Packet Detection framework. Key innovations include a self-supervised unknown detection mechanism that requires neither synthetic samples nor prior knowledge. Experimental results show that M3S-UPD not only outperforms existing methods on the few-shot encrypted traffic classification task, but also simultaneously achieves competitive performance.
arXiv Detail & Related papers (2025-05-27T17:34:01Z) - Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates [35.90665719234101]
We introduce three types of logic gates: AND, OR, and ADDER gates, and decompose the circuit into combinations of these logical gates. We propose a framework that combines noising-based and denoising-based interventions, which can be easily integrated into existing circuit discovery methods.
arXiv Detail & Related papers (2025-05-15T07:35:14Z) - Lie Detector: Unified Backdoor Detection via Cross-Examination Framework [68.45399098884364]
We propose a unified backdoor detection framework in the semi-honest setting. Our method achieves superior detection performance, improving accuracy by 5.4%, 1.6%, and 11.9% over SoTA baselines. Notably, it is the first to effectively detect backdoors in multimodal large language models.
arXiv Detail & Related papers (2025-03-21T06:12:06Z) - Position-aware Automatic Circuit Discovery [59.64762573617173]
We identify a gap in existing circuit discovery methods, which treat model components as equally relevant across input positions. We propose two improvements to incorporate positionality into circuits, even on tasks containing variable-length examples. Our approach enables fully automated discovery of position-sensitive circuits, yielding better trade-offs between circuit size and faithfulness compared to prior work.
arXiv Detail & Related papers (2025-02-07T00:18:20Z) - Transformer Circuit Faithfulness Metrics are not Robust [0.04260910081285213]
We measure circuit 'faithfulness' by ablating portions of the model's computation.
We conclude that existing circuit faithfulness scores reflect both the methodological choices of researchers and the actual components of the circuit.
The ultimate goal of mechanistic interpretability work is to understand neural networks, so we emphasize the need for more clarity in the precise claims being made about circuits.
arXiv Detail & Related papers (2024-07-11T17:59:00Z) - Sheaf Discovery with Joint Computation Graph Pruning and Flexible Granularity [18.71252449465396]
We introduce DiscoGP, a framework for extracting self-contained modular units from neural language models (LMs). Our framework identifies sheaves through a gradient-based pruning algorithm that operates on both weights and connections, reducing the original LM to a sparse skeleton that preserves certain core capabilities.
arXiv Detail & Related papers (2024-07-04T09:42:25Z) - Tripod: Three Complementary Inductive Biases for Disentangled Representation Learning [52.70210390424605]
In this work, we consider endowing a neural network autoencoder with three select inductive biases from the literature.
In practice, however, naively combining existing techniques instantiating these inductive biases fails to yield significant benefits.
We propose adaptations to the three techniques that simplify the learning problem, equip key regularization terms with stabilizing invariances, and quash degenerate incentives.
The resulting model, Tripod, achieves state-of-the-art results on a suite of four image disentanglement benchmarks.
arXiv Detail & Related papers (2024-04-16T04:52:41Z) - Bridging the Gap Between End-to-End and Two-Step Text Spotting [88.14552991115207]
Bridging Text Spotting is a novel approach that resolves the error accumulation and suboptimal performance issues in two-step methods.
We demonstrate the effectiveness of the proposed method through extensive experiments.
arXiv Detail & Related papers (2024-04-06T13:14:04Z) - Evidential Turing Processes [11.021440340896786]
We introduce an original combination of evidential deep learning, neural processes, and neural Turing machines.
We observe our method on three image classification benchmarks and two neural net architectures.
arXiv Detail & Related papers (2021-06-02T15:09:20Z) - Efficient and robust certification of genuine multipartite entanglement in noisy quantum error correction circuits [58.720142291102135]
We introduce a conditional witnessing technique to certify genuine multipartite entanglement (GME). We prove that detecting entanglement in a linear number of bipartitions, with a number of measurements that scales linearly, suffices to certify GME.
We apply our method to the noisy readout of stabilizer operators of the distance-three topological color code and its flag-based fault-tolerant version.
arXiv Detail & Related papers (2020-10-06T18:00:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.