Related papers: BlockCert: Certified Blockwise Extraction of Transformer Mechanisms

URL: http://arxiv.org/abs/2511.17645v1
Date: Thu, 20 Nov 2025 06:04:34 GMT
Title: BlockCert: Certified Blockwise Extraction of Transformer Mechanisms
Authors: Sandro Andric
Abstract summary: We introduce BlockCert, a framework for certified blockwise extraction of transformer mechanisms. We formalize a simple Lipschitz-based composition theorem in Lean 4 that lifts these local guarantees to a global deviation bound. Our results suggest that blockwise extraction with explicit certificates is feasible for real transformer language models.
Score: 0.0
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Mechanistic interpretability aspires to reverse-engineer neural networks into explicit algorithms, while model editing seeks to modify specific behaviours without retraining. Both areas are typically evaluated with informal evidence and ad-hoc experiments, with few explicit guarantees about how far an extracted or edited model can drift from the original on relevant inputs. We introduce BlockCert, a framework for certified blockwise extraction of transformer mechanisms, and outline how a lightweight extension can support certified local edits. Given a pre-trained transformer and a prompt distribution, BlockCert extracts structured surrogate implementations for residual blocks together with machine-checkable certificates that bound approximation error, record coverage metrics, and hash the underlying artifacts. We formalize a simple Lipschitz-based composition theorem in Lean 4 that lifts these local guarantees to a global deviation bound. Empirically, we apply the framework to GPT-2 small, TinyLlama-1.1B-Chat, and Llama-3.2-3B. Across these models we obtain high per-block coverage and small residual errors on the evaluated prompts, and in the TinyLlama setting we show that a fully stitched model matches the baseline perplexity within approximately 6e-5 on stress prompts. Our results suggest that blockwise extraction with explicit certificates is feasible for real transformer language models and offers a practical bridge between mechanistic interpretability and formal reasoning about model behaviour.
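Note: the abstract does not state the composition theorem itself. As a rough illustration of how per-block guarantees can be lifted to a global bound, the sketch below uses the standard telescoping argument, which may differ from the paper's Lean 4 statement. It assumes each original block i is L_i-Lipschitz and that its surrogate deviates from it by at most eps_i on the inputs the surrogate actually receives. Under those assumptions the accumulated deviation satisfies delta_i <= L_i * delta_(i-1) + eps_i, which unrolls to the sum over i of eps_i times the product of L_j for j > i. The function name and arguments are hypothetical.

```python
from typing import Sequence


def global_deviation_bound(eps: Sequence[float], lipschitz: Sequence[float]) -> float:
    """Upper bound on the output deviation of a stack of surrogate blocks.

    eps[i]       -- assumed per-block approximation error of surrogate block i
    lipschitz[i] -- assumed Lipschitz constant of original block i
    Blocks are listed in the order they are applied.
    """
    if len(eps) != len(lipschitz):
        raise ValueError("eps and lipschitz must have the same length")
    if any(e < 0 for e in eps) or any(l < 0 for l in lipschitz):
        raise ValueError("errors and Lipschitz constants must be non-negative")

    delta = 0.0  # deviation entering the first block (identical inputs)
    for e, l in zip(eps, lipschitz):
        # Incoming deviation is amplified by at most l, then the
        # block's own approximation error e is added.
        delta = l * delta + e
    return delta


if __name__ == "__main__":
    # Two blocks with errors 1.0 and 2.0 and Lipschitz constants 2.0 and 3.0.
    print(global_deviation_bound([1.0, 2.0], [2.0, 3.0]))
```

The recurrence makes the practical trade-off visible: errors in early blocks are multiplied by the Lipschitz constants of every later block, so a bound of this kind is only tight when those constants are small or the early errors are tiny.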

Related papers

TraceGuard: Process-Guided Firewall against Reasoning Backdoors in Large Language Models [19.148124494194317]
We propose TraceGuard, a process-guided security framework that transforms small-scale models into robust reasoning firewalls. Our approach treats the reasoning trace as an untrusted payload and establishes a defense-in-depth strategy. We demonstrate robustness against adaptive adversaries in a grey-box setting, establishing TraceGuard as a viable, low-latency security primitive.
arXiv Detail & Related papers (2026-03-02T22:19:13Z)
TorchLean: Formalizing Neural Networks in Lean [71.68907600404513]
We introduce TorchLean, a framework that treats learned models as first-class mathematical objects with a single, precise semantics shared by execution and verification. We validate TorchLean end-to-end on certified robustness, physics-informed residual bounds for PINNs, and Lyapunov-style neural controller verification.
arXiv Detail & Related papers (2026-02-26T05:11:44Z)
zkCraft: Prompt-Guided LLM as a Zero-Shot Mutation Pattern Oracle for TCCT-Powered ZK Fuzzing [7.274627641804014]
zkCraft is a framework that combines deterministic, R1CS-aware localization with proof-bearing search to detect semantic inconsistencies. We show that proof-bearing localization detects diverse under- and over-constrained faults with low false positives and reduces costly solver interaction.
arXiv Detail & Related papers (2026-01-31T11:31:00Z)
Theoretical Foundations of Prompt Engineering: From Heuristics to Expressivity [0.0]
We study the family of functions obtainable by holding a Transformer backbone fixed as an executor and varying only the prompt. We prove a constructive existential result showing that a single fixed backbone can approximate a broad class of target behaviors via prompts alone.
arXiv Detail & Related papers (2025-12-14T13:42:20Z)
Binary Autoencoder for Mechanistic Interpretability of Large Language Models [8.725176890854065]
We propose a novel Binary Autoencoder variant that enforces minimal entropy on minibatches of hidden activations. For efficient entropy calculation, we discretize the hidden activations to 1-bit via a step function. We empirically evaluate it and use it to characterize the inference dynamics of Large Language Models.
arXiv Detail & Related papers (2025-09-25T10:48:48Z)
TrustLoRA: Low-Rank Adaptation for Failure Detection under Out-of-distribution Data [62.22804234013273]
We propose a simple failure detection framework to unify and facilitate classification with rejection under both covariate and semantic shifts. Our key insight is that by separating and consolidating failure-specific reliability knowledge with low-rank adapters, we can enhance the failure detection ability effectively and flexibly.
arXiv Detail & Related papers (2025-04-20T09:20:55Z)
Perturb, Attend, Detect and Localize (PADL): Robust Proactive Image Defense [5.150608040339816]
We introduce PADL, a new solution able to generate image-specific perturbations using a symmetric scheme of encoding and decoding based on cross-attention. Our method generalizes to a range of unseen models with diverse architectural designs, such as StarGANv2, BlendGAN, DiffAE, StableDiffusion and StableDiffusionXL.
arXiv Detail & Related papers (2024-09-26T15:16:32Z)
SINDER: Repairing the Singular Defects of DINOv2 [61.98878352956125]
Vision Transformer models trained on large-scale datasets often exhibit artifacts in the patch token they extract. We propose a novel fine-tuning smooth regularization that rectifies structural deficiencies using only a small dataset.
arXiv Detail & Related papers (2024-07-23T20:34:23Z)
Toward Certified Robustness Against Real-World Distribution Shifts [65.66374339500025]
We train a generative model to learn perturbations from data and define specifications with respect to the output of the learned model. A unique challenge arising from this setting is that existing verifiers cannot tightly approximate sigmoid activations. We propose a general meta-algorithm for handling sigmoid activations which leverages classical notions of counter-example-guided abstraction refinement.
arXiv Detail & Related papers (2022-06-08T04:09:13Z)
Automatic Mixed-Precision Quantization Search of BERT [62.65905462141319]
Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks. These models usually contain millions of parameters, which prevents them from practical deployment on resource-constrained devices. We propose an automatic mixed-precision quantization framework designed for BERT that can simultaneously conduct quantization and pruning in a subgroup-wise level.
arXiv Detail & Related papers (2021-12-30T06:32:47Z)
CC-Cert: A Probabilistic Approach to Certify General Robustness of Neural Networks [58.29502185344086]
In safety-critical machine learning applications, it is crucial to defend models against adversarial attacks. It is important to provide provable guarantees for deep learning models against semantically meaningful input transformations. We propose a new universal probabilistic certification approach based on Chernoff-Cramer bounds.
arXiv Detail & Related papers (2021-09-22T12:46:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.