Fugu-MT Paper Translation (Abstract): Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation

Paper abstract: Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation

arxiv url: http://arxiv.org/abs/2605.03058v1
Date: Mon, 04 May 2026 18:27:37 GMT
Status: Translation complete
Last updated in system: 2026-05-06 19:35:43.598447
Title: Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation
Title (reference translation): Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation
Authors: Francesco Sovrano, Gabriele Dominici, Marc Langheinrich
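Abstract summary: A key goal of explainable AI (XAI) is to express the decision logic of large language models (LLMs) in symbolic form. We introduce MechaRule, a pipeline that grounds rule extraction in LLM circuits by efficiently localizing sparse neurons called agonists.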
Reference score (independently computed attention metric): 5.880505093493663
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: A key goal of explainable AI (XAI) is to express the decision logic of large language models (LLMs) in symbolic form and link it to internal mechanisms. Global rule-extraction methods typically learn symbolic surrogates without grounding rules in model circuitry, while mechanistic interpretability can connect behaviors to neuron sets but often depends on hand-crafted hypotheses and expensive neuron-level interventions. We introduce MechaRule, a pipeline that grounds rule extraction in LLM circuits by efficiently localizing sparse neurons called agonists, whose activation neutralization disrupts rule-related behaviors. MechaRule rests on two empirical observations. First, within a fixed baseline/flip regime, sparse agonist effects can be approximately monotone and saturating: a few dominant neuron activations can overtop weaker ones at coarse scales, while overlapping neurons flip many of the same examples. This motivates viewing localization as adaptive group testing driven by a regime-conditional strength predicate with confidence-guided conservative pruning, yielding Theta(k log(N/k) + k) interventions over N candidates when k << N neurons are agonists under the monotone-overtopping abstraction. Second, agonists emerge more reliably when ablations are verified through data splits aligned with close-to-faithful rule behavior; spectral splits remain a useful rule-free fallback, while unfaithful splits degrade localization. Empirically, overtopping appears mainly in learned, task-aligned regimes: on arithmetic and jailbreak tasks across Qwen2 and GPT-J, MechaRule recalls 96.8% of high-effect brute-force agonists in completed comparisons, and suppressing localized agonists reduces arithmetic accuracy and jailbreak success by up to 71.1% and 8.8%, respectively.
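Abstract (reference translation): A key goal of explainable AI (XAI) is to express the decision logic of large language models (LLMs) in symbolic form and to link it to internal mechanisms. Global rule-extraction methods typically learn symbolic surrogates without grounding the rules in model circuitry, whereas mechanistic interpretability can tie behaviors to sets of neurons but often relies on hand-crafted hypotheses and expensive neuron-level interventions. We introduce MechaRule, a pipeline that grounds rule extraction in LLM circuits by efficiently localizing sparse neurons called agonists, whose activation neutralization disrupts rule-related behaviors. MechaRule rests on two empirical observations. First, within a fixed baseline/flip regime, sparse agonist effects can be approximately monotone and saturating: a few dominant neuron activations can overtop weaker ones at coarse scales, while overlapping neurons flip many of the same examples. This motivates treating localization as adaptive group testing driven by a regime-conditional strength predicate with confidence-guided conservative pruning, which yields Theta(k log(N/k) + k) interventions over N candidates when k << N neurons are agonists under the monotone-overtopping abstraction. Second, agonists emerge more reliably when ablations are verified through data splits aligned with close-to-faithful rule behavior; spectral splits remain a useful rule-free fallback, while unfaithful splits degrade localization. Empirically, overtopping appears mainly in learned, task-aligned regimes: on arithmetic and jailbreak tasks across Qwen2 and GPT-J, MechaRule recalls 96.8% of high-effect brute-force agonists in completed comparisons, and suppressing the localized agonists reduces arithmetic accuracy and jailbreak success by up to 71.1% and 8.8%, respectively.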
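
The group-testing view of localization described in the abstract can be illustrated with a toy sketch. This is not MechaRule's implementation: it uses plain recursive halving, and the names `find_agonists` and `make_toy_oracle` are hypothetical. The oracle stands in for an ablation experiment and assumes an idealized, noise-free monotone predicate (a group "fires" exactly when it contains at least one agonist), which is a simplification of the regime-conditional strength predicate and confidence-guided pruning mentioned in the abstract.

```python
from typing import Callable, Iterable, List, Sequence, Set, Tuple


def make_toy_oracle(agonists: Set[int]) -> Callable[[Sequence[int]], bool]:
    """Hypothetical stand-in for a group ablation test.

    Returns True when the group contains at least one agonist. A real test
    would ablate the group's neurons and measure the behavioral change.
    """
    def group_is_active(group: Sequence[int]) -> bool:
        return any(neuron in agonists for neuron in group)
    return group_is_active


def find_agonists(
    candidates: Iterable[int],
    group_is_active: Callable[[Sequence[int]], bool],
) -> Tuple[List[int], int]:
    """Adaptive group testing by recursive halving.

    Tests a group; if it is inactive, the whole group is pruned with one
    intervention. If it is active, the group is split in half and each half
    is tested in turn. Returns the agonists found and the number of tests.
    """
    tests = 0
    found: List[int] = []
    stack: List[List[int]] = [list(candidates)]
    while stack:
        group = stack.pop()
        if not group:
            continue
        tests += 1
        if not group_is_active(group):
            continue  # prune: no agonist in this group
        if len(group) == 1:
            found.append(group[0])  # isolated a single agonist
            continue
        mid = len(group) // 2
        stack.append(group[mid:])
        stack.append(group[:mid])
    return sorted(found), tests


# Toy run: 3 agonists hidden among 1024 candidate neurons.
oracle = make_toy_oracle({3, 500, 777})
found, tests = find_agonists(range(1024), oracle)
print(found, tests)
```

In this idealized setting the number of tests grows roughly with k log N rather than with N, which is the intuition behind the intervention count quoted in the abstract; a brute-force search would instead need one intervention per candidate.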

