Fugu-MT Paper Translation (Abstract): Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

Paper overview: Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

arxiv url: http://arxiv.org/abs/2604.27401v1
Date: Thu, 30 Apr 2026 04:13:33 GMT
Status: Translation complete
Last updated in system: 2026-05-01 16:31:53.92085
Title: Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs
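Title (reference translation): Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs
Authors: Hongliang Liu, Tung-Ling Li, Yuhao Wu
Abstract summary: Perturbation probing generates task-specific causal hypotheses for FFN neurons in large language models. Across eight behavioral circuits, 13 models, and four architecture families, the authors identify two circuit structures that organize LLM behavior.
Reference score (independently computed attention metric): 9.127363793428119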
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Perturbation probing generates task-specific causal hypotheses for FFN neurons in large language models using two forward passes per prompt and no backpropagation, followed by a one-time intervention sweep of about 150 passes amortized across all identified neurons. Across eight behavioral circuits, 13 models, and four architecture families, we identify two circuit structures that organize LLM behavior. Opposition circuits appear when RLHF suppresses a pre-training tendency. In safety refusal, about 50 neurons, or 0.014 percent of all neurons, control the refusal template; ablating them changes 80 percent of response formats on 520 AdvBench prompts while producing near-zero harmful compliance, 3 of 520 cases, all with disclaimers. Routing circuits appear for pre-training behaviors distributed through attention. For language selection, residual-stream direction injection switches English to Chinese output on 99.1 percent of 580 benchmark prompts in the 3 of 19 tested models that satisfy three observed conditions: bilingual training, FFN-to-skip signal ratio between 0.3 and 1.1, and linear representability. The same intervention fails on the other 16 models and on math, code, and factual circuits, defining the limits of directional steering. The FFN-to-skip signal ratio, computed from the same two forward passes, distinguishes the two structures and predicts the appropriate intervention. Circuit topology varies by architecture, from Qwen's concentrated FFN bottleneck to Gemma's normalization-shielded circuit. In Qwen3.5-2B, ablating 20 neurons eliminates multi-turn sycophantic capitulation, while amplifying 10 related neurons improves factual correction from 52 percent to 88 percent on 200 TruthfulQA prompts. These results show that perturbation probing offers mechanistic insight into RLHF-organized behavior and a practical toolkit for precision template-layer editing.
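Abstract (reference translation): Perturbation probing generates task-specific causal hypotheses for FFN neurons in large language models using two forward passes per prompt and no backpropagation, followed by a one-time intervention sweep of about 150 passes that is amortized across all identified neurons. Across eight behavioral circuits, 13 models, and four architecture families, the authors identify two circuit structures that organize LLM behavior. Opposition circuits appear when RLHF suppresses a pre-training tendency. In safety refusal, about 50 neurons, or 0.014 percent of all neurons, control the refusal template; ablating them changes 80 percent of response formats on 520 AdvBench prompts while producing near-zero harmful compliance (3 of 520 cases, all with disclaimers). Routing circuits appear for pre-training behaviors that are distributed through attention. For language selection, injecting a direction into the residual stream switches output from English to Chinese on 99.1 percent of 580 benchmark prompts in the 3 of 19 tested models that satisfy three observed conditions: bilingual training, an FFN-to-skip signal ratio between 0.3 and 1.1, and linear representability. The same intervention fails on the other 16 models and on math, code, and factual circuits, which defines the limits of directional steering. The FFN-to-skip signal ratio, computed from the same two forward passes, distinguishes the two structures and predicts the appropriate intervention. Circuit topology varies by architecture, from Qwen's concentrated FFN bottleneck to Gemma's normalization-shielded circuit. In Qwen3.5-2B, ablating 20 neurons eliminates multi-turn sycophantic capitulation, while amplifying 10 related neurons improves factual correction from 52 percent to 88 percent on 200 TruthfulQA prompts. These results show that perturbation probing offers mechanistic insight into RLHF-organized behavior and a practical toolkit for precision template-layer editing.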
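Note: the abstract refers to ablating individual FFN neurons and measuring the effect on model output. As a rough illustration only, the sketch below shows generic neuron ablation on a toy two-layer ReLU FFN in plain Python. It is not the paper's perturbation-probing procedure, which the abstract does not specify; the function names (`ffn_forward`, `ablation_effect`), the toy dimensions, and the random weights are all assumptions made for this example. The last line does the arithmetic implied by "about 50 neurons, or 0.014 percent of all neurons", which is only approximate because the neuron count is itself approximate.

```python
import random

def ffn_forward(x, w_in, w_out, ablate=frozenset()):
    """Toy two-layer ReLU FFN; hidden units listed in `ablate` are zeroed."""
    hidden = []
    for j, row in enumerate(w_in):
        pre = sum(w * xi for w, xi in zip(row, x))
        act = max(0.0, pre)  # ReLU activation
        hidden.append(0.0 if j in ablate else act)
    d_out = len(w_out[0])
    return [sum(hidden[j] * w_out[j][k] for j in range(len(hidden)))
            for k in range(d_out)]

def ablation_effect(x, w_in, w_out, neurons):
    """L2 distance between the baseline output and the ablated output."""
    base = ffn_forward(x, w_in, w_out)
    cut = ffn_forward(x, w_in, w_out, ablate=frozenset(neurons))
    return sum((b - c) ** 2 for b, c in zip(base, cut)) ** 0.5

# Hypothetical toy setup: 4-dimensional residual stream, 8 hidden neurons.
rng = random.Random(0)
d_model, d_hidden = 4, 8
w_in = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
w_out = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
x = [rng.uniform(-1, 1) for _ in range(d_model)]

# Rank neurons by how much zeroing each one alone changes the output.
effects = {j: ablation_effect(x, w_in, w_out, [j]) for j in range(d_hidden)}
ranked = sorted(effects, key=effects.get, reverse=True)

# Arithmetic from the abstract: 50 neurons at 0.014 percent of the total.
implied_total = round(50 / (0.014 / 100))
```

In a real model the ablation would be applied to the FFN activations inside the forward pass, and the effect would be measured on generated behavior rather than on a raw output vector.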

Related papers