Fugu-MT paper translation (abstract): SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior

Paper overview: SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior

arxiv url: http://arxiv.org/abs/2606.18322v1
Date: Tue, 16 Jun 2026 15:04:17 GMT
Status: translation complete
Last updated in system: 2026-06-18 17:16:50.824903
Title: SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior
Authors: Mingyue Cui, Linghui Shen, Xingyi Yang
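Abstract summary: Sparse autoencoders (SAEs) decompose residual-stream activations into interpretable features. Clamping a specific harmful feature is expected to reliably prevent model misbehavior. We show that the clamp may block one visible route to a behavior without eliminating the behavior itself, and we formulate this vulnerability as post-intervention recovery, a constrained residual-space optimization problem.
Reference score (attention score computed by this site): 38.75847400495247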
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Sparse Autoencoders (SAEs) decompose residual-stream activations into interpretable features. Recent latent-space defenses increasingly rely on these decompositions, assuming that identified "unsafe" SAE features serve as actionable handles for monitoring and intervention. In this paradigm, clamping a specific harmful feature is expected to reliably prevent model misbehavior. However, we show that this success may hide a recoverable failure mode: the clamp may block one visible route to a behavior without eliminating the behavior itself. We formulate this vulnerability as post-intervention recovery, a constrained residual-space optimization problem. Starting from the post-intervention residual state, we optimize residual perturbations to recover the pre-intervention behavior while preserving the post-intervention values of the targeted SAE features. Even under a strong threat model where the intervention remains active throughout optimization and generation, recovery remains possible. To rule out that recovery simply undoes the intervention, we use encoder-orthogonal updates for single-layer interventions and the corresponding feature-map Jacobian in the cross-layer setting. Across TPP, unlearning, IOI, and refusal steering experiments, this stress test reveals recoverable behavior despite successful feature-level intervention. Especially in the safety-critical refusal-steering setting, we achieve a 95.8% recovery rate on valid samples while keeping defended-feature relative drift to 0.131, substantially below suffix-based baselines. A recovery-path attribution analysis further localizes this recovery to the SAE reconstruction residual, the component left unexplained by the SAE. These results expose a gap between feature-level control and behavioral completeness: SAE features can support causal intervention, but controlling them does not guarantee control over the underlying behavior.
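The abstract's "encoder-orthogonal updates" can be illustrated with a minimal sketch. It assumes a standard SAE encoder whose pre-activation for feature k is w_k · x + b_k; the abstract does not specify the encoder form, so this is an assumption. Under it, any residual update orthogonal to the targeted encoder rows leaves those pre-activations unchanged. All names and numbers below (`W_target`, `b_target`, `raw_update`, the toy vectors) are hypothetical and are not taken from the paper. The cross-layer Jacobian variant is not covered.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(rows, tol=1e-12):
    """Gram-Schmidt over the targeted encoder rows; near-dependent rows are dropped."""
    basis = []
    for row in rows:
        r = list(row)
        for q in basis:
            c = dot(r, q)
            r = [ri - c * qi for ri, qi in zip(r, q)]
        norm = math.sqrt(dot(r, r))
        if norm > tol:
            basis.append([ri / norm for ri in r])
    return basis

def project_out(update, basis):
    """Remove from `update` every component lying in span(basis)."""
    out = list(update)
    for q in basis:
        c = dot(out, q)
        out = [oi - c * qi for oi, qi in zip(out, q)]
    return out

def feature_preact(x, w, b):
    """Pre-activation of one SAE feature (assumed form: w . x + b)."""
    return dot(w, x) + b

# Hypothetical toy setup: a 4-dimensional residual state and 2 targeted features.
W_target = [[1.0, 0.0, 2.0, 0.0],
            [0.0, 1.0, 0.0, -1.0]]
b_target = [0.1, -0.2]
x = [0.5, -1.0, 0.25, 2.0]           # stand-in for the post-intervention residual
raw_update = [0.3, 0.7, -0.4, 0.2]   # stand-in for a recovery-loss gradient step

basis = orthonormal_basis(W_target)
safe_update = project_out(raw_update, basis)
x_new = [xi + ui for xi, ui in zip(x, safe_update)]

before = [feature_preact(x, w, b) for w, b in zip(W_target, b_target)]
after = [feature_preact(x_new, w, b) for w, b in zip(W_target, b_target)]
```

In this toy case `before` and `after` agree up to floating-point error while `safe_update` is still nonzero, which is the property the constraint relies on: the residual state can move without changing the targeted feature pre-activations.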

Related papers