Fugu-MT Paper Translation (Abstract): Mitigating Adaptive Attacks against Reasoning Models with Activation Consistency Training

Paper abstract: Mitigating Adaptive Attacks against Reasoning Models with Activation Consistency Training

arxiv url: http://arxiv.org/abs/2605.28467v1
Date: Wed, 27 May 2026 13:33:26 GMT
Status: Translation complete
Last updated in system: 2026-05-28 17:38:56.077112
Title: Mitigating Adaptive Attacks against Reasoning Models with Activation Consistency Training
Authors: Avidan Shah, Jannik Brinkmann, Rico Angell,
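Abstract summary: We study consistency training, a family of fine-tuning objectives that enforce identical behavior on clean prompts and adversarial rewrites. We formulate these methods as a prompt injection defense and find ACT to be competitive with other training-based defenses. We also show that ACT's defense against jailbreaks is encoded as a roughly linear shift in activation space at the assistant-turn boundary.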
Reference score (independently computed attention score): 7.873125096854494
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As LLMs gain stronger reasoning capabilities, their extended chain-of-thought introduces new degrees of complexity for defending against adversarial jailbreaks and prompt injection. We study consistency training, a family of fine-tuning objectives that enforce identical behavior on clean prompts and adversarial rewrites, and evaluate its two main variants, output-level (BCT) and activation-level (ACT), across five reasoning models. We formulate both methods as a prompt injection defense and find ACT to be competitive with other training-based defenses while requiring only self-supervised pairs of clean and wrapped prompts. Our experiments also generalize both techniques within the jailbreak setting, demonstrating that ACT remains more robust to adaptive attacks. We also provide mechanistic evidence that ACT's defense against jailbreaks is encoded as a roughly linear shift in activation space at the assistant-turn boundary. After ACT training, we can recover a single steering direction that controls refusal on reasoning models with minimal effect on benign inputs. We find that ACT remains robust even when the model's chain-of-thought is replaced with a compliant trace from the undefended base model, pivoting to refuse prefilled jailbreaks. Together, these results suggest that supervising internal representations is a surprisingly effective and interpretable approach to various forms of safety training in reasoning models.
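The abstract describes ACT only at a high level: it supervises internal activations so that a clean prompt and its adversarially wrapped version behave the same, and the resulting defense appears as a roughly linear shift from which a single steering direction can be recovered. The sketch below illustrates those two ideas on plain lists of numbers. It is not the authors' implementation. The squared-error form of the loss, the choice to compare one activation vector per prompt pair, the difference-of-means estimate of the direction, and the function names `activation_consistency_loss` and `mean_shift_direction` are all assumptions made for illustration.

```python
import math


def activation_consistency_loss(clean_acts, wrapped_acts):
    """Mean squared L2 distance between paired activation vectors.

    Assumption: each prompt pair contributes one vector per side (for
    example, taken at a single token position), and the penalty is a
    plain squared error. The abstract does not specify the loss form.
    """
    if not clean_acts or len(clean_acts) != len(wrapped_acts):
        raise ValueError("need the same non-zero number of clean and wrapped vectors")
    total = 0.0
    for clean, wrapped in zip(clean_acts, wrapped_acts):
        if len(clean) != len(wrapped):
            raise ValueError("paired vectors must have the same dimension")
        total += sum((w - c) ** 2 for c, w in zip(clean, wrapped))
    return total / len(clean_acts)


def mean_vector(acts):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    if not acts:
        raise ValueError("need at least one vector")
    return [sum(column) / len(acts) for column in zip(*acts)]


def mean_shift_direction(before_acts, after_acts):
    """Unit vector along the difference of mean activations.

    Assumption: a difference of means is one simple way to summarize a
    roughly linear shift; the paper's recovery procedure may differ.
    """
    mean_before = mean_vector(before_acts)
    mean_after = mean_vector(after_acts)
    diff = [a - b for a, b in zip(mean_after, mean_before)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("mean activations are identical; no direction to recover")
    return [d / norm for d in diff]
```

For example, `activation_consistency_loss([[0, 0], [1, 1]], [[3, 4], [1, 1]])` evaluates to 12.5, since only the first pair differs (squared distance 25, averaged over two pairs). In an actual training setup this quantity would be computed on model activations and minimized by gradient descent, which this toy version does not attempt.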


Related papers