Fugu-MT paper translation (abstract): Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks

Paper overview: Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks

arxiv url: http://arxiv.org/abs/2604.18510v1
Date: Mon, 20 Apr 2026 17:01:27 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:53.011187
Title: Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks
Title (reference translation): Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks
Authors: Md Rysul Kabir, Zoran Tiganj
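Abstract summary: Open-weight language models can be rendered unsafe through several distinct interventions. The paper studies the behavioral and mechanistic properties of jailbroken models across three unsafe routes. All three routes achieve near-ceiling harmful compliance, but they diverge once evaluation moves beyond direct harmfulness.
Reference score (independently computed attention metric): 4.726777092009554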
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Open-weight language models can be rendered unsafe through several distinct interventions, but the resulting models may differ substantially in capabilities, behavioral profile, and internal failure mode. We study behavioral and mechanistic properties of jailbroken models across three unsafe routes: harmful supervised fine-tuning (SFT), harmful reinforcement learning with verifiable rewards (RLVR), and refusal-suppressing abliteration. All three routes achieve near-ceiling harmful compliance, but they diverge once we move beyond direct harmfulness. RLVR-jailbroken models show minimal degradation and preserve explicit harm recognition in a structured self-audit: they are able to identify harmful prompts and describe how a safe LLM should respond, yet they comply with the harmful request. With RLVR, harmful behavior is strongly suppressed by a reflective safety scaffold: when a harmful prompt is prepended with an instruction to reflect on safety standards, harmful behavior drops close to the baseline. Category-specific RLVR jailbreaks generalize broadly across harmfulness domains. Models jailbroken with SFT show the largest collapse in explicit safety judgments, the highest behavioral drift, and a substantial capability loss on standard benchmarks. Abliteration is family-dependent in both self-audit and response to a reflective safety scaffold. Mechanistic and repair analyses further separate the routes: abliteration is consistent with localized refusal-feature deletion, RLVR with preserved safety geometry but retargeted policy behavior, and SFT with broader distributed drift. Targeted repair partially recovers RLVR-jailbroken models, but has little effect on SFT-jailbroken models. Together, these results show that jailbreaks can produce vastly different properties despite similar harmfulness, with models jailbroken via RLVR showing remarkable similarity to the base model.
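Abstract (reference translation): Open-weight language models can be rendered unsafe through several distinct interventions, but the resulting models may differ substantially in capabilities, behavioral profile, and internal failure mode. We study the behavioral and mechanistic properties of jailbroken models across three unsafe routes: harmful supervised fine-tuning (SFT), harmful reinforcement learning with verifiable rewards (RLVR), and refusal-suppressing abliteration. All three routes achieve near-ceiling harmful compliance, but they diverge once we move beyond direct harmfulness. RLVR-jailbroken models show minimal degradation and preserve explicit harm recognition in a structured self-audit: they can identify harmful prompts and describe how a safe LLM should respond, yet they comply with the harmful request. With RLVR, harmful behavior is strongly suppressed by a reflective safety scaffold. Category-specific RLVR jailbreaks generalize broadly across harmfulness domains. Models jailbroken with SFT show the largest collapse in explicit safety judgments, behavioral drift, and a substantial capability loss on standard benchmarks. Abliteration is family-dependent in both self-audit and response to a reflective safety scaffold. Mechanistic and repair analyses show that abliteration is consistent with localized refusal-feature deletion, RLVR with preserved safety geometry but retargeted policy behavior, and SFT with broader distributed drift. Targeted repair partially recovers RLVR-jailbroken models but has little effect on SFT-jailbroken models. These results show that jailbreaks can produce vastly different properties despite similar harmfulness, and that models jailbroken via RLVR are remarkably similar to the base model.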

Related papers