Fugu-MT Paper Translation (Abstract): Scalable Hierarchical Attention Transformers for Multi-Turn Jailbreak Detection in Long Conversations

Paper summary: Scalable Hierarchical Attention Transformers for Multi-Turn Jailbreak Detection in Long Conversations

arxiv url: http://arxiv.org/abs/2606.21082v1
Date: Fri, 19 Jun 2026 04:05:43 GMT
Status: Retrieving information
Last updated in system: 2026-06-23 11:18:07.622317
Title: Scalable Hierarchical Attention Transformers for Multi-Turn Jailbreak Detection in Long Conversations
Authors: Chenhui Hu, Muhammed Salih, Sudipto Guha, Subramanian Srinivasan,
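Abstract summary: Multi-turn jailbreaks can evade turn-level moderation by spreading unsafe intent across a dialogue. This paper proposes an efficient hierarchical detector that avoids expensive long-context concatenation. The proposed method achieves an F1 of 0.9394 on a benchmark of 14,038 conversations.
Reference score (independently computed attention metric): 1.8565979134741906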
License:
Abstract: Multi-turn jailbreaks can evade turn-level moderation by spreading unsafe intent across a dialogue through gradual escalation, reframing, and role manipulation. We address multi-turn jailbreak detection as a conversation-level classification problem and introduce an efficient hierarchical detector that avoids expensive long-context concatenation while retaining cross-turn reasoning. The model encodes individual turns to form compact turn representations and applies a lightweight conversation module that captures dialogue dynamics and selectively attends to fine-grained evidence when needed. On a challenging evaluation benchmark of 14,038 conversations, our approach achieves an F1 of 0.9394, outperforming Claude Opus 4.7, the strongest competing baseline, by 0.07 while halving its false-positive rate. Ablation studies confirm that each architectural component contributes meaningfully, with combining cross-attention and self-attention in the conversation module yielding a 2.26 percentage point reduction in false-positive rate over the self-attention-only variant.
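The two-stage design described in the abstract (encode each turn separately, then reason across turns and attend back to fine-grained evidence) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the encoder, pooling, or classifier, so the mean pooling, single-head attention, untrained random weights, and every function name below (`mean_pool`, `attend`, `score_conversation`) are assumptions chosen only to show the data flow.

```python
import math
import random


def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def score_conversation(turns, classifier_weights, bias=0.0):
    """Return a probability-like score for one conversation.

    turns: list of turns; each turn is a list of token vectors of equal size.
    classifier_weights: hypothetical linear weights of length 2 * dim.
    """
    # Stage 1 (assumed): compress each turn independently into one vector,
    # so the full conversation is never concatenated into one long sequence.
    turn_vectors = [mean_pool(turn) for turn in turns]

    # Stage 2a (assumed): self-attention across the compact turn vectors.
    contextual = [attend(t, turn_vectors, turn_vectors) for t in turn_vectors]
    summary = mean_pool(contextual)

    # Stage 2b (assumed): cross-attention from the conversation summary
    # back to token-level vectors, standing in for "fine-grained evidence".
    all_tokens = [tok for turn in turns for tok in turn]
    evidence = attend(summary, all_tokens, all_tokens)

    # Linear classifier with a sigmoid over the concatenated features.
    features = summary + evidence
    logit = sum(w * f for w, f in zip(classifier_weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Usage with random, untrained inputs: three turns of 3, 5, and 2 tokens.
random.seed(0)
dim = 4
conversation = [
    [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
    for n_tokens in (3, 5, 2)
]
weights = [random.gauss(0, 1) for _ in range(2 * dim)]
prob = score_conversation(conversation, weights)
print(f"illustrative score from untrained weights: {prob:.3f}")
```

Because the self-attention step operates on one vector per turn rather than on every token, its cost grows with the number of turns instead of the total conversation length, which is the kind of saving the abstract attributes to avoiding long-context concatenation.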

Related papers