Fugu-MT Paper Translation (Summary): Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks

Paper summary: Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks

arxiv url: http://arxiv.org/abs/2510.15017v1
Date: Thu, 16 Oct 2025 17:41:09 GMT
Status: Translation complete
Last updated in system: 2025-10-20 20:17:34.325155
Title: Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks
Title (reference translation): Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks
Authors: ChenYu Wu, Yi Wang, Yang Liao
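Abstract summary: Large language models (LLMs) are increasingly vulnerable to multi-turn jailbreak attacks. We propose a honeypot-based proactive guardrail system that transforms risk avoidance into risk utilization.
Reference score (independently computed attention score): 5.366454120356494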
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) are increasingly vulnerable to multi-turn jailbreak attacks, where adversaries iteratively elicit harmful behaviors that bypass single-turn safety filters. Existing defenses predominantly rely on passive rejection, which either fails against adaptive attackers or overly restricts benign users. We propose a honeypot-based proactive guardrail system that transforms risk avoidance into risk utilization. Our framework fine-tunes a bait model to generate ambiguous, non-actionable but semantically relevant responses, which serve as lures to probe user intent. Combined with the protected LLM's safe reply, the system inserts proactive bait questions that gradually expose malicious intent through multi-turn interactions. We further introduce the Honeypot Utility Score (HUS), which measures both the attractiveness and the feasibility of bait responses, and use a Defense Efficacy Rate (DER) to balance safety and usability. Initial experiments on the MHJ datasets with a recent attack method against GPT-4o show that our system significantly disrupts jailbreak success while preserving the experience of benign users.
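Abstract (reference translation): Large language models (LLMs) are increasingly vulnerable to multi-turn jailbreak attacks, which iteratively elicit harmful behaviors that bypass single-turn safety filters. Existing defenses rely mainly on passive rejection, which either fails against adaptive attacks or overly restricts benign users. We propose a honeypot-based proactive guardrail system that transforms risk avoidance into risk utilization. Our framework fine-tunes a bait model to generate ambiguous, non-actionable but semantically relevant responses that serve as lures for probing user intent. Combined with the protected LLM's safe reply, the system inserts proactive bait questions that gradually expose malicious intent over multi-turn interactions. We further introduce the Honeypot Utility Score (HUS), which measures both the attractiveness and the feasibility of bait responses, and use a Defense Efficacy Rate (DER) to balance safety and usability. Initial experiments on the MHJ datasets with a recent attack method against GPT-4o show that our system significantly disrupts jailbreak success while preserving the experience of benign users.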
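
The multi-turn loop described in the abstract (a safe reply combined with a proactive bait question, with intent confirmed over several turns) might be sketched roughly as follows. This is a toy illustration only, not the authors' implementation: the class `HoneypotGuardrail`, the three callables, the `confirm_after` threshold, and the keyword-based judge are all hypothetical stand-ins. The abstract does not define HUS or DER, so neither metric is modeled here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HoneypotGuardrail:
    """Toy loop: safe reply plus bait question, intent confirmed over turns."""

    safe_reply: Callable[[str], str]     # hypothetical stand-in for the protected LLM
    bait_question: Callable[[str], str]  # hypothetical stand-in for the bait model
    takes_bait: Callable[[str], bool]    # hypothetical stand-in for an intent judge
    confirm_after: int = 2               # assumed threshold, not taken from the paper
    bait_hits: int = 0
    pending_bait: bool = False

    def respond(self, user_msg: str) -> str:
        # Count a hit only when the user follows up on a bait offered earlier.
        if self.pending_bait and self.takes_bait(user_msg):
            self.bait_hits += 1
        # Once enough baits have been taken, treat the intent as confirmed.
        if self.bait_hits >= self.confirm_after:
            self.pending_bait = False
            return "[blocked: malicious intent confirmed]"
        # Otherwise answer safely and append a new bait question.
        self.pending_bait = True
        return f"{self.safe_reply(user_msg)}\n{self.bait_question(user_msg)}"


def demo_safe_reply(msg: str) -> str:
    return "I can only discuss this topic at a high level."


def demo_bait_question(msg: str) -> str:
    return "Are you interested in the theory, or in doing it step by step?"


def demo_takes_bait(msg: str) -> bool:
    # Crude keyword check, used purely for illustration.
    return "step by step" in msg.lower()


def run_dialogue(messages: list[str]) -> list[str]:
    guard = HoneypotGuardrail(demo_safe_reply, demo_bait_question, demo_takes_bait)
    return [guard.respond(m) for m in messages]
```

In this sketch a user who never follows up on a bait question keeps receiving safe replies, while one who takes the bait `confirm_after` times is cut off. A real system would replace the keyword check with a learned judge and would need a policy for what happens after confirmation.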


Related papers