Fugu-MT paper translation (abstract): Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in Large Language Models

Paper overview: Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in Large Language Models

arxiv url: http://arxiv.org/abs/2510.08859v1
Date: Thu, 09 Oct 2025 23:26:28 GMT
Status: Translation complete
Last updated in system: 2025-10-14 00:38:47.892056
Title: Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in Large Language Models
Title (reference translation): Pattern-Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in Large Language Models
Authors: Ragib Amin Nihal, Rui Wen, Kazuhiro Nakadai, Jun Sakuma
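Abstract summary: Multi-turn jailbreak attacks target different harm categories through distinct conversational approaches. The authors propose Pattern Enhanced Chain of Attack (PE-CoA) to construct effective multi-turn jailbreaks through natural dialogue.
Reference score (independently computed attention metric): 9.744463020852615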
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) remain vulnerable to multi-turn jailbreaking attacks that exploit conversational context to bypass safety constraints gradually. These attacks target different harm categories (like malware generation, harassment, or fraud) through distinct conversational approaches (educational discussions, personal experiences, hypothetical scenarios). Existing multi-turn jailbreaking methods often rely on heuristic or ad hoc exploration strategies, providing limited insight into underlying model weaknesses. The relationship between conversation patterns and model vulnerabilities across harm categories remains poorly understood. We propose Pattern Enhanced Chain of Attack (PE-CoA), a framework of five conversation patterns to construct effective multi-turn jailbreaks through natural dialogue. Evaluating PE-CoA on twelve LLMs spanning ten harm categories, we achieve state-of-the-art performance, uncovering pattern-specific vulnerabilities and LLM behavioral characteristics: models exhibit distinct weakness profiles where robustness to one conversational pattern does not generalize to others, and model families share similar failure modes. These findings highlight limitations of safety training and indicate the need for pattern-aware defenses. Code available on: https://github.com/Ragib-Amin-Nihal/PE-CoA
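Abstract (reference translation): Large language models (LLMs) remain vulnerable to multi-turn jailbreak attacks that use conversational context to gradually bypass safety constraints. These attacks target different harm categories (such as malware generation, harassment, or fraud) through different conversational approaches (educational discussions, personal experiences, hypothetical scenarios). Existing multi-turn jailbreak methods often rely on heuristic or ad hoc exploration strategies and give limited insight into the underlying model weaknesses. The relationship between conversation patterns and model vulnerabilities across harm categories is still poorly understood. We propose Pattern Enhanced Chain of Attack (PE-CoA), a framework of five conversation patterns for constructing effective multi-turn jailbreaks through natural dialogue. We evaluate PE-CoA on twelve LLMs spanning ten harm categories and uncover pattern-specific vulnerabilities and LLM behavioral characteristics. These findings highlight the limitations of safety training and indicate the need for pattern-aware defenses. Code: https://github.com/Ragib-Amin-Nihal/PE-CoA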


Related papers