Fugu-MT Paper Translation (Summary): ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

Paper summary: ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

arxiv url: http://arxiv.org/abs/2605.02647v1
Date: Mon, 04 May 2026 14:32:40 GMT
Status: Translation complete
System update date: 2026-05-05 20:33:50.335653
Title: ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming
Title (reference translation): ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming
Authors: Mario Rodríguez Béjar, Francisco J. Cortés-Delgado, S. Braghin, Jose L. Hernández-Ramos
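Abstract summary: Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety alignment and elicit harmful responses. We present ContextualJailbreak, a black-box red-teaming strategy that performs evolutionary search over a simulated multi-turn primed dialogue.
Reference score (independently calculated attention score): 0.0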
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety alignment and elicit harmful responses. A growing body of work shows that contextual priming, where earlier turns covertly bias later replies, constitutes a powerful attack surface, with hand-crafted multi-turn scaffolds consistently outperforming single-turn manipulations on capable models. However, automated optimization-based red-teaming has remained largely limited to the single-turn setting, iterating over static prompts and lacking the ability to reason about which forms of conversational priming induce compliance. While recent multi-turn, search-based approaches have begun to bridge this gap, the mutator design space underlying effective primed dialogues remains largely unexplored. We present ContextualJailbreak, a black-box red-teaming strategy that performs evolutionary search over a simulated multi-turn primed dialogue. The strategy leverages a graded 0-5 harm score from a two-level judge as an in-loop signal, enabling partially harmful responses to guide the search process rather than being discarded. Search is driven by five semantically defined mutation operators: roleplay, scenario, expand, troubleshooting, and mechanistic, of which the last two are novel contributions of this work. Across 50 representative HarmBench behaviors, ContextualJailbreak achieves an ASR of 100% on gpt-oss:20B, 100% on qwen3-8B, 100% on llama3.1:70B, and 90% on gpt-oss:120B, outperforming four single- and multi-turn baselines by 31-96 percentage points on average. The 40 maximally harmful attacks discovered against gpt-oss:120B transfer without adaptation to closed frontier models, achieving 90.0% on gpt-4o-mini, 70.0% on gpt-5, and 70.0% on gemini-3-flash, but only 17.5% on claude-opus-4-7 and 15.0% on claude-sonnet-4-6, revealing a pronounced provider-level asymmetry in alignment robustness.
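Abstract (reference translation): Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety alignment and elicit harmful responses. A growing body of work shows that contextual priming, in which earlier turns covertly bias later replies, forms a powerful attack surface, and that hand-crafted multi-turn scaffolds consistently outperform single-turn manipulations on capable models. However, automated optimization-based red-teaming has been largely limited to the single-turn setting: it iterates over static prompts and cannot reason about which forms of conversational priming induce compliance. Recent multi-turn, search-based approaches have begun to bridge this gap, but the mutator design space underlying effective primed dialogues remains largely unexplored. We present ContextualJailbreak, a black-box red-teaming strategy that performs evolutionary search over a simulated multi-turn primed dialogue. The strategy uses a graded 0-5 harm score from a two-level judge as an in-loop signal, so that partially harmful responses guide the search instead of being discarded. Search is driven by five semantically defined mutation operators (roleplay, scenario, expand, troubleshooting, and mechanistic). Across 50 representative HarmBench behaviors, ContextualJailbreak achieves an ASR of 100% on gpt-oss:20B, 100% on qwen3-8B, 100% on llama3.1:70B, and 90% on gpt-oss:120B. The 40 maximally harmful attacks discovered against gpt-oss:120B transfer without adaptation to closed frontier models, achieving 90.0% on gpt-4o-mini, 70.0% on gpt-5, and 70.0% on gemini-3-flash, but only 17.5% on claude-opus-4-7 and 15.0% on claude-sonnet-4-6.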
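Note on the transfer figures: the abstract reports the transfer results only as percentages. The short sketch below converts them back into counts, assuming that each attack success rate (ASR) is simply successes divided by attempts and that the denominator is the 40 transferred attacks. Both points are assumptions, since the abstract does not state counts, and the helper name `implied_successes` is hypothetical.

```python
# Sketch only: recover success counts implied by the reported transfer ASRs.
# Assumption (not stated in the abstract): ASR = successes / attempts,
# with attempts = the 40 attacks transferred from gpt-oss:120B.
TRANSFERRED_ATTACKS = 40

# Percentages as reported in the abstract.
reported_asr_percent = {
    "gpt-4o-mini": 90.0,
    "gpt-5": 70.0,
    "gemini-3-flash": 70.0,
    "claude-opus-4-7": 17.5,
    "claude-sonnet-4-6": 15.0,
}


def implied_successes(asr_percent: float, attempts: int) -> int:
    """Hypothetical helper: number of successes implied by a percentage ASR."""
    return round(asr_percent / 100 * attempts)


implied = {
    model: implied_successes(percent, TRANSFERRED_ATTACKS)
    for model, percent in reported_asr_percent.items()
}

for model, count in implied.items():
    print(f"{model}: {count}/{TRANSFERRED_ATTACKS}")
```

Every reported rate is a multiple of 2.5%, which is consistent with a denominator of 40, but the paper itself should be consulted for the actual counts.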


Related papers