Fugu-MT Paper Translation (Abstract): The Trojan Example: Jailbreaking LLMs through Template Filling and Unsafety Reasoning

Paper overview: The Trojan Example: Jailbreaking LLMs through Template Filling and Unsafety Reasoning

arxiv url: http://arxiv.org/abs/2510.21190v1
Date: Fri, 24 Oct 2025 06:43:10 GMT
Status: Translation complete
System update date: 2025-10-28 06:57:23.387211
Title: The Trojan Example: Jailbreaking LLMs through Template Filling and Unsafety Reasoning
Title (reference translation): The Trojan Horse Example: Jailbreaking LLMs through Template Filling and Unsafe Reasoning
Authors: Mingrui Liu, Sixiao Zhang, Cheng Long, Kwok Yan Lam
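Abstract summary: TrojFill is a black-box jailbreak that reframes unsafe instructions as a template-filling task. We evaluate TrojFill on standard jailbreak benchmarks across leading large language models. The generated prompts show improved interpretability and transferability compared with prior black-box optimization approaches.
Reference score (independently computed attention score): 47.85771791033142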
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have advanced rapidly and now encode extensive world knowledge. Despite safety fine-tuning, however, they remain susceptible to adversarial prompts that elicit harmful content. Existing jailbreak techniques fall into two categories: white-box methods (e.g., gradient-based approaches such as GCG), which require model internals and are infeasible for closed-source APIs, and black-box methods that rely on attacker LLMs to search or mutate prompts but often produce templates that lack explainability and transferability. We introduce TrojFill, a black-box jailbreak that reframes unsafe instruction as a template-filling task. TrojFill embeds obfuscated harmful instructions (e.g., via placeholder substitution or Caesar/Base64 encoding) inside a multi-part template that asks the model to (1) reason why the original instruction is unsafe (unsafety reasoning) and (2) generate a detailed example of the requested text, followed by a sentence-by-sentence analysis. The crucial "example" component acts as a Trojan Horse that contains the target jailbreak content while the surrounding task framing reduces refusal rates. We evaluate TrojFill on standard jailbreak benchmarks across leading LLMs (e.g., ChatGPT, Gemini, DeepSeek, Qwen), showing strong empirical performance (e.g., 100% attack success on Gemini-flash-2.5 and DeepSeek-3.1, and 97% on GPT-4o). Moreover, the generated prompts exhibit improved interpretability and transferability compared with prior black-box optimization approaches. We release our code, sample prompts, and generated outputs to support future red-teaming research.
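Abstract (reference translation): Large language models (LLMs) have advanced rapidly and now encode extensive world knowledge. Despite safety fine-tuning, however, they remain susceptible to adversarial prompts that elicit harmful content. Existing jailbreak techniques fall into two categories: white-box methods (gradient-based approaches such as GCG), which require model internals and cannot be used against closed-source APIs, and black-box methods that rely on attacker LLMs to search or mutate prompts. We introduce TrojFill, a black-box jailbreak. TrojFill embeds obfuscated harmful instructions (e.g., via placeholder substitution or Caesar/Base64 encoding) inside a multi-part template. The crucial "example" component acts as a Trojan horse that contains the target jailbreak content, while the surrounding task framing reduces refusal rates. We evaluate TrojFill on standard jailbreak benchmarks across leading LLMs (e.g., ChatGPT, Gemini, DeepSeek, Qwen) and show strong empirical performance (e.g., 100% on Gemini-flash-2.5 and DeepSeek-3.1, and 97% on GPT-4o). Moreover, the generated prompts exhibit improved interpretability and transferability compared with prior black-box optimization approaches. We release our code, sample prompts, and generated outputs to support future red-teaming research.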

Related papers