Fugu-MT Paper Translation (Abstract): Jailbreaking Large Language Models Through Content Concretization

Paper abstract: Jailbreaking Large Language Models Through Content Concretization

arxiv url: http://arxiv.org/abs/2509.12937v1
Date: Tue, 16 Sep 2025 10:34:26 GMT
Status: Translation complete
System update date: 2025-09-17 17:50:53.037299
Title: Jailbreaking Large Language Models Through Content Concretization
Authors: Johan Wahréus, Ahmed Hussain, Panos Papadimitratos
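Abstract summary: Large language models (LLMs) are increasingly deployed for task automation and content generation. This paper introduces Content Concretization (CC), a novel jailbreaking technique that transforms abstract malicious requests into concrete, executable implementations.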
Reference score (independently computed attention measure): 1.5599296461516985
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large Language Models (LLMs) are increasingly deployed for task automation and content generation, yet their safety mechanisms remain vulnerable to circumvention through different jailbreaking techniques. In this paper, we introduce Content Concretization (CC), a novel jailbreaking technique that iteratively transforms abstract malicious requests into concrete, executable implementations. CC is a two-stage process: first, generating initial LLM responses using lower-tier, less constrained safety filters models, then refining them through higher-tier models that process both the preliminary output and original prompt. We evaluate our technique using 350 cybersecurity-specific prompts, demonstrating substantial improvements in jailbreak Success Rates (SRs), increasing from 7% (no refinements) to 62% after three refinement iterations, while maintaining a cost of 7.5¢ per prompt. Comparative A/B testing across nine different LLM evaluators confirms that outputs from additional refinement steps are consistently rated as more malicious and technically superior. Moreover, manual code analysis reveals that generated outputs execute with minimal modification, although optimal deployment typically requires target-specific fine-tuning. With eventual improved harmful code generation, these results highlight critical vulnerabilities in current LLM safety frameworks.
