Fugu-MT Paper Translation (Summary): Bypassing Prompt Guards in Production with Controlled-Release Prompting

Paper summary: Bypassing Prompt Guards in Production with Controlled-Release Prompting

arxiv url: http://arxiv.org/abs/2510.01529v1
Date: Thu, 02 Oct 2025 00:04:21 GMT
Status: Translation complete
Last updated in system: 2025-10-03 16:59:20.908976
Title: Bypassing Prompt Guards in Production with Controlled-Release Prompting
Authors: Jaiden Fairoze, Sanjam Garg, Keewoo Lee, Mingyuan Wang
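Abstract summary: We introduce a new attack that circumvents prompt guards, highlighting their limitations. Our method consistently jailbreaks production models while maintaining response quality. This reveals an attack surface inherent to lightweight prompt guards in modern LLM architectures.
Reference score (independently computed attention score): 11.65770031195044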
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As large language models (LLMs) advance, ensuring AI safety and alignment is paramount. One popular approach is prompt guards, lightweight mechanisms designed to filter malicious queries while being easy to implement and update. In this work, we introduce a new attack that circumvents such prompt guards, highlighting their limitations. Our method consistently jailbreaks production models while maintaining response quality, even under the highly protected chat interfaces of Google Gemini (2.5 Flash/Pro), DeepSeek Chat (DeepThink), Grok (3), and Mistral Le Chat (Magistral). The attack exploits a resource asymmetry between the prompt guard and the main LLM, encoding a jailbreak prompt that lightweight guards cannot decode but the main model can. This reveals an attack surface inherent to lightweight prompt guards in modern LLM architectures and underscores the need to shift defenses from blocking malicious inputs to preventing malicious outputs. We additionally identify other critical alignment issues, such as copyrighted data extraction, training data extraction, and malicious response leakage during thinking.


Related papers