Fugu-MT Paper Translation (Abstract): Jailbreaking Frontier Foundation Models Through Intention Deception

Paper overview: Jailbreaking Frontier Foundation Models Through Intention Deception

arxiv url: http://arxiv.org/abs/2604.24082v1
Date: Mon, 27 Apr 2026 06:12:43 GMT
Status: Translation complete
Last updated in system: 2026-04-28 17:12:07.766263
Title: Jailbreaking Frontier Foundation Models Through Intention Deception
Authors: Xinhe Wang, Katia Sycara, Yaqi Xie
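Abstract summary: Large (vision-)language models show remarkable capability but are susceptible to jailbreaking. This paper proposes a novel multi-turn jailbreaking method that exploits this vulnerability. The approach also uncovered a new class of model vulnerability called para-jailbreaking.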
Reference score (independently computed attention score): 6.119674554651102
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large (vision-)language models exhibit remarkable capability but remain highly susceptible to jailbreaking. Existing safety training approaches aim to have the model learn a refusal boundary between safe and unsafe requests, based on the user's intent. This binary training regime has been found to be brittle, since user intent cannot be reliably evaluated, especially when the attacker obfuscates it, and it also makes the system seem unhelpful. In response, frontier models such as GPT-5 have shifted from refusal-based safeguards to safe completion, which aims to maximize helpfulness while obeying safety constraints. However, safe completion could be exploited when a user pretends their intention is benign. Specifically, this intent inversion would be effective in multi-turn conversation, where the attacker has multiple opportunities to reinforce their deceptively benign intent. In this work, we introduce a novel multi-turn jailbreaking method that exploits this vulnerability. Our approach gradually builds conversational trust by simulating benign-seeming intentions and by exploiting the consistency property of the model, ultimately guiding the target model toward harmful, detailed outputs. Most crucially, our approach also uncovered an additional class of model vulnerability, which we call para-jailbreaking, that has gone unnoticed until now. Para-jailbreaking describes the situation where the model does not give a harmful direct reply to the attack query, yet the information it reveals is nevertheless harmful. Our contributions are threefold. First, our method achieves high success rates against frontier models including GPT-5-thinking and Claude-Sonnet-4.5. Second, our approach revealed and addressed para-jailbreaking harmful output. Third, experiments on multimodal VLMs showed that our approach outperformed state-of-the-art models.


List of related papers