Fugu-MT Paper Translation (Abstract): PAST2HARM: A Simple Adaptive Past Tense Attack for Jailbreaking Multimodal AI

Paper overview: PAST2HARM: A Simple Adaptive Past Tense Attack for Jailbreaking Multimodal AI

arxiv url: http://arxiv.org/abs/2605.27545v1
Date: Tue, 26 May 2026 18:16:22 GMT
Status: Translation complete
Last updated in system: 2026-05-28 17:38:55.399484
Title: PAST2HARM: A Simple Adaptive Past Tense Attack for Jailbreaking Multimodal AI
Authors: Snehasis Mukhopadhyay
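Abstract summary: PAST2HARM is an adaptive jailbreak framework that bypasses refusal training in state-of-the-art multimodal text-to-image models. PAST2HARM is evaluated on three models (Gemini Nano Banana Pro, GPT Image 2, and SD XL), achieving attack success rates of 83%, 67%, and 100%, respectively, in a black-box, gradient-free setting. The attack elicits diverse harmful outputs, including explicit sexual content, political disinformation, historical denial narratives, hate speech, and self-harm glorification.
Reference score (independently computed attention score): 0.0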
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Jailbreak attacks on multimodal AI systems remain underexplored, even though unsafe image generation can have more severe consequences than unsafe text and current defenses are relatively immature. We introduce PAST2HARM, a simple yet effective adaptive jailbreak framework that bypasses refusal training in state-of-the-art multimodal text-to-image models. Building on prior findings that past-tense reformulations can evade safeguards, PAST2HARM systematically exploits this vulnerability in multimodal generative AI. We characterize the attack along two dimensions. First, breadth: through temporal deepening, the framework incrementally strengthens historical anchoring and archival cues, eroding refusal boundaries across models with varying alignment strength. Second, depth: via iterative escalation after initial compliance, we probe the upper bound of harmful generation, measuring severity using a scalar severity jailbreak metric evaluated by a language model acting as a judge. We find that mid-conversation turns form peak vulnerability windows, where harmfulness increases before plateauing and eventually undergoing semantic inversion. We evaluate PAST2HARM on three models (Gemini Nano Banana Pro, GPT Image 2, and SD XL), achieving attack success rates of 83 percent, 67 percent, and 100 percent, respectively, in a black-box, gradient-free setting. Adversarial prompts also transfer across models, with cross-model success rates above 50 percent. The attack elicits diverse harmful outputs, including explicit sexual content, political disinformation, historical denial narratives, hate speech, and self-harm glorification. We further release a curated benchmark of prompts, reformulations, and outputs as a resource for red-teaming and alignment. Our results expose fundamental brittleness in current safeguards and highlight the need for stronger multimodal safety training.

Related papers