Fugu-MT paper translation (abstract): Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity

Paper overview: Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity

arxiv url: http://arxiv.org/abs/2508.09218v1
Date: Mon, 11 Aug 2025 18:57:55 GMT
Status: translation complete
Last updated in system: 2025-08-14 20:42:00.629844
Title: Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity
Authors: Zuoou Li, Weitong Zhang, Jingyuan Wang, Shuyuan Zhang, Wenjia Bai, Bernhard Kainz, Mengyun Qiao
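Abstract summary: Multimodal large language models (MLLMs) are widely used in vision-language reasoning tasks. They remain vulnerable to adversarial prompts because their safety mechanisms often fail to prevent the generation of harmful outputs. This work proposes a four-axis evaluation framework that considers input on-topicness, input out-of-distribution (OOD) intensity, output harmfulness, and output refusal rate.
Reference score (attention score computed independently by this site): 24.809329513705915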
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal large language models (MLLMs) are widely used in vision-language reasoning tasks. However, their vulnerability to adversarial prompts remains a serious concern, as safety mechanisms often fail to prevent the generation of harmful outputs. Although recent jailbreak strategies report high success rates, many responses classified as "successful" are actually benign, vague, or unrelated to the intended malicious goal. This mismatch suggests that current evaluation standards may overestimate the effectiveness of such attacks. To address this issue, we introduce a four-axis evaluation framework that considers input on-topicness, input out-of-distribution (OOD) intensity, output harmfulness, and output refusal rate. This framework identifies truly effective jailbreaks. In a substantial empirical study, we reveal a structural trade-off: highly on-topic prompts are frequently blocked by safety filters, whereas those that are too OOD often evade detection but fail to produce harmful content. However, prompts that balance relevance and novelty are more likely to evade filters and trigger dangerous output. Building on this insight, we develop a recursive rewriting strategy called Balanced Structural Decomposition (BSD). The approach restructures malicious prompts into semantically aligned sub-tasks, while introducing subtle OOD signals and visual cues that make the inputs harder to detect. BSD was tested across 13 commercial and open-source MLLMs, where it consistently led to higher attack success rates, more harmful outputs, and fewer refusals. Compared to previous methods, it improves success rates by $67\%$ and harmfulness by $21\%$, revealing a previously underappreciated weakness in current multimodal safety systems.

Related papers