Fugu-MT Paper Translation (Abstract): Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs

Paper overview: Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs

arxiv url: http://arxiv.org/abs/2605.23157v1
Date: Fri, 22 May 2026 02:12:45 GMT
Status: Translation complete
Last updated in system: 2026-05-25 17:29:20.159602
Title: Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs
Authors: Casey Ford, Madison Van Doren, Sicheng Jin, Emily Dix,
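Abstract summary: We present the first systematic cross-lingual, multimodal red-teaming study comparing jailbreak vulnerability in US English (en-US) and Mexican Spanish (es-MX). Our central finding is that language does not scale vulnerability uniformly. This indicates that linguistic and visual alignment failures operate through distinct mechanisms, and that switching language is sufficient to expose that separation.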
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The attack surface of a multimodal large language model (MLLM) is language-dependent in ways that reveal the mechanistic structure of alignment failures. We present the first systematic cross-lingual, multimodal red-teaming study comparing jailbreak vulnerability in US English (en-US) and Mexican Spanish (es-MX) across four frontier MLLMs: Claude Sonnet 4.5, GPT-5, Pixtral Large, and Qwen Omni. Using a fixed adversarial benchmark of 363 diverse prompt scenarios administered in text-only and multimodal conditions, we collected 52,272 harm ratings and binary attack success judgements from matched panels of nine native-speaker annotators per language group. Our central finding is that language does not scale vulnerability uniformly. Bayesian mixed-effects analyses reveal that linguistic framing attacks such as role-play become substantially less effective under Spanish prompting, while visually explicit multimodal attacks become more effective, which directly implicates the prompt-language interface rather than global annotator leniency. This dissociation indicates that linguistic and visual alignment failures operate through distinct mechanisms, and that switching language is sufficient to expose that separation. The practical consequence is that safety rankings are not preserved across languages. Qwen Omni overtakes Pixtral Large as the most vulnerable model among es-MX participants, a rank reversal no scalar correction of English-condition scores could recover, and absolute attack success rates have declined across model generations without closing the gaps between them. These findings demonstrate that safety evaluation frameworks treating language and modality as independent dimensions fundamentally misspecify the attack surface of globally deployed MLLMs, and must be redesigned accordingly.
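Two claims in the abstract can be checked with a small sketch. First, the stated total of 52,272 ratings is consistent with a fully crossed design (363 prompts, 4 models, 2 conditions, 2 language groups, 9 annotators per group); the abstract does not state that the design is fully crossed, so that factorization is an assumption here. Second, a "scalar correction" multiplies every score by the same positive factor, which cannot change the order of the models, so it cannot reproduce a rank reversal. The scores and model labels below are hypothetical placeholders, not results from the paper.

```python
# Design factors as stated in the abstract.
N_PROMPTS = 363
N_MODELS = 4
N_CONDITIONS = 2           # text-only and multimodal
N_LANGUAGE_GROUPS = 2      # en-US and es-MX
ANNOTATORS_PER_GROUP = 9


def expected_ratings() -> int:
    """Number of ratings if every factor is fully crossed (an assumption)."""
    return (N_PROMPTS * N_MODELS * N_CONDITIONS
            * N_LANGUAGE_GROUPS * ANNOTATORS_PER_GROUP)


def ranking(scores: dict[str, float]) -> list[str]:
    """Model labels ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def rescale(scores: dict[str, float], factor: float) -> dict[str, float]:
    """Apply one positive scalar correction to every score."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return {model: score * factor for model, score in scores.items()}


# Hypothetical attack success rates; placeholders, not data from the paper.
hypothetical_scores = {"model_a": 0.30, "model_b": 0.20, "model_c": 0.10}

if __name__ == "__main__":
    print(expected_ratings())
    for factor in (0.25, 0.5, 2.0):
        # The ordering is identical for every positive factor.
        print(factor, ranking(rescale(hypothetical_scores, factor)))
```

Under these assumptions, any per-language correction that is a single multiplier leaves the ordering untouched, which is why a reversal between two models points to an interaction between language and model rather than a uniform shift.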
