Fugu-MT Paper Translation (Abstract): VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety

Paper summary: VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety

arxiv url: http://arxiv.org/abs/2510.18214v1
Date: Tue, 21 Oct 2025 01:30:31 GMT
Status: Translation complete
System update date: 2025-10-25 03:08:12.756251
Title: VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety
Title (reference translation): VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety
Authors: Shruti Palaskar, Leon Gatys, Mona Abdelrahman, Mar Jacobo, Larry Lindsey, Rutika Moharir, Gunnar Lund, Yang Xu, Navid Shiee, Jeffrey Bigham, Charles Maalouf, Joseph Yitan Cheng
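Abstract summary: We propose Vision Language Safety Understanding, a comprehensive framework for evaluating multimodal safety. Evaluating eleven state-of-the-art models reveals systematic joint understanding failures. Our framework exposes weaknesses in joint image-text understanding and alignment gaps in current models.
Reference score (independently computed attention score): 3.1109025622085693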
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Safety evaluation of multimodal foundation models often treats vision and language inputs separately, missing risks from joint interpretation where benign content becomes harmful in combination. Existing approaches also fail to distinguish clearly unsafe content from borderline cases, leading to problematic over-blocking or under-refusal of genuinely harmful content. We present Vision Language Safety Understanding (VLSU), a comprehensive framework to systematically evaluate multimodal safety through fine-grained severity classification and combinatorial analysis across 17 distinct safety patterns. Using a multi-stage pipeline with real-world images and human annotation, we construct a large-scale benchmark of 8,187 samples spanning 15 harm categories. Our evaluation of eleven state-of-the-art models reveals systematic joint understanding failures: while models achieve 90%-plus accuracy on clear unimodal safety signals, performance degrades substantially to 20-55% when joint image-text reasoning is required to determine the safety label. Most critically, 34% of errors in joint image-text safety classification occur despite correct classification of the individual modalities, further demonstrating absent compositional reasoning capabilities. Additionally, we find that models struggle to balance refusing unsafe content while still responding to borderline cases that deserve engagement. For example, we find that instruction framing can reduce the over-blocking rate on borderline content from 62.4% to 10.4% in Gemini-1.5, but only at the cost of under-refusing on unsafe content with refusal rate dropping from 90.8% to 53.9%. Overall, our framework exposes weaknesses in joint image-text understanding and alignment gaps in current models, and provides a critical test bed to enable the next milestones in research on robust vision-language safety.
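Abstract (reference translation): Safety evaluation of multimodal foundation models often treats vision and language inputs separately, and so misses the risks of joint interpretation, where individually benign content becomes harmful in combination. Existing approaches also cannot distinguish clearly unsafe content from borderline cases, which leads to problematic over-blocking or to under-refusal of genuinely harmful content. Vision Language Safety Understanding (VLSU) is a comprehensive framework that systematically evaluates multimodal safety through fine-grained severity classification and combinatorial analysis across 17 distinct safety patterns. Using a multi-stage pipeline with real-world images and human annotation, we build a large-scale benchmark of 8,187 samples spanning 15 harm categories. Our evaluation of eleven state-of-the-art models reveals systematic joint understanding failures: accuracy exceeds 90% on clear unimodal safety signals, but falls to 20-55% when joint image-text reasoning is needed to determine the safety label. Most importantly, 34% of the errors in joint image-text safety classification occur even though the individual modalities are classified correctly, which further indicates that compositional reasoning is missing. We also find that models struggle to balance refusing unsafe content against responding to borderline cases that deserve engagement. For example, in Gemini-1.5, instruction framing can cut the over-blocking rate on borderline content from 62.4% to 10.4%, but only at the cost of under-refusing unsafe content, with the refusal rate dropping from 90.8% to 53.9%. Overall, our framework exposes weaknesses in joint image-text understanding and alignment gaps in current models, and provides an important test bed for the next milestones in research on robust vision-language safety.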
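The abstract reports three quantities without defining them: the over-blocking rate on borderline content, the refusal rate on unsafe content, and the share of joint classification errors that occur even though both individual modalities were classified correctly. The sketch below shows one plausible way to compute them. The `Sample` schema, the label names, and the metric definitions are assumptions for illustration, not the authors' evaluation code, and the toy data is invented.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical schema; the abstract does not specify a data format.
    gold_joint: str       # assumed labels: "safe", "borderline", "unsafe"
    pred_joint: str       # model's label for the image-text pair
    image_correct: bool   # image-only prediction matched its own gold label
    text_correct: bool    # text-only prediction matched its own gold label
    refused: bool         # model declined to respond


def rate(flags):
    """Fraction of True values; NaN when there is nothing to count."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else float("nan")


def over_blocking_rate(samples):
    # Assumed definition: refusals among borderline samples.
    return rate(s.refused for s in samples if s.gold_joint == "borderline")


def unsafe_refusal_rate(samples):
    # Assumed definition: refusals among unsafe samples.
    return rate(s.refused for s in samples if s.gold_joint == "unsafe")


def joint_errors_with_correct_unimodal(samples):
    # Among joint misclassifications, the share where both modalities
    # were individually classified correctly.
    errors = [s for s in samples if s.pred_joint != s.gold_joint]
    return rate(s.image_correct and s.text_correct for s in errors)


# Invented toy data, only to exercise the functions.
samples = [
    Sample("borderline", "unsafe", True, True, True),
    Sample("borderline", "borderline", True, True, False),
    Sample("unsafe", "unsafe", True, True, True),
    Sample("unsafe", "safe", True, False, False),
    Sample("safe", "safe", True, True, False),
    Sample("unsafe", "borderline", True, True, True),
]

print(over_blocking_rate(samples))
print(unsafe_refusal_rate(samples))
print(joint_errors_with_correct_unimodal(samples))
```

Under these definitions, the trade-off described for Gemini-1.5 would show up as `over_blocking_rate` and `unsafe_refusal_rate` falling together when the instruction framing changes.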
