Fugu-MT paper translation (abstract): MultiSHAP: A Shapley-Based Framework for Explaining Cross-Modal Interactions in Multimodal AI Models

Paper abstract: MultiSHAP: A Shapley-Based Framework for Explaining Cross-Modal Interactions in Multimodal AI Models

arxiv url: http://arxiv.org/abs/2508.00576v1
Date: Fri, 01 Aug 2025 12:19:18 GMT
Status: Translation complete
Last updated in system: 2025-08-04 18:08:53.871552
Title: MultiSHAP: A Shapley-Based Framework for Explaining Cross-Modal Interactions in Multimodal AI Models
Title (reference translation): MultiSHAP: A Shapley-Based Framework for Explaining Cross-Modal Interactions in Multimodal AI Models
Authors: Zhanliang Wang, Kai Wang
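Abstract summary: Multimodal AI models have achieved impressive performance on tasks that require integrating information from multiple modalities, such as vision and language. Explaining cross-modal interactions in multimodal AI models remains a major challenge.
Reference score (independently computed attention score): 5.011371514152517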
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal AI models have achieved impressive performance in tasks that require integrating information from multiple modalities, such as vision and language. However, their "black-box" nature poses a major barrier to deployment in high-stakes applications where interpretability and trustworthiness are essential. How to explain cross-modal interactions in multimodal AI models remains a major challenge. While existing model explanation methods, such as attention map and Grad-CAM, offer coarse insights into cross-modal relationships, they cannot precisely quantify the synergistic effects between modalities, and are limited to open-source models with accessible internal weights. Here we introduce MultiSHAP, a model-agnostic interpretability framework that leverages the Shapley Interaction Index to attribute multimodal predictions to pairwise interactions between fine-grained visual and textual elements (such as image patches and text tokens), while being applicable to both open- and closed-source models. Our approach provides: (1) instance-level explanations that reveal synergistic and suppressive cross-modal effects for individual samples - "why the model makes a specific prediction on this input", and (2) dataset-level explanation that uncovers generalizable interaction patterns across samples - "how the model integrates information across modalities". Experiments on public multimodal benchmarks confirm that MultiSHAP faithfully captures cross-modal reasoning mechanisms, while real-world case studies demonstrate its practical utility. Our framework is extensible beyond two modalities, offering a general solution for interpreting complex multimodal AI models.
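Abstract (reference translation): Multimodal AI models have achieved impressive performance on tasks that require integrating information from multiple modalities, such as vision and language. However, their "black-box" nature is a major barrier to deployment in high-stakes applications where interpretability and trustworthiness are essential. Explaining cross-modal interactions in multimodal AI models remains a major challenge. Existing model explanation methods, such as attention maps and Grad-CAM, offer coarse insights into cross-modal relationships, but they cannot precisely quantify the synergistic effects between modalities and are limited to open-source models with accessible internal weights. Here we introduce MultiSHAP, a model-agnostic interpretability framework that uses the Shapley Interaction Index to attribute multimodal predictions to pairwise interactions between fine-grained visual and textual elements (such as image patches and text tokens), and that applies to both open- and closed-source models. The approach provides (1) instance-level explanations that reveal synergistic and suppressive cross-modal effects for individual samples - "why the model makes a specific prediction on this input", and (2) dataset-level explanations that uncover generalizable interaction patterns across samples - "how the model integrates information across modalities". Experiments on public multimodal benchmarks confirm that MultiSHAP faithfully captures cross-modal reasoning mechanisms, and real-world case studies demonstrate its practical utility. The framework is extensible beyond two modalities, offering a general solution for interpreting complex multimodal AI models.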
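The abstract names the Shapley Interaction Index as the quantity used to score pairs of image patches and text tokens. The sketch below illustrates the textbook pairwise index on a toy problem; it is not the authors' implementation. The player names (`img:0`, `txt:0`, ...), the `toy_value` function, and the helper `cross_modal_matrix` are all hypothetical. In a real setting, `value` would be assumed to query the model with every element outside the coalition masked, and exact enumeration would likely be replaced by sampling, since its cost grows exponentially with the number of players.

```python
from itertools import combinations
from math import factorial


def shapley_interaction(players, value, i, j):
    """Exact pairwise Shapley Interaction Index for players i and j.

    `value` maps a frozenset of players to a float (e.g. a model score
    computed with all other players masked out; assumed, not from the paper).
    """
    others = [p for p in players if p not in (i, j)]
    n = len(players)
    total = 0.0
    for size in range(len(others) + 1):
        # Weight of every coalition S of this size: |S|! (n-|S|-2)! / (n-1)!
        weight = factorial(size) * factorial(n - size - 2) / factorial(n - 1)
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Second-order discrete derivative: joint effect of i and j
            # beyond the sum of their individual effects, given S.
            delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            total += weight * delta
    return total


# Hypothetical players: two image patches and two text tokens.
PLAYERS = ["img:0", "img:1", "txt:0", "txt:1"]


def toy_value(coalition):
    """Toy stand-in for a model score (an assumption for illustration)."""
    score = 0.0
    if "img:0" in coalition and "txt:0" in coalition:
        score += 1.0  # synergy: only fires when both elements are present
    if "img:1" in coalition:
        score += 0.3  # purely additive effect, no interaction
    return score


def cross_modal_matrix(players, value):
    """Interaction index for every (image patch, text token) pair."""
    patches = [p for p in players if p.startswith("img:")]
    tokens = [p for p in players if p.startswith("txt:")]
    return {(p, t): shapley_interaction(players, value, p, t)
            for p in patches for t in tokens}


if __name__ == "__main__":
    for pair, score in cross_modal_matrix(PLAYERS, toy_value).items():
        print(pair, round(score, 6))
```

In this toy game only the pair (`img:0`, `txt:0`) receives a nonzero index, because the additive contribution of `img:1` cancels in the second-order difference. A positive value would read as a synergistic cross-modal effect and a negative value as a suppressive one, matching the two kinds of effect the abstract mentions.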

Related papers