Fugu-MT Paper Translation (Abstract): GridVQA-X: A Framework for Evaluating Multimodal Explainability Methods

Paper summary: GridVQA-X: A Framework for Evaluating Multimodal Explainability Methods

arxiv url: http://arxiv.org/abs/2606.14740v1
Date: Tue, 02 Jun 2026 17:18:24 GMT
Status: Translation complete
Last updated in system: 2026-06-21 20:00:42.760309
Title: GridVQA-X: A Framework for Evaluating Multimodal Explainability Methods
Authors: Sujay Belsare, Sudarshan Nikhil, Sushant Kumar, Ponnurangam Kumaraguru, Chirag Agarwal
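Abstract summary: We introduce GridVQA-X, the first diagnostic framework designed to evaluate cross-modal explainability. Unlike natural datasets, GridVQA-X leverages a closed-world synthesis logic to generate unique, mathematically guaranteed explanations. Widely used explanation methods fail to distinguish between models that rely on genuine spatial-relational reasoning and models that exploit cross-modal shortcuts.
Reference score (attention score computed independently by this site): 20.57926775700787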
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With the increasing development of Vision-Language Models, it becomes imperative that their predictions are readily explainable to relevant stakeholders. However, the field of explainability has not kept pace with the multimodal surge. While recent Multimodal Explainable AI (MxAI) methods generate explanations to attribute the interaction between different modalities, current evaluation protocols lack the ground truth required to distinguish between true cross-modal reasoning (e.g., spatial composition) and shallow cross-modal shortcuts (e.g., Bag-of-Words attribute matching). It remains unknown whether MxAI methods faithfully capture synergistic interactions or merely hallucinate reasoning on models acting as simple feature detectors. In this paper, we introduce GridVQA-X, the first diagnostic framework specifically designed to evaluate cross-modal explainability. Unlike natural datasets, GridVQA-X leverages a closed-world synthesis logic to generate unique, mathematically guaranteed explanations. We utilize this controlled environment to train paired ground-truth models on identical architectures: $M_{\text{pure}}$, which learns robust spatial-relational reasoning and $M_{\text{spur}}$, which is structurally forced to rely on cross-modal shortcuts. This behavioral divergence creates a rigorous testbed: a faithful explainer must report distinct reasoning pathways for each model. Our findings reveal that widely used methods fail to distinguish between models relying on genuine spatial-relational reasoning and those exploiting cross-modal shortcuts, highlighting a critical gap in capturing true cross-modal synergy and misrepresenting how multimodal models actually make decisions.
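As a rough illustration of the closed-world idea in the abstract, the sketch below builds a tiny grid scene in which a spatial question has exactly one answer and exactly one set of supporting evidence, then scores an attribution against that evidence. It is not the authors' implementation: the scene layout, the question template, the names `Explanation`, `answer_left_of`, and `iou`, and the use of intersection-over-union as a score are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Explanation:
    """Hypothetical ground-truth explanation: the evidence an answer depends on."""
    cells: frozenset   # grid cells (row, col) that determine the answer
    tokens: frozenset  # question tokens that determine the answer


def answer_left_of(scene, color, shape):
    """Answer 'what is left of the <color> <shape>?' on a toy grid scene.

    `scene` maps (row, col) -> (color, shape). The anchor object must be
    unique, so the answer and its supporting evidence are unambiguous.
    """
    anchors = [pos for pos, obj in scene.items() if obj == (color, shape)]
    if len(anchors) != 1:
        raise ValueError("anchor object must appear exactly once")
    row, col = anchors[0]
    target = (row, col - 1)
    if target not in scene:
        raise ValueError("no object to the left of the anchor")
    explanation = Explanation(
        cells=frozenset({anchors[0], target}),
        tokens=frozenset({"left", color, shape}),
    )
    return scene[target], explanation


def iou(predicted, truth):
    """Intersection-over-union between attributed items and ground truth."""
    predicted, truth = set(predicted), set(truth)
    union = predicted | truth
    return len(predicted & truth) / len(union) if union else 1.0


# Assumed example scene with three objects on a 2x2 grid.
scene = {
    (0, 0): ("blue", "circle"),
    (0, 1): ("red", "square"),
    (1, 1): ("green", "triangle"),
}
answer, truth = answer_left_of(scene, "red", "square")

# A made-up attribution that highlights the anchor and an irrelevant cell.
score = iou({(0, 1), (1, 1)}, truth.cells)
```

In a setup like the one the abstract describes, a score of this kind would be computed separately for explanations of each paired model; under these assumptions, a faithful explainer should match the relational evidence for the reasoning model and diverge from it for the shortcut model.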

Related papers

Integrating Fine-Grained Audio-Visual Evidence for Robust Multimodal Emotion Reasoning [9.470507126417292]
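This paper introduces SABER-LLM, a framework for robust multimodal reasoning. First, the authors construct SABER, a large-scale emotion reasoning dataset of 600K video clips. Second, they propose a structured evidence decomposition paradigm that separates evidence extraction from reasoning ("perceptual reasoning").
Reference translation of paper (metadata) (2026-01-26T10:03:26Z)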
Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts [74.47786985522762]
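The authors identify a critical failure mode called textual inertia, in which models tend to adhere blindly to erroneous text while ignoring contradictory visual evidence. They propose the LogicGraph perturbation protocol, which structurally injects perturbations into the reasoning chains of diverse LMMs. Self-correction succeeded in fewer than 10% of cases, and the models mostly propagated the visual-textual errors.
Reference translation of paper (metadata) (2026-01-07T16:39:34Z)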
ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation [79.17352367219736]
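In ROVER, one modality is used to guide, verify, and refine the output of the other. ROVER is a human-annotated benchmark that explicitly targets reciprocal cross-modal reasoning.
Reference translation of paper (metadata) (2025-11-03T02:27:46Z)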
MultiSHAP: A Shapley-Based Framework for Explaining Cross-Modal Interactions in Multimodal AI Models [5.011371514152517]
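Multimodal AI models have achieved impressive performance on tasks that require integrating information from multiple modalities, such as vision and language. Explaining cross-modal interactions in multimodal AI models remains a major challenge.
Reference translation of paper (metadata) (2025-08-01T12:19:18Z)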
Rethinking Explainability in the Era of Multimodal AI [9.57008593971486]
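Multimodal AI systems have become ubiquitous, achieving strong performance across high-stakes applications. Most existing explainability techniques remain unimodal, generating modality-specific feature attributions, concepts, or circuit traces in isolation. This paper argues that such unimodal explanations systematically misrepresent, and fail to capture, the cross-modal influence that drives multimodal model decisions.
Reference translation of paper (metadata) (2025-06-16T03:08:29Z)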
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models [26.17300490736624]
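Multimodal Large Language Models (MLLMs) are predominantly trained and tested on consistent visual-textual inputs. This paper proposes the Multimodal Inconsistency Reasoning benchmark to evaluate the ability of MLLMs to detect and reason about semantic mismatches. The authors evaluate six state-of-the-art MLLMs and show that models with dedicated multimodal reasoning capabilities, such as o1, substantially outperform the others.
Reference translation of paper (metadata) (2025-02-22T01:52:37Z)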
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval [139.21955930418815]
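Cross-modal retrieval methods build similarity relations between the vision and language modalities by jointly learning a common representation space. However, the predictions are often unreliable because of aleatoric uncertainty caused by low-quality data, e.g., corrupted images, fast-paced videos, and non-detailed texts. This paper proposes a Prototype-based Aleatoric Uncertainty Quantification (PAU) framework.
Reference translation of paper (metadata) (2023-09-29T09:41:19Z)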
Multimodal Chain-of-Thought Reasoning in Language Models [94.70184390935661]
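The authors propose Multimodal-CoT, which incorporates the language (text) and vision (image) modalities into a two-stage framework. Results on the ScienceQA and A-OKVQA benchmarks demonstrate the effectiveness of the proposed approach.
Reference translation of paper (metadata) (2023-02-02T07:51:19Z)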

The related papers list is generated automatically from the titles and abstracts of papers on this site.
