Fugu-MT Paper Translation (Abstract): XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models

Paper overview: XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models

arxiv url: http://arxiv.org/abs/2510.15148v1
Date: Thu, 16 Oct 2025 21:10:22 GMT
Status: Translation complete
Last updated in system: 2025-10-20 20:17:34.399986
Title: XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
Title (reference translation): XModBench: A Benchmark of Cross-Modal Capabilities and Consistency in Omni-Language Models
Authors: Xingrui Wang, Jiang Liu, Chao Huang, Xiaodong Yu, Ze Wang, Ximeng Sun, Jialian Wu, Alan Yuille, Emad Barsoum, Zicheng Liu
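Abstract summary: We introduce XModBench, a large-scale tri-modal benchmark for measuring cross-modal consistency. XModBench comprises 60,828 multiple-choice questions spanning five task families. Experiments show that even the strongest model, Gemini 2.5 Pro, struggles with spatial and temporal reasoning.
Reference score (independently computed attention metric): 29.42489557439947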
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Omni-modal large language models (OLLMs) aim to unify audio, vision, and text understanding within a single framework. While existing benchmarks primarily evaluate general cross-modal question-answering ability, it remains unclear whether OLLMs achieve modality-invariant reasoning or exhibit modality-specific biases. We introduce XModBench, a large-scale tri-modal benchmark explicitly designed to measure cross-modal consistency. XModBench comprises 60,828 multiple-choice questions spanning five task families and systematically covers all six modality compositions in question-answer pairs, enabling fine-grained diagnosis of an OLLM's modality-invariant reasoning, modality disparity, and directional imbalance. Experiments show that even the strongest model, Gemini 2.5 Pro, (i) struggles with spatial and temporal reasoning, achieving less than 60% accuracy, (ii) reveals persistent modality disparities, with performance dropping substantially when the same semantic content is conveyed through audio rather than text, and (iii) shows systematic directional imbalance, exhibiting lower consistency when vision serves as context compared to text. These findings indicate that current OLLMs remain far from truly modality-invariant reasoning and position XModBench as a fundamental diagnostic tool for evaluating and improving cross-modal competence. All data and evaluation tools will be available at https://xingruiwang.github.io/projects/XModBench/.
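Abstract (reference translation): Omni-modal large language models (OLLMs) aim to integrate audio, vision, and text understanding into a single framework. Existing benchmarks evaluate general cross-modal question-answering ability, but it is not clear whether OLLMs achieve modality-invariant reasoning or exhibit modality-specific biases. XModBench is a large-scale tri-modal benchmark designed explicitly to measure cross-modal consistency. It consists of 60,828 multiple-choice questions spanning five task families and systematically covers all six modality compositions of question-answer pairs, enabling fine-grained diagnosis of an OLLM's modality-invariant reasoning, modality disparity, and directional imbalance. Experiments show that even the strongest model, Gemini 2.5 Pro, (i) struggles with spatial and temporal reasoning, with accuracy below 60%; (ii) shows persistent modality disparities, with performance dropping substantially when the same semantic content is conveyed through audio rather than text; and (iii) shows systematic directional imbalance, with lower consistency when vision, rather than text, serves as the context. These results indicate that current OLLMs remain far from truly modality-invariant reasoning and that XModBench is a fundamental diagnostic tool for evaluating and improving cross-modal competence. All data and evaluation tools will be available at https://xingruiwang.github.io/projects/XModBench/.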
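The abstract does not define how the six modality compositions or the consistency measures are computed. The sketch below is one plausible reading, not the authors' evaluation code: it assumes a composition is an ordered (context, candidates) pair of distinct modalities drawn from audio, vision, and text, and it uses a simple accuracy difference as a stand-in for directional imbalance. The function names and the toy records are hypothetical.

```python
from collections import defaultdict
from itertools import permutations

MODALITIES = ("audio", "vision", "text")

# Assumption: a "composition" is an ordered (context, candidates) pair of
# distinct modalities, which gives 3 * 2 = 6 configurations.
COMPOSITIONS = list(permutations(MODALITIES, 2))


def accuracy_by_composition(records):
    """Compute accuracy per (context, candidates) modality pair.

    records: iterable of (context_modality, candidate_modality, is_correct).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for context, candidates, is_correct in records:
        totals[(context, candidates)] += 1
        hits[(context, candidates)] += int(is_correct)
    return {pair: hits[pair] / totals[pair] for pair in totals}


def directional_gap(acc, a, b):
    """Illustrative imbalance measure: accuracy(a -> b) minus accuracy(b -> a)."""
    return acc[(a, b)] - acc[(b, a)]


# Hypothetical toy results, not data from the paper.
records = [
    ("text", "vision", True),
    ("text", "vision", True),
    ("vision", "text", True),
    ("vision", "text", False),
    ("audio", "text", False),
    ("text", "audio", True),
]

acc = accuracy_by_composition(records)
print(COMPOSITIONS)
print(directional_gap(acc, "text", "vision"))
```

A positive gap for ("text", "vision") in this toy setup would mean the model answers more reliably when text is the context than when vision is, which is the kind of asymmetry finding (iii) describes.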

Related papers