Fugu-MT Paper Translation (Abstract): 11Plus-Bench: Demystifying Multimodal LLM Spatial Reasoning with Cognitive-Inspired Analysis

Paper overview: 11Plus-Bench: Demystifying Multimodal LLM Spatial Reasoning with Cognitive-Inspired Analysis

arxiv url: http://arxiv.org/abs/2508.20068v1
Date: Wed, 27 Aug 2025 17:22:34 GMT
Status: Translation complete
Last updated in system: 2025-08-28 19:07:41.722539
Title: 11Plus-Bench: Demystifying Multimodal LLM Spatial Reasoning with Cognitive-Inspired Analysis
Title (reference translation): 11Plus-Bench: Cognition-Inspired Multimodal LLM Spatial Reasoning
Authors: Chengzu Li, Wenshan Wu, Huanyu Zhang, Qingtao Li, Zeyu Gao, Yan Xia, José Hernández-Orallo, Ivan Vulić, Furu Wei
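Abstract summary: This work proposes a systematic evaluation framework for assessing the spatial reasoning abilities of state-of-the-art MLLMs. Experiments across 14 MLLMs and a human evaluation show that current MLLMs exhibit early signs of spatial cognition. These findings highlight both the emerging capabilities and the limitations of current MLLMs' spatial reasoning.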
Reference score (attention measure computed independently by this site): 54.24689751375923
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: For human cognitive process, spatial reasoning and perception are closely entangled, yet the nature of this interplay remains underexplored in the evaluation of multimodal large language models (MLLMs). While recent MLLM advancements show impressive performance on reasoning, their capacity for human-like spatial cognition remains an open question. In this work, we introduce a systematic evaluation framework to assess the spatial reasoning abilities of state-of-the-art MLLMs relative to human performance. Central to our work is 11Plus-Bench, a high-quality benchmark derived from realistic standardized spatial aptitude tests. 11Plus-Bench also features fine-grained expert annotations of both perceptual complexity and reasoning process, enabling detailed instance-level analysis of model behavior. Through extensive experiments across 14 MLLMs and human evaluation, we find that current MLLMs exhibit early signs of spatial cognition. Despite a large performance gap compared to humans, MLLMs' cognitive profiles resemble those of humans in that cognitive effort correlates strongly with reasoning-related complexity. However, instance-level performance in MLLMs remains largely random, whereas human correctness is highly predictable and shaped by abstract pattern complexity. These findings highlight both emerging capabilities and limitations in current MLLMs' spatial reasoning capabilities and provide actionable insights for advancing model design.
