Fugu-MT Paper Translation (Abstract): Holistic Evaluation of Multimodal LLMs on Spatial Intelligence

Paper abstract: Holistic Evaluation of Multimodal LLMs on Spatial Intelligence

arxiv url: http://arxiv.org/abs/2508.13142v2
Date: Mon, 13 Oct 2025 16:08:38 GMT
Status: Translation complete
Last updated in system: 2025-10-14 15:48:09.141132
Title: Holistic Evaluation of Multimodal LLMs on Spatial Intelligence
Title (reference translation): A Holistic Evaluation of Multimodal LLMs in Spatial Intelligence
Authors: Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Jiaqi Li, Xiangyu Fan, Hanming Deng, Lewei Lu, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang
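Abstract summary: GPT-5, said to be the most powerful AI model to date, still falls short of human performance across a broad range of spatial intelligence tasks. The authors also conduct a qualitative evaluation across a diverse set of scenarios that are intuitive for humans yet defeat even the most advanced multimodal models.
Reference score (independently computed attention score): 82.20514207247675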
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal models have achieved remarkable progress in recent years. Nevertheless, they continue to exhibit notable limitations in spatial understanding and reasoning, the very capability that anchors artificial general intelligence in the physical world. With the recent release of GPT-5, allegedly the most powerful AI model to date, it is timely to examine where the leading models (GPT, Gemini, Grok, Seed, Qwen, and Intern) stand on the path toward spatial intelligence. We first propose a holistic taxonomy of spatial tasks that unifies existing benchmarks and a standardized protocol for the fair evaluation of state-of-the-art proprietary and open-source models across eight key benchmarks, at a cost exceeding ten billion total tokens. Our empirical study then reveals that (1) GPT-5 demonstrates unprecedented strength in spatial intelligence (SI), yet (2) still falls short of human performance significantly across a broad spectrum of SI-tasks. Moreover, we (3) show that SI-tasks expose greater model capability deficiency than non-SI tasks, to the extent that (4) proprietary models do not exhibit a decisive advantage when facing the most difficult ones. In addition, we conduct a qualitative evaluation across a diverse set of scenarios that are intuitive for humans, yet fail even the most advanced multimodal models.
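Abstract (reference translation): Multimodal models have made remarkable progress in recent years. Nevertheless, they continue to show notable limitations in spatial understanding and reasoning. With the recent release of GPT-5, said to be the most powerful AI model to date, it is timely to examine where the leading models (GPT, Gemini, Grok, Seed, Qwen, and Intern) stand on the path toward spatial intelligence. We first propose a holistic taxonomy of spatial tasks that unifies existing benchmarks, together with a standardized protocol for the fair evaluation of state-of-the-art proprietary and open-source models across eight key benchmarks. Our experiments reveal that (1) GPT-5 demonstrates unprecedented strength in spatial intelligence (SI), yet (2) it still falls short of human performance across a wide variety of SI tasks. Moreover, we show that (3) SI tasks expose greater model capability deficiency than non-SI tasks, and that (4) proprietary models do not exhibit a decisive advantage when facing the most difficult tasks. In addition, we conduct a qualitative evaluation across a diverse set of scenarios that are intuitive for humans yet defeat even the most advanced multimodal models.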

Related papers