Fugu-MT Paper Translation (Abstract): From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs

Paper overview: From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs

arxiv url: http://arxiv.org/abs/2605.02130v1
Date: Mon, 04 May 2026 01:19:36 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:50.098084
Title: From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs
Title (reference translation): From Place to Purpose: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs
Authors: Le Zhang, Jihan Yang, Soundarya Krishnan, Jimit Majmudar, Xiou Ge, Prasoon Puri, Prathamesh Nandkishor Saraf, Shruti Bhargava, Dhivya Piraviperumal, Yinan Ling, Cindy Pan, Hong Yu, Aishwarya Agrawal, Bo-Hsiang Tseng
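Abstract summary: This paper introduces SFI-Bench, a video-based benchmark with over 1,500 expert-annotated questions. SFI-Bench systematically evaluates two complementary dimensions of advanced reasoning.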
Reference score (independently computed attention score): 19.943841049221625
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Human-level agentic intelligence extends beyond low-level geometric perception, evolving from recognizing where things are to understanding what they are for. While existing benchmarks effectively evaluate the geometric perception capabilities of multimodal large language models (MLLMs), they fall short of probing the higher-order cognitive abilities required for grounded intelligence. To address this gap, we introduce the Spatial-Functional Intelligence Benchmark (SFI-Bench), a video-based benchmark with over 1,500 expert-annotated questions derived from diverse egocentric indoor video scans. SFI-Bench systematically evaluates two complementary dimensions of advanced reasoning: (1) Structured Spatial Reasoning, which requires understanding complex layouts and forming coherent spatial representations, and (2) Functional Reasoning, which involves inferring object affordances and their context-dependent utility. The benchmark includes tasks such as conditional counting, multi-hop relational reasoning, functional pairing, and knowledge-grounded troubleshooting, directly challenging models to integrate perception, memory, and inference. Our experiments reveal that current MLLMs consistently struggle to combine spatial memory with functional reasoning and external knowledge, highlighting a critical bottleneck in achieving grounded intelligence. SFI-Bench therefore provides a diagnostic tool for measuring progress toward more cognitively capable and truly grounded multimodal agents.
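Abstract (reference translation): Human-level agentic intelligence goes beyond low-level geometric perception, evolving from recognizing where things are to understanding what they are for. Existing benchmarks effectively evaluate the geometric perception capabilities of multimodal large language models (MLLMs), but they fall short of probing the higher-order cognitive abilities that grounded intelligence requires. To address this gap, the paper introduces SFI-Bench (Spatial-Functional Intelligence Benchmark), a video-based benchmark with over 1,500 expert-annotated questions drawn from diverse egocentric indoor video scans. SFI-Bench systematically evaluates two complementary dimensions of advanced reasoning: (1) structured spatial reasoning, which requires understanding complex layouts and forming coherent spatial representations, and (2) functional reasoning, which involves inferring object affordances and their context-dependent utility. The benchmark includes tasks such as conditional counting, multi-hop relational reasoning, functional pairing, and knowledge-grounded troubleshooting, which directly challenge models to integrate perception, memory, and inference. The experiments show that current MLLMs consistently struggle to combine spatial memory with functional reasoning and external knowledge, highlighting a critical bottleneck in achieving grounded intelligence. SFI-Bench therefore provides a diagnostic tool for measuring progress toward more cognitively capable and truly grounded multimodal agents.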

Related papers