Fugu-MT Paper Translation (Abstract): ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

Paper overview: ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

arxiv url: http://arxiv.org/abs/2605.20837v1
Date: Wed, 20 May 2026 07:27:08 GMT
Status: Translation complete
Last updated in system: 2026-05-21 19:19:56.553671
Title: ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models
Title (reference translation): ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models
Authors: Qirui Shen, Wenda Wang, Jiachen Lu, Zilong Huang, Jin Bai, Lei He, Hongxuan Chen, Weixin Huang
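Abstract summary: ArchSIBench is a benchmark for architectural spatial intelligence grounded in perspectives from architecture, cognitive science, and psychology. It covers five core dimensions (perception, reasoning, navigation, transformation, and configuration) and comprises 17 fine-grained subtasks. The authors evaluate a range of vision-language models (VLMs) and show that the architectural spatial intelligence of most models differs significantly from human baselines.
Reference score (independently computed attention score): 16.656416066183887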
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Although extensive research has evaluated the basic spatial skills of Vision-Language Models (VLMs) such as relative orientation, distance comparison, and object counting, these tasks cover only the most elementary levels of spatial cognition and largely overlook higher-level cognition of architectural space, including layout understanding, circulation patterns, and functional zoning. In this work, we present ArchSIBench, a Benchmark for Architectural Spatial Intelligence based on the perspectives from architecture, cognitive science, and psychology. ArchSIBench covers five core dimensions: perception, reasoning, navigation, transformation, and configuration, comprising 17 fine-grained subtasks. Through careful manual annotation by experts with architectural backgrounds, we construct 3,000 question-answer pairs to enable comprehensive evaluation of architectural spatial intelligence. Based on ArchSIBench, we evaluate various VLMs and find that the architectural spatial intelligence of most models shows significant differences from human baselines; additionally, models exhibit substantial variability across capability dimensions. Some state-of-the-art models can approach the level of human evaluators without architectural training. However, a clear gap remains compared to human evaluators with architectural training, particularly in spatial transformation and configuration reasoning. We believe that ArchSIBench will provide important insights and systematic resources for measuring and advancing the architectural spatial intelligence of VLMs. The dataset and code are available at https://huggingface.co/datasets/ArchSIBench/ArchSIBench.
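Abstract (reference translation): Architectural spatial intelligence, the ability to recognize and reason about architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. The basic spatial skills of vision-language models (VLMs), such as relative orientation, distance comparison, and object counting, have been evaluated extensively, but these tasks cover only the most elementary levels of spatial cognition and largely overlook higher-level cognition of architectural space, including layout understanding, circulation patterns, and functional zoning. This paper introduces ArchSIBench, a benchmark for architectural spatial intelligence based on perspectives from architecture, cognitive science, and psychology. ArchSIBench covers five core dimensions (perception, reasoning, navigation, transformation, and configuration) and comprises 17 fine-grained subtasks. Through careful manual annotation by experts with architectural backgrounds, the authors construct 3,000 question-answer pairs that enable comprehensive evaluation of architectural spatial intelligence. Evaluating various VLMs on ArchSIBench, they find that the architectural spatial intelligence of most models differs significantly from human baselines, and that models vary substantially across capability dimensions. Some state-of-the-art models approach the level of human evaluators who have no architectural training. However, a clear gap remains relative to human evaluators with architectural training, particularly in spatial transformation and configuration reasoning. The authors expect ArchSIBench to provide important insights and systematic resources for measuring and advancing the architectural spatial intelligence of VLMs. The dataset and code are available at https://huggingface.co/datasets/ArchSIBench/ArchSIBench.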
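
The abstract reports that models vary substantially across the five capability dimensions. A minimal sketch of how per-dimension accuracy could be tallied is shown below. The record keys (`dimension`, `answer`, `prediction`), the case-insensitive string matching, and the toy records are all assumptions made for illustration; they are not the dataset's actual schema or the authors' scoring code.

```python
from collections import defaultdict

# The five core dimensions named in the abstract.
DIMENSIONS = ("perception", "reasoning", "navigation", "transformation", "configuration")


def accuracy_by_dimension(records):
    """Return {dimension: accuracy} over the given records.

    Each record is a dict with the hypothetical keys 'dimension',
    'answer', and 'prediction' (assumed for this sketch).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        dim = rec["dimension"]
        if dim not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dim!r}")
        total[dim] += 1
        # Assumed scoring rule: exact match after trimming and upper-casing.
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[dim] += 1
    # Only report dimensions that have at least one record.
    return {dim: correct[dim] / total[dim] for dim in DIMENSIONS if total[dim]}


# Toy records invented for illustration only (not taken from the dataset).
toy = [
    {"dimension": "perception", "answer": "A", "prediction": "A"},
    {"dimension": "perception", "answer": "B", "prediction": "C"},
    {"dimension": "configuration", "answer": "D", "prediction": "d "},
]
scores = accuracy_by_dimension(toy)
print(scores)
```

Reporting one score per dimension, rather than a single aggregate, is what makes gaps such as the one the abstract notes for transformation and configuration visible.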

Related papers