Fugu-MT Paper Translation (Abstract): PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models

Paper overview: PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models

arxiv url: http://arxiv.org/abs/2606.05744v1
Date: Thu, 04 Jun 2026 06:17:11 GMT
Status: Translation complete
Last updated in system: 2026-06-05 22:39:44.588186
Title: PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models
Title (reference translation): PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models
Authors: Minxin Chen, He Zhu, Junyou Su, Wen Wang, Yijie Deng, Wenjia Zhang,
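Abstract summary: We introduce PlanBench-V, the first comprehensive benchmark for evaluating Vision-Language Models (VLMs) in spatial planning map interpretation. We first build the Spatial Planning Map Database (SPMD), an expert-annotated dataset of 223 planning maps and 1629 question-answer pairs curated by professional planners. We then propose a theory-informed evaluation framework assessing four progressive capabilities: Perception, Reasoning, Association, and Implementation.
Reference score (independently computed attention score): 12.535782832272062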
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Spatial planning maps are central to territorial governance, translating planning objectives, regulations, and spatial strategies into visual forms for decision-making, public communication, and institutional coordination. Their interpretation, however, requires fine-grained visual perception, spatial reasoning, and policy-informed professional judgment, creating major challenges for both human learners and AI systems. With the rapid progress of Vision-Language Models (VLMs), their use in urban planning analysis is gaining attention, yet existing multimodal benchmarks mainly target general visual understanding and overlook the domain-specific cognitive processes of planning practice. To address this gap, we introduce PlanBench-V, the first comprehensive benchmark for evaluating VLMs in spatial planning map interpretation. We first build the Spatial Planning Map Database (SPMD), an expert-annotated dataset of 223 planning maps and 1629 question-answer pairs curated by professional planners, covering diverse geographic regions and cartographic styles. We then propose a theory-informed evaluation framework assessing four progressive capabilities: Perception, Reasoning, Association, and Implementation, corresponding to the cognitive pipeline of planning map interpretation. Extensive experiments across two generations of VLMs show clear progress but persistent limitations. The best 2026 agentic reasoning model, Qwen3.6-Plus, substantially outperforms the best 2025 model, GPT-4o, by 27%. Nevertheless, all models still struggle with implementation-oriented tasks requiring evaluative judgment, policy sensitivity, and constraint-aware decision-making. These findings reveal fundamental limitations of current VLMs in professional planning contexts and highlight the need for domain-adaptive multimodal reasoning frameworks. Code and data are available at https://plangpt.github.io.
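Abstract (reference translation): Spatial planning maps are central to territorial governance, translating planning objectives, regulations, and spatial strategies into visual forms for decision-making, public communication, and institutional coordination. Their interpretation, however, requires fine-grained visual perception, spatial reasoning, and policy-informed professional judgment, which creates major challenges for both human learners and AI systems. With the rapid progress of Vision-Language Models (VLMs), their use in urban planning analysis is attracting attention, but existing multimodal benchmarks mainly target general visual understanding and overlook the domain-specific cognitive processes of planning practice. To address this gap, we introduce PlanBench-V, the first comprehensive benchmark for evaluating VLMs in spatial planning map interpretation. We first build the Spatial Planning Map Database (SPMD), an expert-annotated dataset of 223 planning maps and 1629 question-answer pairs curated by professional planners and covering diverse geographic regions and cartographic styles. We then propose a theory-informed evaluation framework that assesses four progressive capabilities, Perception, Reasoning, Association, and Implementation, corresponding to the cognitive pipeline of planning map interpretation. Extensive experiments across two generations of VLMs show clear progress but persistent limitations. The best 2026 agentic reasoning model, Qwen3.6-Plus, outperforms the best 2025 model, GPT-4o, by 27%. Nevertheless, all models still struggle with implementation-oriented tasks that require evaluative judgment, policy sensitivity, and constraint-aware decision-making. These findings reveal fundamental limitations of current VLMs in professional planning contexts and highlight the need for domain-adaptive multimodal reasoning frameworks. Code and data are available at https://plangpt.github.io.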

Related papers