Fugu-MT Paper Translation (Abstract): Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes

Paper summary: Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes

arxiv url: http://arxiv.org/abs/2510.19400v1
Date: Wed, 22 Oct 2025 09:20:09 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:15.523206
Title: Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes
Title (reference translation): A Cross-View Perspective: Benchmarking the Spatial Reasoning of Vision-Language Models in Robotic Scenes
Authors: Zhiyuan Feng, Zhaolu Kang, Qijie Wang, Zhiying Du, Jiongrui Yan, Shubin Shi, Chengbo Yuan, Huizhi Liang, Yu Deng, Qixiu Li, Rushuai Yang, Arctanx An, Leqi Zheng, Weijie Wang, Shawn Chen, Sicheng Xu, Yaobo Liang, Jiaolong Yang, Baining Guo
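Abstract summary: Vision-language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments. Most evaluations of VLMs focus on single-view settings, leaving their ability to integrate multi-view information underexplored. This paper introduces MV-RoboBench, a benchmark for evaluating the multi-view spatial reasoning capabilities of VLMs in robotic manipulation.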
Reference score (independently computed attention score): 33.80107254496374
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments. They also serve as the foundation for the recent Vision-Language-Action (VLA) models. Yet most evaluations of VLMs focus on single-view settings, leaving their ability to integrate multi-view information underexplored. At the same time, multi-camera setups are increasingly standard in robotic platforms, as they provide complementary perspectives to mitigate occlusion and depth ambiguity. Whether VLMs can effectively leverage such multi-view inputs for robotic reasoning therefore remains an open question. To bridge this gap, we introduce MV-RoboBench, a benchmark specifically designed to evaluate the multi-view spatial reasoning capabilities of VLMs in robotic manipulation. MV-RoboBench consists of 1.7k manually curated QA items across eight subtasks, divided into two primary categories: spatial understanding and robotic execution. We evaluate a diverse set of existing VLMs, including both open-source and closed-source models, along with enhanced versions incorporating CoT-inspired techniques. The results show that state-of-the-art models remain far below human performance, underscoring the substantial challenges VLMs face in multi-view robotic perception. Additionally, our analysis uncovers two key findings: (i) spatial intelligence and robotic task execution are positively correlated in multi-view robotic scenarios; and (ii) strong performance on existing general-purpose single-view spatial understanding benchmarks does not reliably translate to success in the robotic spatial tasks assessed by our benchmark. We release MV-RoboBench as an open resource to foster progress in spatially grounded VLMs and VLAs, providing not only data but also a standardized evaluation protocol for multi-view embodied reasoning.
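Abstract (reference translation): Vision-language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments. They also serve as the foundation for recent Vision-Language-Action (VLA) models. However, most evaluations of VLMs focus on single-view settings, leaving their ability to integrate multi-view information underexplored. At the same time, multi-camera setups are becoming increasingly standard on robotic platforms. Whether VLMs can effectively leverage such multi-view inputs for robotic reasoning remains an open question. To bridge this gap, we introduce MV-RoboBench, a benchmark for evaluating the multi-view spatial reasoning capabilities of VLMs in robotic manipulation. MV-RoboBench consists of 1.7k manually curated QA items across eight subtasks, divided into two primary categories: spatial understanding and robotic execution. We evaluate a diverse set of existing VLMs, including both open-source and closed-source models, as well as enhanced versions incorporating CoT-inspired techniques. The results show that state-of-the-art models remain far below human performance, highlighting the substantial challenges VLMs face in multi-view robotic perception. In addition, the analysis reveals two key findings: (i) spatial intelligence and robotic task execution are positively correlated in multi-view robotic scenarios; and (ii) strong performance on existing general-purpose single-view spatial understanding benchmarks does not reliably translate to success on the robotic spatial tasks assessed by our benchmark. We release MV-RoboBench as an open resource to foster progress in spatially grounded VLMs and VLAs, providing not only data but also a standardized evaluation protocol for multi-view embodied reasoning.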

Related papers