Fugu-MT Paper Translation (Abstract): SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation

Paper overview: SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation

arxiv url: http://arxiv.org/abs/2509.08757v1
Date: Wed, 10 Sep 2025 16:47:00 GMT
Status: Translation complete
Last updated in system: 2025-09-11 15:16:52.517518
Title: SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation
Title (reference translation): SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation
Authors: Michael J. Munje, Chen Tang, Shuijing Liu, Zichao Hu, Yifeng Zhu, Jiaxun Cui, Garrett Warnell, Joydeep Biswas, Peter Stone
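Abstract summary: Social navigation in dynamic, human-centered environments requires socially compliant decisions grounded in robust scene understanding. Recent Vision-Language Models (VLMs) show promising capabilities that align with the nuanced requirements of social robot navigation. This paper introduces the Social Navigation Scene Understanding Benchmark (SocialNav-SUB).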
Reference score (independently computed attention score): 32.75496547879437
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Robot navigation in dynamic, human-centered environments requires socially compliant decisions grounded in robust scene understanding. Recent Vision-Language Models (VLMs) exhibit promising capabilities such as object recognition, common-sense reasoning, and contextual understanding. These capabilities align with the nuanced requirements of social robot navigation. However, it remains unclear whether VLMs can accurately understand complex social navigation scenes (e.g., inferring the spatial-temporal relations among agents and human intentions), which is essential for safe and socially compliant robot navigation. While some recent works have explored the use of VLMs in social robot navigation, no existing work systematically evaluates their ability to meet these necessary conditions. In this paper, we introduce the Social Navigation Scene Understanding Benchmark (SocialNav-SUB), a Visual Question Answering (VQA) dataset and benchmark designed to evaluate VLMs for scene understanding in real-world social robot navigation scenarios. SocialNav-SUB provides a unified framework for evaluating VLMs against human and rule-based baselines across VQA tasks requiring spatial, spatiotemporal, and social reasoning in social robot navigation. Through experiments with state-of-the-art VLMs, we find that while the best-performing VLM achieves an encouraging probability of agreeing with human answers, it still underperforms a simpler rule-based approach and human consensus baselines, indicating critical gaps in the social scene understanding of current VLMs. Our benchmark sets the stage for further research on foundation models for social robot navigation, offering a framework to explore how VLMs can be tailored to meet real-world social robot navigation needs. An overview of this paper along with the code and data can be found at https://larg.github.io/socialnav-sub .
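Abstract (reference translation): Robot navigation in dynamic, human-centered environments requires socially compliant decisions grounded in robust scene understanding. Recent Vision-Language Models (VLMs) show promising capabilities, such as object recognition, common-sense reasoning, and contextual understanding, that align with the nuanced requirements of social robot navigation. However, it remains unclear whether VLMs can accurately understand complex social navigation scenes (for example, inferring the spatiotemporal relations among agents and human intentions), which is essential for safe and socially compliant robot navigation. Although some recent studies have explored the use of VLMs in social robot navigation, none systematically evaluates their ability to meet these necessary conditions. This paper introduces the Social Navigation Scene Understanding Benchmark (SocialNav-SUB), a Visual Question Answering (VQA) dataset and benchmark for evaluating VLMs on scene understanding in real-world social robot navigation scenarios. SocialNav-SUB provides a unified framework for evaluating VLMs against human and rule-based baselines across VQA tasks that require spatial, spatiotemporal, and social reasoning in social robot navigation. Experiments with state-of-the-art VLMs show that although the best-performing VLM achieves an encouraging probability of agreeing with human answers, it still underperforms a simpler rule-based approach and human consensus baselines, indicating critical gaps in the social scene understanding of current VLMs. The benchmark sets the stage for further research on foundation models for social robot navigation and offers a framework for exploring how VLMs can be tailored to meet real-world social robot navigation needs. An overview of the paper, along with the code and data, can be found at https://larg.github.io/socialnav-sub .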


Related papers