Fugu-MT paper translation (abstract): Probing Prompt Design for Socially Compliant Robot Navigation with Vision Language Models

Paper abstract: Probing Prompt Design for Socially Compliant Robot Navigation with Vision Language Models

arxiv url: http://arxiv.org/abs/2601.14622v1
Date: Wed, 21 Jan 2026 03:45:33 GMT
Status: translation complete
Last updated in system: 2026-03-23 08:17:40.87469
Title: Probing Prompt Design for Socially Compliant Robot Navigation with Vision Language Models
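Title (reference translation): Prompt Design for Socially Adaptive Robot Navigation Using Vision Language Models
Authors: Ling Xiao, Toshihiko Yamasaki
Abstract summary: Language models are increasingly used for social robot navigation. Existing benchmarks have largely overlooked principled prompt design for socially compliant behavior. The paper studies prompt design along two dimensions: system guidance and motivational framing.
Reference score (independently calculated attention score): 31.097911935522674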
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Language models are increasingly used for social robot navigation, yet existing benchmarks largely overlook principled prompt design for socially compliant behavior. This limitation is particularly relevant in practice, as many systems rely on small vision language models (VLMs) for efficiency. Compared to large language models, small VLMs exhibit weaker decision-making capabilities, making effective prompt design critical for accurate navigation. Inspired by cognitive theories of human learning and motivation, we study prompt design along two dimensions: system guidance (action-focused, reasoning-oriented, and perception-reasoning prompts) and motivational framing, where models compete against humans, other AI systems, or their past selves. Experiments on two socially compliant navigation datasets reveal three key findings. First, for non-finetuned GPT-4o, competition against humans achieves the best performance, while competition against other AI systems performs worst. For finetuned models, competition against the model's past self yields the strongest results, followed by competition against humans, with performance further influenced by coupling effects among prompt design, model choice, and dataset characteristics. Second, inappropriate system prompt design can significantly degrade performance, even compared to direct finetuning. Third, while direct finetuning substantially improves semantic-level metrics such as perception, prediction, and reasoning, it yields limited gains in action accuracy. In contrast, our system prompts produce a disproportionately larger improvement in action accuracy, indicating that the proposed prompt design primarily acts as a decision-level constraint rather than a representational enhancement.
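Abstract (reference translation): Language models are increasingly used for social robot navigation, but existing benchmarks have largely overlooked principled prompt design for socially compliant behavior. This limitation is particularly relevant in practice, because many systems rely on small vision language models (VLMs) for efficiency. Compared with large language models, small VLMs show weaker decision-making capabilities, which makes effective prompt design critical for accurate navigation. Inspired by cognitive theories of human learning and motivation, we study prompt design along two dimensions: system guidance (action-focused, reasoning-oriented, and perception-reasoning prompts) and motivational framing (the model competes against humans, other AI systems, or its past self). Experiments on two socially compliant navigation datasets reveal three key findings. First, for non-finetuned GPT-4o, competition against humans achieves the best performance, while competition against other AI systems performs worst. For finetuned models, competition against the model's past self yields the strongest results, followed by competition against humans, and performance is further influenced by coupling effects among prompt design, model choice, and dataset characteristics. Second, inappropriate system prompt design can significantly degrade performance, even compared with direct finetuning. Third, while direct finetuning substantially improves semantic-level metrics such as perception, prediction, and reasoning, it yields only limited gains in action accuracy. In contrast, our system prompts produce a disproportionately larger improvement in action accuracy, indicating that the proposed prompt design acts primarily as a decision-level constraint rather than a representational enhancement.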

Related papers