Fugu-MT paper translation (summary): The Yes-Man Syndrome: Benchmarking Abstention in Embodied Robotic Agents

Paper summary: The Yes-Man Syndrome: Benchmarking Abstention in Embodied Robotic Agents

arxiv url: http://arxiv.org/abs/2605.20544v1
Date: Tue, 19 May 2026 22:32:44 GMT
Status: Translation complete
Last updated in system: 2026-05-21 19:19:56.402889
Title: The Yes-Man Syndrome: Benchmarking Abstention in Embodied Robotic Agents
Authors: Doguhan Yeke, Elif Su Temirel, Ananth Shreekumar, Brandon Lee, Dongyan Xu, Z Berkay Celik
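Abstract summary: Vision-language models (VLMs) are used as planners for embodied agents. The paper proposes a taxonomy for categorizing abstention in the context of robotics. It introduces RoboAbstention, a framework for generating abstention instructions grounded in images.
Reference score (attention score computed independently by this site): 14.695254264082273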
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-language models (VLMs) are used as high-level planners for embodied agents, translating natural language instructions and visual observations into action plans. While prior work has studied abstention in LLMs, existing benchmarks are largely text-only and do not capture the perceptual grounding and physical constraints inherent to embodied robotics environments. In such settings, abstention requires recognizing when instructions are ambiguous, physically infeasible, based on false premises, or otherwise unresolvable given the available sensory modalities and context. To address this gap, we introduce a taxonomy to categorize abstention in the context of embodied robotics and present RoboAbstention, a scalable and auditable framework for generating abstention instructions grounded in images gathered from five robotics datasets. RoboAbstention instantiates the taxonomy through a three-phase pipeline: (1) structured visual grounding, (2) deterministic constraint derivation, and (3) controlled instruction generation via category-specific templates. This enables the construction of a diverse dataset with verifiable abstention conditions. We evaluate several frontier VLMs and find that all models exhibit significant weaknesses in abstention, including those with advanced reasoning capabilities. The best-performing model, Gemini 2.5 Flash, abstains on only 39.0% of our 6,069 benchmark instructions, while the embodied planner Gemini Robotics ER 1.6 Preview abstains on just 16.5%. We further explore methods for improving abstention in VLM planners, such as defensive prompting and in-context learning, and find that these interventions substantially improve performance, reaching 93.6% abstention rate for Gemini Robotics ER 1.6 Preview and 88.6% for GPT 5.4 Mini, yet no approach fully solves the problem. We open-source RoboAbstention at https://purseclab.github.io/RoboAbstention/.
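To put the reported baseline rates in absolute terms, the following is a minimal sketch that converts them into approximate instruction counts. It assumes the abstention rate is simply abstained instructions divided by total instructions, and that both baseline percentages are computed over the full 6,069-instruction benchmark. The function names are hypothetical and are not part of RoboAbstention. The counts are approximate because the abstract reports rates rounded to one decimal place.

```python
def abstention_rate(abstained: int, total: int) -> float:
    """Percentage of instructions on which a model abstained (assumed definition)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * abstained / total


def approx_count(rate_percent: float, total: int) -> int:
    """Approximate number of abstentions implied by a rounded percentage."""
    return round(rate_percent / 100.0 * total)


# Benchmark size and baseline rates as reported in the abstract.
TOTAL_INSTRUCTIONS = 6069
BASELINE_RATES = {
    "Gemini 2.5 Flash": 39.0,
    "Gemini Robotics ER 1.6 Preview": 16.5,
}

# Approximate abstention counts implied by the rounded rates.
baseline_counts = {
    model: approx_count(rate, TOTAL_INSTRUCTIONS)
    for model, rate in BASELINE_RATES.items()
}

if __name__ == "__main__":
    for model, count in baseline_counts.items():
        rate = abstention_rate(count, TOTAL_INSTRUCTIONS)
        print(f"{model}: ~{count} of {TOTAL_INSTRUCTIONS} instructions ({rate:.1f}%)")
```

Under these assumptions, even the best-performing baseline model abstains on roughly 2,400 instructions and goes ahead on the remaining 3,700 or so. The rates after defensive prompting and in-context learning are left out, since the abstract does not state which instruction set they were measured on.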

Related papers