Fugu-MT Paper Translation (Abstract): IntentionNav: A Benchmark for Intent-Driven Object Navigation from Implicit Human Instruction

Paper overview: IntentionNav: A Benchmark for Intent-Driven Object Navigation from Implicit Human Instruction

arxiv url: http://arxiv.org/abs/2605.23187v1
Date: Fri, 22 May 2026 03:09:55 GMT
Status: Translation complete
Last updated in system: 2026-05-25 17:29:20.177572
Title: IntentionNav: A Benchmark for Intent-Driven Object Navigation from Implicit Human Instruction
Authors: Lin Qian, Shijie Li, Sihao Lin, Xuan Zhang, Bangya Liu, Yanran Li, Hujun Yin
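Abstract summary: IntentionNav is a diagnostic benchmark for active object search from implicit human instructions. It contains 500 intents over 176 Isaac Sim scenes and 64 target categories. Models identify the intended target in 48.3 percent of episodes and enter its 2 m neighborhood in 68.7 percent, but terminate successfully in only 24.9 percent and achieve grounded 1 m success in 5.5 percent.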
Reference score (attention score computed by this site): 17.341498923142595
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing object navigation benchmarks usually tell an embodied agent which object category to find, such as microwave or chair. Human-facing embodied AI is often asked something less direct: "I need something to warm this food" or "the room feels stuffy." The agent must infer the object that can satisfy the need, find a scene-grounded instance, and decide whether the goal has been reached. We study this setting as intent-driven object navigation and introduce IntentionNav, a diagnostic benchmark for active object search from implicit human instructions. Each episode provides a free-text intent, RGB-D observations, and pose, but withholds the target object name. IntentionNav contains 500 intents over 176 Isaac Sim scenes and 64 target categories. Each intent is rewritten in four controlled instruction styles and annotated with one of four intent modes, separating surface phrasing from semantic cue type under matched geometry. This paired design supports analysis of target inference, language robustness, neighborhood reachability, and terminal success rather than only aggregate success. We evaluated three VLMs using a fixed active-navigation agent. Models identify the intended target in 48.3 percent of episodes and enter its 2 m neighborhood in 68.7 percent, but terminate successfully in only 24.9 percent and achieve grounded 1 m success in 5.5 percent. Success is highest for event-script intents (28.7 percent) and lower for physical-state and affordance intents (19.2 percent and 18.5 percent), showing that indirect human intent remains a bottleneck for target selection, visual verification, and terminal localization in active embodied search.
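The stage-wise figures in the abstract (target identified, 2 m neighborhood entered, successful termination, grounded 1 m success) can be read as per-episode flags averaged over all episodes. The following is a minimal sketch of that bookkeeping, not the benchmark's own evaluation code. The `Episode` fields, the `summarize` helper, and the exact success criteria are assumptions made for illustration: the abstract does not define a log format, and here "terminal success" is taken to mean the agent stopped within 2 m of the target, while "grounded 1 m success" additionally requires stopping within 1 m with the correct target category inferred.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    # All field names are hypothetical; the abstract does not define a log format.
    target_category: str      # ground-truth category, withheld from the agent
    predicted_category: str   # category the model inferred from the intent
    min_distance_m: float     # closest approach to the target during the episode
    final_distance_m: float   # distance to the target when the episode ended
    stopped: bool             # whether the agent itself declared the goal reached


def rate(flags):
    """Fraction of True values, or 0.0 when there are no episodes."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def summarize(episodes):
    """Return the four stage-wise rates as fractions in [0, 1].

    The thresholds and the conditions combined below are assumed
    definitions, not the paper's confirmed criteria.
    """
    return {
        "target_identified": rate(
            e.predicted_category == e.target_category for e in episodes
        ),
        "entered_2m": rate(e.min_distance_m <= 2.0 for e in episodes),
        "terminal_success": rate(
            e.stopped and e.final_distance_m <= 2.0 for e in episodes
        ),
        "grounded_1m_success": rate(
            e.stopped
            and e.final_distance_m <= 1.0
            and e.predicted_category == e.target_category
            for e in episodes
        ),
    }


# Four made-up episodes, purely to exercise the function.
episodes = [
    Episode("microwave", "microwave", 0.5, 0.8, True),
    Episode("fan", "window", 1.5, 3.0, True),
    Episode("chair", "chair", 4.0, 4.0, False),
    Episode("kettle", "kettle", 1.2, 1.6, True),
]
summary = summarize(episodes)
```

Reporting the stages separately is what lets a gap like the one in the abstract show up: an agent can enter the 2 m neighborhood often and still rarely end the episode in a state that counts as success.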
