Fugu-MT Paper Translation (Abstract): P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation

Paper overview: P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation

arxiv url: http://arxiv.org/abs/2605.19634v1
Date: Tue, 19 May 2026 10:18:46 GMT
Status: Translation complete
Last updated in system: 2026-05-20 15:03:09.275774
Title: P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation
Title (reference translation): P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation
Authors: Kai Sheng, Liuyi Wang, Haojie Dai, Jinlong Li, Yongrui Qin, Zongtao He, Chengju Liu, Qijun Chen
Abstract summary: P2DNav is a hierarchical framework for zero-shot vision-and-language navigation. It consists of three core components: Panorama-to-Downview (P2D), Sliding-Window Dialogue Memory (SDM), and Reflective Reorientation Mechanism (RRM).
Reference score (independently computed attention metric): 30.45812977392826
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-and-language navigation (VLN) requires an embodied agent to ground natural-language instructions into executable navigation actions in unseen environments. Existing zero-shot methods typically rely on additional waypoint prediction modules, which often entangle high-level directional reasoning with fine-grained local grounding, leading to error-prone and unstable decisions. In this paper, we propose P2DNav, a hierarchical framework for zero-shot vision-and-language navigation. P2DNav consists of three core components: Panorama-to-Downview (P2D), Sliding-Window Dialogue Memory (SDM), and Reflective Reorientation Mechanism (RRM). P2D explicitly decomposes navigation decision-making into two stages: panoramic direction selection and downview local grounding. It first selects the instruction-relevant direction from a 360° panorama, and then predicts a pixel-level target point from the downview RGB observation in that direction. In addition, SDM organizes navigation history as a multi-turn dialogue context and maintains recent visual observations within a sliding window to support long-horizon navigation. RRM further enables reflective reorientation by assessing the reliability of local grounding based on the downview observation and returning to panoramic direction selection when necessary. Experiments on the R2R-CE benchmark show that P2DNav achieves strong performance among zero-shot methods. In particular, compared with the state-of-the-art (SOTA) zero-shot waypoint-based and waypoint-free methods, P2DNav achieves SR gains of 146.6% and 58.9%, respectively, demonstrating the effectiveness of P2D, SDM, and RRM for zero-shot VLN. Code will be released for public use.
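Abstract (reference translation): Vision-and-language navigation (VLN) requires an embodied agent to ground natural-language instructions in navigation actions that can be executed in unseen environments. Existing zero-shot methods usually depend on an additional waypoint prediction module, which often entangles high-level directional reasoning with fine-grained local grounding and so produces error-prone, unstable decisions. This paper proposes P2DNav, a hierarchical framework for zero-shot vision-and-language navigation. P2DNav consists of three core components: Panorama-to-Downview (P2D), Sliding-Window Dialogue Memory (SDM), and Reflective Reorientation Mechanism (RRM). P2D explicitly decomposes the navigation decision into two stages, panoramic direction selection and downview local grounding. It first selects the instruction-relevant direction from a 360° panorama, then predicts a pixel-level target point from the downview RGB observation in that direction. SDM organizes the navigation history as a multi-turn dialogue context and keeps recent visual observations within a sliding window to support long-horizon navigation. RRM enables reflective reorientation by assessing the reliability of local grounding from the downview observation and returning to panoramic direction selection when necessary. Experiments on the R2R-CE benchmark show that P2DNav performs strongly among zero-shot methods. In particular, compared with the state-of-the-art (SOTA) zero-shot waypoint-based and waypoint-free methods, P2DNav achieves SR gains of 146.6% and 58.9%, respectively, demonstrating the effectiveness of P2D, SDM, and RRM for zero-shot VLN. The code will be released publicly.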
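The two-stage decision with a reflective fallback described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (their code had not been released at the time of the abstract). Every name here (`SlidingWindowDialogueMemory`, `p2d_step`, `select_direction`, `ground_target`, `max_reorientations`) is hypothetical, the two model calls are replaced by stub functions, and excluding a rejected direction before re-selecting is an assumption about how a return to panoramic selection might work.

```python
from collections import deque
from typing import Callable, Dict, List, Optional, Tuple

Pixel = Tuple[int, int]


class SlidingWindowDialogueMemory:
    """Hypothetical SDM-like store: all dialogue turns are kept as text,
    but only the most recent `window` visual observations are retained."""

    def __init__(self, window: int = 3) -> None:
        self.turns: List[str] = []
        self.recent_images: deque = deque(maxlen=window)

    def add(self, text: str, image_id: str) -> None:
        self.turns.append(text)
        self.recent_images.append(image_id)  # oldest image is dropped when full


def p2d_step(
    downviews: Dict[int, str],
    select_direction: Callable[[Dict[int, str], SlidingWindowDialogueMemory], int],
    ground_target: Callable[[str, SlidingWindowDialogueMemory], Tuple[Pixel, bool]],
    memory: SlidingWindowDialogueMemory,
    max_reorientations: int = 2,
) -> Optional[Tuple[int, Pixel]]:
    """One navigation decision: pick a heading, ground a pixel target in that
    heading's downview, and re-select the heading if grounding looks unreliable."""
    rejected = set()
    for _ in range(max_reorientations + 1):
        candidates = {h: img for h, img in downviews.items() if h not in rejected}
        if not candidates:
            break
        heading = select_direction(candidates, memory)                # stage 1: panorama
        pixel, reliable = ground_target(candidates[heading], memory)  # stage 2: downview
        if reliable:
            memory.add(f"heading={heading} target={pixel}", candidates[heading])
            return heading, pixel
        rejected.add(heading)  # assumed fallback: go back to direction selection
    return None


# Stub models for illustration only; a real system would query a vision-language model.
def select_direction(candidates, memory):
    return min(candidates)  # always proposes the smallest remaining heading


def ground_target(image_id, memory):
    return (320, 400), image_id != "down_000"  # pretend heading 0 is unreliable


views = {0: "down_000", 90: "down_090", 180: "down_180", 270: "down_270"}
memory = SlidingWindowDialogueMemory(window=2)
result = p2d_step(views, select_direction, ground_target, memory)
print(result)  # (90, (320, 400))
```

With these stubs the first proposed heading (0) is rejected as unreliable, so the loop returns to direction selection and settles on heading 90. The memory keeps every text turn but only the last two image identifiers.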

Related papers