Fugu-MT Paper Translation (Abstract): SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning

Paper overview: SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning

arxiv url: http://arxiv.org/abs/2606.08992v1
Date: Mon, 08 Jun 2026 03:42:08 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:06.684182
Title: SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning
Title (reference translation): SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent Equipped with Online Spatial Cognitive Memory and Reasoning
Authors: Yucheng Deng, Pingrui Lai, Xinhai Li, Chenjia Bai, Xiaoheng Deng, Chengnuo Sun, Xuelong Li, Hua Yang
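Abstract summary: SpaceVLN is a navigation agent built around Spatial Cognitive Memory and Task-Guided Spatial Reasoning. Built on this memory, Spatial-CoT integrates task-progress reasoning with spatial perception, analysis, and prediction. Across R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON, SpaceVLN achieves state-of-the-art zero-shot performance.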
Reference score (independently computed attention score): 59.64305326980364
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-and-Language Navigation in continuous environments requires agents to understand the spatial structure of previously unseen environments in order to follow language instructions. Although foundation models have opened a promising path toward zero-shot navigation without task-specific policy training, many navigators still rely on local visual cues and linear history-based reasoning, overlooking the spatial nature of navigation across explored regions, traversed paths, landmarks, and their spatial relations. In this paper, we propose SpaceVLN, a navigation agent built around Spatial Cognitive Memory and Task-Guided Spatial Reasoning. Specifically, SpaceVLN introduces an efficient stagewise closed-loop framework where planning and execution are organized around verifiable space–landmark stages. During navigation, the agent progressively abstracts explored regions into Spatial Waypoints and dynamically maintains subtask-grounded landmark evidence, forming a hierarchical Spatial Cognitive Memory for progress localization and spatial-relation understanding. Built on this memory, Spatial-CoT integrates task-progress reasoning with spatial perception, analysis, and prediction, enabling Task-Guided Spatial Reasoning for embodied navigation. The unified stage interface enables SpaceVLN to address both Vision-and-Language Navigation and Object-Goal Navigation under a unified zero-shot setting, without task-specific policy training. Across R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON, SpaceVLN achieves state-of-the-art zero-shot performance, and real-robot deployment further validates its applicability. These results highlight Spatial Cognitive Memory and Task-Guided Spatial Reasoning as a practical foundation for stronger embodied navigation agents.
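Abstract (reference translation): In vision-and-language navigation in continuous environments, an agent must understand the spatial structure of previously unseen environments in order to follow language instructions. Foundation models have opened a promising path to zero-shot navigation without task-specific policy training, but many navigators still depend on local visual cues and reasoning over a linear history, overlooking the explored regions, traversed paths, landmarks, and the spatial relations among them. This paper proposes SpaceVLN, a navigation agent built around Spatial Cognitive Memory and Task-Guided Spatial Reasoning. Specifically, SpaceVLN introduces an efficient stagewise closed-loop framework that organizes planning and execution around verifiable space-landmark stages. During navigation, the agent gradually abstracts explored regions into Spatial Waypoints and dynamically maintains subtask-grounded landmark evidence, forming a hierarchical Spatial Cognitive Memory for progress localization and spatial-relation understanding. Built on this memory, Spatial-CoT integrates task-progress reasoning with spatial perception, analysis, and prediction, enabling Task-Guided Spatial Reasoning for embodied navigation. The unified stage interface lets SpaceVLN handle both Vision-and-Language Navigation and Object-Goal Navigation without task-specific policy training. Across R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON, SpaceVLN achieves state-of-the-art zero-shot performance, and real-robot deployment further validates its applicability. These results highlight Spatial Cognitive Memory and Task-Guided Spatial Reasoning as a practical foundation for stronger embodied navigation agents.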
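
Illustrative sketch (not from the paper): the abstract describes a hierarchical memory in which explored regions become Spatial Waypoints that carry landmark evidence, and stages are checked against that memory. The abstract gives no implementation details, so the Python below is only one possible reading under loud assumptions. The names `Waypoint`, `SpatialMemory`, `merge_radius`, `observe`, `locate`, and `stage_done`, the 2D coordinates, and the distance-based merge rule are all hypothetical.

```python
from dataclasses import dataclass, field
from math import hypot


@dataclass
class Waypoint:
    """Coarse summary of one explored region (hypothetical structure)."""
    x: float
    y: float
    # landmark label -> highest detection confidence seen near this waypoint
    landmarks: dict[str, float] = field(default_factory=dict)


class SpatialMemory:
    """Toy two-level memory: waypoints on top, landmark evidence underneath."""

    def __init__(self, merge_radius: float = 1.5):
        # Assumed threshold: observations closer than this share a waypoint.
        self.merge_radius = merge_radius
        self.waypoints: list[Waypoint] = []

    def _nearest(self, x: float, y: float):
        return min(self.waypoints,
                   key=lambda w: hypot(w.x - x, w.y - y),
                   default=None)

    def observe(self, x: float, y: float, detections: dict[str, float]) -> Waypoint:
        """Attach detections to the nearest waypoint, or open a new one."""
        wp = self._nearest(x, y)
        if wp is None or hypot(wp.x - x, wp.y - y) > self.merge_radius:
            wp = Waypoint(x, y)
            self.waypoints.append(wp)
        for label, conf in detections.items():
            wp.landmarks[label] = max(conf, wp.landmarks.get(label, 0.0))
        return wp

    def locate(self, label: str, min_conf: float = 0.5):
        """Return the waypoint with the strongest evidence for a landmark."""
        hits = [w for w in self.waypoints
                if w.landmarks.get(label, 0.0) >= min_conf]
        return max(hits, key=lambda w: w.landmarks[label], default=None)

    def stage_done(self, label: str, x: float, y: float,
                   min_conf: float = 0.5) -> bool:
        """Treat a stage as verified when the agent stands near the landmark's waypoint."""
        wp = self.locate(label, min_conf)
        return wp is not None and hypot(wp.x - x, wp.y - y) <= self.merge_radius
```

In this toy version, a planner would call `observe` after each step and `stage_done` before moving on to the next instruction stage. The actual SpaceVLN memory and the Spatial-CoT reasoning are presumably far richer than a confidence table keyed by label.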

Related papers