Fugu-MT Paper Translation (Abstract): AgentVLN: Towards Agentic Vision-and-Language Navigation

Paper overview: AgentVLN: Towards Agentic Vision-and-Language Navigation

arxiv url: http://arxiv.org/abs/2603.17670v1
Date: Wed, 18 Mar 2026 12:43:47 GMT
Status: Translation complete
Last updated in system: 2026-03-21 18:33:56.949927
Title: AgentVLN: Towards Agentic Vision-and-Language Navigation
Title (reference translation): AgentVLN: Towards Agentic Vision-and-Language Navigation
Authors: Zihao Xin, Wentong Li, Yixuan Jiang, Ziyuan Huang, Bin Wang, Piji Li, Jianke Zhu, Jie Qin, Shengjun Huang
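Abstract summary: Vision-and-Language Navigation (VLN) requires an embodied agent to ground complex natural-language instructions into long-horizon navigation in unseen environments. This paper proposes AgentVLN, a novel and efficient embodied navigation framework that can be deployed on edge computing platforms.
Reference score (independently computed attention score): 78.739525400071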
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Vision-and-Language Navigation (VLN) requires an embodied agent to ground complex natural-language instructions into long-horizon navigation in unseen environments. While Vision-Language Models (VLMs) offer strong 2D semantic understanding, current VLN systems remain constrained by limited spatial perception, 2D-3D representation mismatch, and monocular scale ambiguity. In this paper, we propose AgentVLN, a novel and efficient embodied navigation framework that can be deployed on edge computing platforms. We formulate VLN as a Partially Observable Semi-Markov Decision Process (POSMDP) and introduce a VLM-as-Brain paradigm that decouples high-level semantic reasoning from perception and planning via a plug-and-play skill library. To resolve multi-level representation inconsistency, we design a cross-space representation mapping that projects perception-layer 3D topological waypoints into the image plane, yielding pixel-aligned visual prompts for the VLM. Building on this bridge, we integrate a context-aware self-correction and active exploration strategy to recover from occlusions and suppress error accumulation over long trajectories. To further address the spatial ambiguity of instructions in unstructured environments, we propose a Query-Driven Perceptual Chain-of-Thought (QD-PCoT) scheme, endowing the agent with the metacognitive ability to actively seek geometric depth information. Finally, we construct AgentVLN-Instruct, a large-scale instruction-tuning dataset with dynamic stage routing conditioned on target visibility. Extensive experiments show that AgentVLN consistently outperforms prior state-of-the-art (SOTA) methods on long-horizon VLN benchmarks, offering a practical paradigm for lightweight deployment of next-generation embodied navigation models. Code: https://github.com/Allenxinn/AgentVLN.
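Abstract (reference translation): Vision-and-Language Navigation (VLN) requires an embodied agent to ground complex natural-language instructions into long-horizon navigation in unseen environments. Vision-Language Models (VLMs) provide strong 2D semantic understanding, but current VLN systems are constrained by limited spatial perception, a mismatch between 2D and 3D representations, and monocular scale ambiguity. This paper proposes AgentVLN, a novel and efficient embodied navigation framework that can be deployed on edge computing platforms. VLN is formulated as a Partially Observable Semi-Markov Decision Process (POSMDP), and a VLM-as-Brain paradigm is introduced that separates high-level semantic reasoning from perception and planning through a plug-and-play skill library. To resolve inconsistency across representation levels, a cross-space representation mapping projects the perception layer's 3D topological waypoints into the image plane, producing pixel-aligned visual prompts for the VLM. On top of this mapping, a context-aware self-correction and active exploration strategy is integrated to recover from occlusions and suppress error accumulation over long trajectories. To further address the spatial ambiguity of instructions in unstructured environments, a Query-Driven Perceptual Chain-of-Thought (QD-PCoT) scheme is proposed, giving the agent the metacognitive ability to actively seek geometric depth information. Finally, AgentVLN-Instruct is constructed, a large-scale instruction-tuning dataset with dynamic stage routing conditioned on target visibility. Extensive experiments show that AgentVLN consistently outperforms prior state-of-the-art (SOTA) methods on long-horizon VLN benchmarks and offers a practical paradigm for lightweight deployment of next-generation embodied navigation models. Code: https://github.com/Allenxinn/AgentVLN.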
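
The abstract describes projecting 3D waypoints into the image plane to obtain pixel-aligned visual prompts, without giving the formulation. The sketch below illustrates the general idea under a standard pinhole camera model; this is an assumption and not necessarily the mapping AgentVLN uses. The function name `project_waypoints`, the camera-frame convention (x right, y down, z forward), and the intrinsics values are all hypothetical and do not come from the paper or its repository.

```python
from typing import List, Optional, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[float, float]


def project_waypoints(
    waypoints_cam: List[Point3D],
    fx: float, fy: float, cx: float, cy: float,
    width: int, height: int,
) -> List[Optional[Pixel]]:
    """Project camera-frame 3D points to pixel coordinates (pinhole model).

    Assumed convention: x points right, y points down, z points forward.
    Returns None for points behind the camera or outside the image bounds.
    """
    pixels: List[Optional[Pixel]] = []
    for x, y, z in waypoints_cam:
        if z <= 0.0:
            # Point is at or behind the camera plane: not visible.
            pixels.append(None)
            continue
        # Perspective division followed by the intrinsic scaling and offset.
        u = fx * x / z + cx
        v = fy * y / z + cy
        if 0.0 <= u < width and 0.0 <= v < height:
            pixels.append((u, v))
        else:
            pixels.append(None)
    return pixels


if __name__ == "__main__":
    # Hypothetical intrinsics for a 640x480 camera.
    waypoints = [(0.0, 0.0, 2.0), (1.0, 0.5, 4.0), (0.0, 0.0, -1.0)]
    print(project_waypoints(waypoints, 500.0, 500.0, 320.0, 240.0, 640, 480))
```

Waypoints that project to `None` are not visible in the current frame. In a pipeline like the one the abstract outlines, those would presumably be candidates for the exploration or self-correction step rather than for a visual prompt.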

Related papers