Fugu-MT Paper Translation (Abstract): Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

Paper overview: Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

arxiv url: http://arxiv.org/abs/2604.17473v1
Date: Sun, 19 Apr 2026 15:03:38 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.543353
Title: Dual-Anchoring: Addressing State Drift in Vision-Language Navigation
Title (reference translation): Dual-Anchoring: Addressing State Drift in Vision-Language Navigation
Authors: Kangyi Wu, Pengna Li, Kailin Lyu, Lin Zhao, Qingrong He, Jinjun Wang, Jianyi Liu
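Abstract summary: Vision-Language Navigation (VLN) requires an agent to navigate 3D environments by following natural language instructions. Recent Video Large Language Models (Video-LLMs) have largely advanced VLN, but they remain susceptible to State Drift in long scenarios. This paper proposes a Dual-Anchoring Framework that explicitly anchors instruction progress and history representations.
Reference score (independently computed attention score): 16.424156408535637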
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language Navigation (VLN) requires an agent to navigate through 3D environments by following natural language instructions. While recent Video Large Language Models (Video-LLMs) have largely advanced VLN, they remain highly susceptible to State Drift in long scenarios. In these cases, the agent's internal state drifts away from the true task execution state, leading to aimless wandering and failure to execute essential maneuvers in the instruction. We attribute this failure to two distinct cognitive deficits: Progress Drift, where the agent fails to distinguish completed sub-goals from remaining ones, and Memory Drift, where the agent's history representations degrade, making it lose track of visited landmarks. In this paper, we propose a Dual-Anchoring Framework that explicitly anchors the instruction progress and history representations. First, to address Progress Drift, we introduce Instruction Progress Anchoring, which supervises the agent to generate structured text tokens that delineate completed versus remaining sub-goals. Second, to mitigate Memory Drift, we propose Memory Landmark Anchoring, which utilizes a Landmark-Centric World Model to retrospectively predict object-centric embeddings extracted by the Segment Anything Model, compelling the agent to explicitly verify past observations and preserve distinct representations of visited landmarks. To support this framework, we curate two extensive datasets: 3.6 million samples with explicit progress descriptions, and 937k grounded landmark samples for retrospective verification. Extensive experiments in both simulation and real-world environments demonstrate the superiority of our method, achieving a 15.2% improvement in Success Rate and a remarkable 24.7% gain on long-horizon trajectories. To facilitate further research, we will release our code, data generation pipelines, and the collected datasets.
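Abstract (reference translation): Vision-Language Navigation (VLN) requires an agent that navigates 3D environments by following natural language instructions. Recent Video Large Language Models (Video-LLMs) have largely advanced VLN, but they remain susceptible to State Drift in long scenarios. In such cases, the agent's internal state deviates from the true task execution state, which leads to aimless wandering and failure to execute maneuvers that are essential to the instruction. The authors attribute this failure to two deficits: Progress Drift, in which the agent cannot distinguish completed sub-goals from remaining ones, and Memory Drift, in which the agent's history representations degrade and it loses track of visited landmarks. The paper proposes a Dual-Anchoring Framework that explicitly anchors instruction progress and history representations. First, to address Progress Drift, it introduces Instruction Progress Anchoring, which supervises the agent to generate structured text tokens that separate completed from remaining sub-goals. Second, to mitigate Memory Drift, it proposes Memory Landmark Anchoring, which uses a Landmark-Centric World Model to retrospectively predict object-centric embeddings extracted by the Segment Anything Model, prompting the agent to explicitly verify past observations and to preserve distinct representations of visited landmarks. To support the framework, the authors curate two extensive datasets: 3.6 million samples with explicit progress descriptions, and 937k grounded landmark samples for retrospective verification. Extensive experiments in both simulation and real-world environments demonstrate the superiority of the method, with a 15.2% improvement in Success Rate and a 24.7% gain on long-horizon trajectories. To facilitate further research, the code, data generation pipelines, and collected datasets will be released.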


Related papers