Fugu-MT Paper Translation (Abstract): Policy-Guided World Model Planning for Language-Conditioned Visual Navigation

Paper summary: Policy-Guided World Model Planning for Language-Conditioned Visual Navigation

arxiv url: http://arxiv.org/abs/2603.25981v1
Date: Thu, 26 Mar 2026 23:47:49 GMT
Status: Translation complete
Last updated in system: 2026-03-30 21:49:48.316577
Title: Policy-Guided World Model Planning for Language-Conditioned Visual Navigation
Title (reference translation): Policy-Guided World Model Planning for Language-Conditioned Visual Navigation
Authors: Amirhosein Chahe, Lifeng Zhou
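Abstract summary: We propose PiJEPA, a two-stage framework that combines the strengths of learned navigation policies with latent world model planning for instruction-conditioned visual navigation. In experiments on real-world navigation tasks, PiJEPA significantly outperforms both standalone policy execution and uninformed world model planning.
Reference score (independently computed attention score): 12.836371451772438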
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Navigating to a visually specified goal given natural language instructions remains a fundamental challenge in embodied AI. Existing approaches either rely on reactive policies that struggle with long-horizon planning, or employ world models that suffer from poor action initialization in high-dimensional spaces. We present PiJEPA, a two-stage framework that combines the strengths of learned navigation policies with latent world model planning for instruction-conditioned visual navigation. In the first stage, we finetune an Octo-based generalist policy, augmented with a frozen pretrained vision encoder (DINOv2 or V-JEPA-2), on the CAST navigation dataset to produce an informed action distribution conditioned on the current observation and language instruction. In the second stage, we use this policy-derived distribution to warm-start Model Predictive Path Integral (MPPI) planning over a separately trained JEPA world model, which predicts future latent states in the embedding space of the same frozen encoder. By initializing the MPPI sampling distribution from the policy prior rather than from an uninformed Gaussian, our planner converges faster to high-quality action sequences that reach the goal. We systematically study the effect of the vision encoder backbone, comparing DINOv2 and V-JEPA-2, across both the policy and world model components. Experiments on real-world navigation tasks demonstrate that PiJEPA significantly outperforms both standalone policy execution and uninformed world model planning, achieving improved goal-reaching accuracy and instruction-following fidelity.
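Abstract (reference translation): Navigating to a visually specified goal given natural language instructions remains a fundamental challenge in embodied AI. Existing approaches either rely on reactive policies that struggle with long-horizon planning, or employ world models that suffer from poor action initialization in high-dimensional spaces. We propose PiJEPA, a two-stage framework that combines the strengths of learned navigation policies with latent world model planning for instruction-conditioned visual navigation. In the first stage, an Octo-based generalist policy, augmented with a frozen pretrained vision encoder (DINOv2 or V-JEPA-2), is finetuned on the CAST navigation dataset to produce an informed action distribution conditioned on the current observation and language instruction. In the second stage, this policy-derived distribution is used to warm-start Model Predictive Path Integral (MPPI) planning over a separately trained JEPA world model, which predicts future latent states in the embedding space of the same frozen encoder. By initializing the MPPI sampling distribution from the policy prior rather than from an uninformed Gaussian, the planner converges faster to high-quality action sequences that reach the goal. The study systematically examines the effect of the vision encoder backbone, comparing DINOv2 and V-JEPA-2, across both the policy and world model components. Experiments on real-world navigation tasks show that PiJEPA significantly outperforms both standalone policy execution and uninformed world model planning, achieving improved goal-reaching accuracy and instruction-following fidelity.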
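
The warm-start idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the additive 2-D "world model", the distance cost, the goal, the horizon, the temperature, and the stand-in policy prior (`policy_mean`, standard deviation 0.3) are all assumptions chosen only to keep the example small. It runs one MPPI iteration twice with the same sample budget, once starting from a policy-like mean and once from a zero-mean Gaussian, and prints how far each resulting plan ends from the goal.

```python
import numpy as np

def mppi_plan(dynamics, cost, init_mean, init_std, z0,
              n_samples=256, n_iters=1, temperature=0.5, seed=0):
    """Generic MPPI over action sequences of shape (horizon, action_dim)."""
    rng = np.random.default_rng(seed)
    mean = np.array(init_mean, dtype=float)
    std = np.broadcast_to(np.asarray(init_std, dtype=float), mean.shape).copy()
    for _ in range(n_iters):
        # Sample candidate action sequences around the current distribution.
        noise = rng.standard_normal((n_samples,) + mean.shape)
        actions = mean + std * noise          # (n_samples, horizon, action_dim)
        # Roll every candidate out through the (toy) latent dynamics.
        z = np.repeat(z0[None, :], n_samples, axis=0)
        for t in range(mean.shape[0]):
            z = dynamics(z, actions[:, t])
        costs = cost(z)                       # terminal cost, (n_samples,)
        # Softmin weighting of candidates, then refit mean and std.
        weights = np.exp(-(costs - costs.min()) / temperature)
        weights /= weights.sum()
        mean = np.einsum("n,nha->ha", weights, actions)
        var = np.einsum("n,nha->ha", weights, (actions - mean) ** 2)
        std = np.sqrt(var) + 1e-3
    return mean

# --- Toy setup (all values are illustrative assumptions) ---
GOAL = np.array([5.0, 0.0])   # hypothetical goal in a 2-D "latent" space
HORIZON = 5

def toy_dynamics(z, a):
    # Stand-in for a learned latent predictor: next state = state + action.
    return z + a

def goal_cost(z):
    # Distance between the predicted latent state and the goal.
    return np.linalg.norm(z - GOAL, axis=-1)

def terminal_distance(plan):
    # Execute a plan in the toy dynamics and measure distance to the goal.
    z = np.zeros(2)
    for a in plan:
        z = toy_dynamics(z, a)
    return float(goal_cost(z))

z0 = np.zeros(2)
policy_mean = np.tile([0.8, 0.0], (HORIZON, 1))  # stand-in for a policy prior
zero_mean = np.zeros((HORIZON, 2))               # uninformed initialization

warm = mppi_plan(toy_dynamics, goal_cost, policy_mean, 0.3, z0)
cold = mppi_plan(toy_dynamics, goal_cost, zero_mean, 0.3, z0)
print("warm-start distance to goal:", terminal_distance(warm))
print("zero-mean distance to goal: ", terminal_distance(cold))
```

In this toy setting the zero-mean run can only move as far as its samples reach in one iteration, so it needs more iterations to cover the same distance, while the prior-centered run already samples near the goal. A real system would replace `toy_dynamics` with a learned latent predictor and `policy_mean` with the policy's predicted action distribution.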
