Fugu-MT paper translation (abstract): WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation

Paper overview: WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation

arxiv url: http://arxiv.org/abs/2605.15964v1
Date: Fri, 15 May 2026 13:55:39 GMT
Status: translation complete
Last updated in system: 2026-05-18 17:44:16.333679
Title: WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation
Title (reference translation): WorldVLN: An Autoregressive World Action Model for Aerial Vision-Language Navigation
Authors: Baining Zhao, Jiacheng Xu, Weicheng Feng, Xin Zhang, Zhaolu Wang, Haoyang Wang, Shilong Ji, Ziyou Wang, Jianjie Fang, Zhiheng Zheng, Weichen Zhang, Yu Shang, Wei Wu, Chen Gao, Xinlei Chen, Yong Li
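Abstract summary: We propose WorldVLN, the first autoregressive world action model for aerial VLN. WorldVLN adapts a latent autoregressive video backbone to predict short-horizon world-state transitions. WorldVLN consistently outperforms existing Vision-Language-Action baselines.
Reference score (independently computed attention score): 31.224842983083803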
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Aerial vision-language navigation (VLN) requires agents to follow natural-language instructions through closed-loop perception and action in 3D environments. We argue that aerial VLN can be formulated as a prediction-driven world-action problem: the agent should anticipate latent world evolution and act according to the predicted consequences. To this end, we propose WorldVLN, the first autoregressive world action model for aerial VLN. Unlike full-sequence video-generation world models that generate an entire visual clip, WorldVLN adapts a latent autoregressive video backbone to predict short-horizon world-state transitions and directly decodes them into executable waypoint actions. After each action segment is executed, newly received observations are encoded back into the autoregressive context, enabling closed-loop world-action prediction. We further introduce a two-stage training framework that first grounds the video prior in instruction-conditioned navigation dynamics and then develops Action-aware GRPO, the first reinforcement learning method tailored to autoregressive WAMs, to optimize waypoint decisions through their downstream rollout consequences. On public outdoor and indoor benchmarks, WorldVLN consistently outperforms existing Vision-Language-Action baselines with 12%+ success-rate gains and larger advantages on challenging cases. It further transfers zero-shot to real drone deployment, suggesting that the proposed WorldVLN offers a promising route for spatial action tasks. Demos and code are available at https://embodiedcity.github.io/WorldVLN/.
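The closed loop described in the abstract (predict a short-horizon transition, decode it into waypoints, execute the segment, encode the new observation back into the context) can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the names ClosedLoopNavigator, encode, predict, decode, and run_episode are hypothetical, and the three callables are stand-ins for whatever encoder, video backbone, and action decoder the real system uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical waypoint type: an (x, y, z) position. The abstract does not
# specify the actual waypoint representation.
Waypoint = Tuple[float, float, float]


@dataclass
class ClosedLoopNavigator:
    # All three callables are illustrative placeholders, not WorldVLN's API.
    encode: Callable[[object], list]          # observation -> latent tokens
    predict: Callable[[list, str], list]      # (context, instruction) -> predicted latent transition
    decode: Callable[[list], List[Waypoint]]  # predicted latents -> short waypoint segment
    context: list = field(default_factory=list)

    def step(self, observation: object, instruction: str) -> List[Waypoint]:
        # Encode the newest observation back into the autoregressive context.
        self.context.extend(self.encode(observation))
        # Predict only a short-horizon transition, not a full video clip.
        predicted = self.predict(self.context, instruction)
        # Decode the predicted transition directly into executable waypoints.
        return self.decode(predicted)


def run_episode(nav, env_reset, env_execute, instruction, max_segments=10):
    """Alternate prediction and execution until the environment reports done."""
    observation = env_reset()
    trajectory: List[Waypoint] = []
    for _ in range(max_segments):
        waypoints = nav.step(observation, instruction)
        trajectory.extend(waypoints)
        # env_execute (hypothetical) flies the segment and returns (observation, done).
        observation, done = env_execute(waypoints)
        if done:
            break
    return trajectory
```

The point of the sketch is the control flow: each action segment is conditioned on real observations gathered after the previous segment, rather than on an open-loop rollout of the model's own predictions.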
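Abstract (reference translation): Aerial vision-language navigation (VLN) requires an agent to follow natural-language instructions through closed-loop perception and action in 3D environments. We argue that aerial VLN can be formulated as a prediction-driven world-action problem: the agent should anticipate latent world evolution and act according to the predicted consequences. To this end, we propose WorldVLN, the first autoregressive world action model for aerial VLN. Unlike full-sequence video-generation world models that generate an entire visual clip, WorldVLN adapts a latent autoregressive video backbone to predict short-horizon world-state transitions and decodes them directly into executable waypoint actions. After each action segment is executed, newly received observations are encoded back into the autoregressive context, enabling closed-loop world-action prediction. We further introduce a two-stage training framework that first grounds the video prior in instruction-conditioned navigation dynamics and then develops Action-aware GRPO, the first reinforcement learning method tailored to autoregressive WAMs, to optimize waypoint decisions through their downstream rollout consequences. On public outdoor and indoor benchmarks, WorldVLN consistently outperforms existing Vision-Language-Action baselines with success-rate gains of more than 12% and larger advantages on challenging cases. It also transfers zero-shot to real drone deployment, suggesting that the proposed WorldVLN offers a promising route for spatial action tasks. Demos and code are available at https://embodiedcity.github.io/WorldVLN/.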
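The abstract does not say how Action-aware GRPO differs from standard GRPO, so the sketch below shows only the generic group-relative advantage step that GRPO-style methods share: several rollouts are sampled for the same input, and each rollout's reward is normalized against the group's mean and standard deviation. The function name group_relative_advantages, the eps parameter, and the use of the population standard deviation are assumptions made for illustration, and the rewards are imagined as rollout outcomes such as navigation success.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each rollout reward against its own sampling group.

    Generic GRPO-style advantage, not the paper's Action-aware variant:
    advantage_i = (reward_i - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division finite when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four rollouts of one instruction, reward 1.0 for success, 0.0 for failure.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful rollouts receive positive advantages and failed ones negative advantages, so no separately learned value function is needed to rank waypoint decisions within a group.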

Related papers