Fugu-MT Paper Translation (Abstract): Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation

Paper abstract: Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation

arxiv url: http://arxiv.org/abs/2602.05827v1
Date: Thu, 05 Feb 2026 16:16:13 GMT
Status: Translation complete
System update date: 2026-02-06 18:49:09.032306
Title: Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation
Title (reference translation): Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation
Authors: Hai Zhang, Siqi Liang, Li Chen, Yuxian Li, Yukuan Xu, Yichao Zhong, Fu Zhang, Hongyang Li
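Abstract summary: In Beyond-the-View Navigation (BVN), agents must locate distant, unseen targets without dense, step-by-step guidance. Existing large language model (LLM)-based methods often suffer from short-sighted behaviors because they rely on short-horizon supervision. We propose SparseVideoNav, which achieves sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon.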
Reference score (independently computed attention score): 18.136190060725102
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Why must vision-language navigation be bound to detailed and verbose language instructions? While such details ease decision-making, they fundamentally contradict the goal for navigation in the real world. Ideally, agents should possess the autonomy to navigate in unknown environments guided solely by simple and high-level intents. Realizing this ambition introduces a formidable challenge: Beyond-the-View Navigation (BVN), where agents must locate distant, unseen targets without dense and step-by-step guidance. Existing large language model (LLM)-based methods, though adept at following dense instructions, often suffer from short-sighted behaviors due to their reliance on short-horizon supervision. Simply extending the supervision horizon, however, destabilizes LLM training. In this work, we identify that video generation models inherently benefit from long-horizon supervision to align with language instructions, rendering them uniquely suitable for BVN tasks. Capitalizing on this insight, we propose introducing the video generation model into this field for the first time. Yet, the prohibitive latency for generating videos spanning tens of seconds makes real-world deployment impractical. To bridge this gap, we propose SparseVideoNav, achieving sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon. This yields a remarkable 27x speed-up compared to the unoptimized counterpart. Extensive real-world zero-shot experiments demonstrate that SparseVideoNav achieves 2.5x the success rate of state-of-the-art LLM baselines on BVN tasks and marks the first realization of such capability in challenging night scenes.
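Abstract (reference translation): Why must vision-language navigation be tied to detailed, verbose language instructions? Such details make decision-making easier, but they fundamentally conflict with the goal of navigation in the real world. Ideally, agents should have the autonomy to navigate unknown environments guided only by simple, high-level intents. Realizing this raises a formidable challenge: Beyond-the-View Navigation (BVN), in which agents must locate distant, unseen targets without dense, step-by-step guidance. Existing large language model (LLM)-based methods are adept at following dense instructions, but they often suffer from short-sighted behavior because they rely on short-horizon supervision. Simply extending the supervision horizon, however, destabilizes LLM training. In this work, we identify that video generation models inherently benefit from long-horizon supervision for aligning with language instructions, which makes them uniquely suited to BVN tasks. Building on this insight, we propose introducing a video generation model into this field for the first time. However, the prohibitive latency of generating videos that span tens of seconds makes real-world deployment impractical. To bridge this gap, we propose SparseVideoNav, which achieves sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon. This yields a 27x speed-up over the unoptimized counterpart. Extensive real-world zero-shot experiments show that SparseVideoNav achieves 2.5x the success rate of state-of-the-art LLM baselines on BVN tasks and marks the first realization of this capability in challenging night scenes.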

Related papers