Fugu-MT Paper Translation (Abstract): FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation

Paper overview: FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation

arxiv url: http://arxiv.org/abs/2601.13976v2
Date: Fri, 23 Jan 2026 08:44:34 GMT
Status: Translation complete
Last updated in system: 2026-01-26 14:27:27.315778
Title: FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation
Title (reference translation): FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation
Authors: Jing Zuo, Lingzhou Mu, Fan Jiang, Chengcheng Ma, Mu Xu, Yonggang Qi
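Abstract summary: Vision-and-Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual-spatial context. Recent work shows the potential of Chain-of-Thought (CoT) reasoning for improving interpretability and long-horizon planning. The paper proposes FantasyVLN, an implicit reasoning framework that preserves the benefits of CoT reasoning without explicit token overhead.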
Reference score (independently computed attention score): 11.18316873483782
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Achieving human-level performance in Vision-and-Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual-spatial context while reasoning over long action sequences. Recent works, such as NavCoT and NavGPT-2, demonstrate the potential of Chain-of-Thought (CoT) reasoning for improving interpretability and long-horizon planning. Moreover, multimodal extensions like OctoNav-R1 and CoT-VLA further validate CoT as a promising pathway toward human-like navigation reasoning. However, existing approaches face critical drawbacks: purely textual CoTs lack spatial grounding and easily overfit to sparse annotated reasoning steps, while multimodal CoTs incur severe token inflation by generating imagined visual observations, making real-time navigation impractical. In this work, we propose FantasyVLN, a unified implicit reasoning framework that preserves the benefits of CoT reasoning without explicit token overhead. Specifically, imagined visual tokens are encoded into a compact latent space using a pretrained Visual AutoRegressor (VAR) during CoT reasoning training, and the model jointly learns from textual, visual, and multimodal CoT modes under a unified multi-CoT strategy. At inference, our model performs direct instruction-to-action mapping while still enjoying reasoning-aware representations. Extensive experiments on LH-VLN show that our approach achieves reasoning-aware yet real-time navigation, improving success rates and efficiency while reducing inference latency by an order of magnitude compared to explicit CoT methods.
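Abstract (reference translation): Achieving human-level performance in Vision-and-Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual-spatial context while reasoning over long action sequences. Recent studies such as NavCoT and NavGPT-2 demonstrate the potential of Chain-of-Thought (CoT) reasoning for improving interpretability and long-horizon planning. In addition, multimodal extensions such as OctoNav-R1 and CoT-VLA further validate CoT as a promising route toward human-like navigation reasoning. However, existing approaches face serious drawbacks: purely textual CoTs lack spatial grounding and easily overfit to sparsely annotated reasoning steps, while multimodal CoTs generate imagined visual observations and so cause severe token inflation, which makes real-time navigation impractical. This work proposes FantasyVLN, a unified implicit reasoning framework that keeps the benefits of CoT reasoning without explicit token overhead. Specifically, during CoT reasoning training, imagined visual tokens are encoded into a compact latent space using a pretrained Visual AutoRegressor (VAR), and the model learns jointly from textual, visual, and multimodal CoT modes under a unified multi-CoT strategy. At inference, the model maps instructions directly to actions while still benefiting from reasoning-aware representations. Extensive experiments on LH-VLN show that the approach achieves reasoning-aware yet real-time navigation, improving success rates and efficiency while reducing inference latency by an order of magnitude compared with explicit CoT methods.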
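
The training/inference split described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the mode names, the `Sample` fields, the `build_target` and `sample_training_mode` helpers, and the uniform mode sampling are all assumptions made for illustration. The abstract only states that textual, visual, and multimodal CoT modes are learned jointly and that inference maps instructions directly to actions.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical mode names; the abstract names the three CoT modes but not
# how they are represented or mixed during training.
TRAIN_MODES = ("textual_cot", "visual_cot", "multimodal_cot")


@dataclass
class Sample:
    instruction: str            # natural-language navigation instruction
    text_reasoning: str         # annotated textual reasoning step
    visual_latents: List[int]   # stand-in for compact latent codes of an imagined view
    action: str                 # ground-truth action


def build_target(sample: Sample, mode: str) -> List[Tuple[str, object]]:
    """Assemble an illustrative supervision target for one mode."""
    action = ("action", sample.action)
    if mode == "textual_cot":
        return [("text", sample.text_reasoning), action]
    if mode == "visual_cot":
        return [("latent", sample.visual_latents), action]
    if mode == "multimodal_cot":
        return [("text", sample.text_reasoning),
                ("latent", sample.visual_latents),
                action]
    if mode == "direct":
        # Inference-time path: no explicit reasoning tokens are produced.
        return [action]
    raise ValueError(f"unknown mode: {mode}")


def sample_training_mode(rng: random.Random) -> str:
    """Pick a CoT mode per training example (uniform choice is an assumption)."""
    return rng.choice(TRAIN_MODES)


if __name__ == "__main__":
    rng = random.Random(0)
    s = Sample("go to the kitchen", "the kitchen is to the left", [3, 7], "turn_left")
    mode = sample_training_mode(rng)
    print(mode, build_target(s, mode))
    print("direct", build_target(s, "direct"))
```

In this sketch the `direct` target contains only the action, which mirrors the abstract's claim that explicit reasoning tokens are avoided at inference time.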


Related papers