Fugu-MT Paper Translation (Overview): SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation

Paper overview: SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation

arxiv url: http://arxiv.org/abs/2604.27620v1
Date: Thu, 30 Apr 2026 09:09:40 GMT
Status: Translation complete
Last updated in system: 2026-05-01 16:31:54.015323
Title: SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation
Title (reference translation): SpaAct: Curriculum-Adaptive Spatially-Activated Transition Learning for Vision-Language Navigation
Authors: Pengna Li, Kangyi Wu, Shaoqing Xu, Fang Li, Hanbing Li, Lin Zhao, Kailin Lyu, Long Chen, Zhi-Xin Yang, Nanning Zheng
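Abstract summary: Vision-and-Language Navigation (VLN) aims to enable an embodied agent to follow natural-language instructions and navigate to a target location in unseen 3D environments. We argue that adapting VLMs to VLN requires endowing them with two complementary capabilities for acquiring such awareness. In this paper, we propose SpaAct, a training framework that activates dynamic spatial awareness in VLMs.
Reference score (independently computed attention score): 40.318940817058746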
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-and-Language Navigation (VLN) aims to enable an embodied agent to follow natural-language instructions and navigate to a target location in unseen 3D environments. We argue that adapting VLMs to VLN requires endowing them with two complementary capabilities for acquiring such awareness, namely backward action reasoning (why) and forward transition prediction (how). Based on this insight, we propose SpaAct, a simple yet effective training framework that activates the dynamic spatial awareness in VLMs. Specifically, SpaAct introduces two spatial activation tasks: Action Retrospection, which asks the model to infer the executed action sequence from visual transitions, and Future Frame Selection, which forces the model to predict the visual transitions conditioned on history and action. These two objectives provide lightweight supervision on both backward action reasoning and forward transition prediction, encouraging the model to build dynamic spatial awareness in a VLM-friendly way. To further stabilize adaptation, we design TriPA, a Tri-factor Progressive Adaptive curriculum learning method that organizes training samples from easy to hard, allowing the model to gradually acquire navigation skills from basic locomotion to long-horizon reasoning. Experiments on standard VLN-CE benchmarks show that SpaAct consistently improves VLM-based navigation and achieves state-of-the-art performance. We will release the code and models to support future research.
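Abstract (reference translation): Vision-and-Language Navigation (VLN) aims to enable an embodied agent to follow natural-language instructions and navigate to a target location in unseen 3D environments. We argue that adapting VLMs to VLN requires two complementary capabilities for acquiring such awareness, namely backward action reasoning (why) and forward transition prediction (how). Based on this insight, we propose SpaAct, a simple and effective training framework that activates dynamic spatial awareness in VLMs. Specifically, SpaAct introduces two spatial activation tasks: Action Retrospection asks the model to infer the executed action sequence from visual transitions, and Future Frame Selection has the model predict visual transitions conditioned on history and action. These two objectives provide lightweight supervision of both backward action reasoning and forward transition prediction, encouraging the model to build dynamic spatial awareness in a VLM-friendly way. To further improve adaptability, we design TriPA, a tri-factor progressive adaptive learning method that organizes training samples from easy to hard so that navigation skills are acquired gradually, from basic locomotion to long-horizon reasoning. Experiments on standard VLN-CE benchmarks show that SpaAct consistently improves VLM-based navigation and achieves state-of-the-art performance. We will release the code and models to support future research.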


Related papers