Fugu-MT Paper Translation (Abstract): GoViG: Goal-Conditioned Visual Navigation Instruction Generation

Paper overview: GoViG: Goal-Conditioned Visual Navigation Instruction Generation

arxiv url: http://arxiv.org/abs/2508.09547v1
Date: Wed, 13 Aug 2025 07:05:17 GMT
Status: Translation complete
Last updated in system: 2025-08-14 20:42:00.79285
Title: GoViG: Goal-Conditioned Visual Navigation Instruction Generation
Authors: Fengyi Wu, Yifei Dong, Zhi-Qi Cheng, Yilong Dai, Guangyu Chen, Hang Wang, Qi Dai, Alexander G. Hauptmann
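Abstract summary: This paper introduces Goal-Conditioned Visual Navigation Instruction Generation (GoViG). GoViG relies only on raw egocentric visual data, which substantially improves its adaptability to unseen and unstructured environments.
Reference score (attention score computed by this site): 69.79110149746506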
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce Goal-Conditioned Visual Navigation Instruction Generation (GoViG), a new task that aims to autonomously generate precise and contextually coherent navigation instructions solely from egocentric visual observations of initial and goal states. Unlike conventional approaches that rely on structured inputs such as semantic annotations or environmental maps, GoViG exclusively leverages raw egocentric visual data, substantially improving its adaptability to unseen and unstructured environments. Our method addresses this task by decomposing it into two interconnected subtasks: (1) visual forecasting, which predicts intermediate visual states bridging the initial and goal views; and (2) instruction generation, which synthesizes linguistically coherent instructions grounded in both observed and anticipated visuals. These subtasks are integrated within an autoregressive multimodal large language model trained with tailored objectives to ensure spatial accuracy and linguistic clarity. Furthermore, we introduce two complementary multimodal reasoning strategies, one-pass and interleaved reasoning, to mimic incremental human cognitive processes during navigation. To evaluate our method, we propose the R2R-Goal dataset, combining diverse synthetic and real-world trajectories. Empirical results demonstrate significant improvements over state-of-the-art methods, achieving superior BLEU-4 and CIDEr scores along with robust cross-domain generalization.
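The abstract names two subtasks (visual forecasting and instruction generation) and two reasoning strategies (one-pass and interleaved) but does not specify how they are sequenced. The sketch below shows one plausible reading of that control flow, not the authors' implementation. Every name in it (`Frame`, `one_pass`, `interleaved`, `toy_forecast`, `toy_instruct`, the `steps` parameter) is hypothetical, and the stand-in functions replace what the paper describes as a single autoregressive multimodal large language model.

```python
from typing import Callable, List

# Assumption: a frame is represented by a string label; a real system would use images.
Frame = str
Forecaster = Callable[[List[Frame], Frame], Frame]  # (frames so far, goal) -> next frame
Instructor = Callable[[List[Frame], Frame], str]    # (frames, goal) -> instruction text


def one_pass(start: Frame, goal: Frame, forecast: Forecaster,
             instruct: Instructor, steps: int) -> str:
    """Assumed one-pass reading: forecast every intermediate frame, then write once."""
    frames = [start]
    for _ in range(steps):
        frames.append(forecast(frames, goal))
    return instruct(frames, goal)


def interleaved(start: Frame, goal: Frame, forecast: Forecaster,
                instruct: Instructor, steps: int) -> str:
    """Assumed interleaved reading: alternate one forecast with one instruction fragment."""
    frames = [start]
    parts = []
    for _ in range(steps):
        frames.append(forecast(frames, goal))
        # Each fragment is grounded in the previous frame and the newly forecast one.
        parts.append(instruct(frames[-2:], goal))
    return "; ".join(parts)


# Toy stand-ins so the sketch runs end to end; they carry no learned behaviour.
def toy_forecast(frames: List[Frame], goal: Frame) -> Frame:
    return f"view{len(frames)}"


def toy_instruct(frames: List[Frame], goal: Frame) -> str:
    return "pass " + ", ".join(frames[1:]) + f" toward {goal}"


if __name__ == "__main__":
    print(one_pass("start", "goal", toy_forecast, toy_instruct, 2))
    print(interleaved("start", "goal", toy_forecast, toy_instruct, 2))
```

In this reading, the one-pass variant conditions a single instruction on the full forecast sequence, while the interleaved variant builds the instruction incrementally as each intermediate view is predicted.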


Related papers