Fugu-MT Paper Translation (Abstract): Learning to Retrieve Navigable Candidates for Efficient Vision-and-Language Navigation

Paper overview: Learning to Retrieve Navigable Candidates for Efficient Vision-and-Language Navigation

arxiv url: http://arxiv.org/abs/2602.15724v1
Date: Tue, 17 Feb 2026 17:00:11 GMT
Status: Translation complete
System update date: 2026-02-18 16:03:18.132622
Title: Learning to Retrieve Navigable Candidates for Efficient Vision-and-Language Navigation
Title (reference translation): Learning to Extract Navigable Candidates for Efficient Vision-and-Language Navigation
Authors: Shutian Gu, Chengkai Huang, Ruoyu Wang, Lina Yao
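Abstract summary: Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions and navigate previously unseen environments. This paper proposes a retrieval-augmented framework that improves VLN without modifying or fine-tuning the underlying language model.
Reference score (independently computed attention score): 15.242490558864626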
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions and navigate through previously unseen environments. Recent approaches increasingly employ large language models (LLMs) as high-level navigators due to their flexibility and reasoning capability. However, prompt-based LLM navigation often suffers from inefficient decision-making, as the model must repeatedly interpret instructions from scratch and reason over noisy and verbose navigable candidates at each step. In this paper, we propose a retrieval-augmented framework to improve the efficiency and stability of LLM-based VLN without modifying or fine-tuning the underlying language model. Our approach introduces retrieval at two complementary levels. At the episode level, an instruction-level embedding retriever selects semantically similar successful navigation trajectories as in-context exemplars, providing task-specific priors for instruction grounding. At the step level, an imitation-learned candidate retriever prunes irrelevant navigable directions before LLM inference, reducing action ambiguity and prompt complexity. Both retrieval modules are lightweight, modular, and trained independently of the LLM. We evaluate our method on the Room-to-Room (R2R) benchmark. Experimental results demonstrate consistent improvements in Success Rate, Oracle Success Rate, and SPL on both seen and unseen environments. Ablation studies further show that instruction-level exemplar retrieval and candidate pruning contribute complementary benefits to global guidance and step-wise decision efficiency. These results indicate that retrieval-augmented decision support is an effective and scalable strategy for enhancing LLM-based vision-and-language navigation.
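Abstract (reference translation): Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions and navigate previously unseen environments. Recent approaches increasingly employ large language models (LLMs) as high-level navigators because of their flexibility and reasoning capability. However, prompt-based LLM navigation often suffers from inefficient decision-making. This paper proposes a retrieval-augmented framework that improves the efficiency and stability of LLM-based VLN without modifying or fine-tuning the underlying language model. The proposed method introduces retrieval at two complementary levels. At the episode level, an instruction-level embedding retriever selects semantically similar successful navigation trajectories as in-context exemplars, providing task-specific priors for instruction grounding. At the step level, an imitation-learned candidate retriever prunes irrelevant navigable directions before LLM inference, reducing action ambiguity and prompt complexity. Both retrieval modules are lightweight and modular, and are trained independently of the LLM. The method is evaluated on the Room-to-Room (R2R) benchmark. Experimental results show consistent improvements in Success Rate, Oracle Success Rate, and SPL in both seen and unseen environments. Ablation studies show that instruction-level exemplar retrieval and candidate pruning provide complementary benefits for global guidance and step-wise decision efficiency. These results suggest that retrieval-augmented decision support is an effective and scalable strategy for improving LLM-based vision-and-language navigation.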
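
The two retrieval levels described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`retrieve_exemplars`, `prune_candidates`, `build_prompt`), the use of cosine similarity, the top-k selection, and the prompt layout are all assumptions made for illustration. In particular, the per-candidate scores would come from the imitation-learned candidate retriever, which is represented here only by a list of numbers passed in by the caller.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_exemplars(
    query: Vector, bank: List[Tuple[Vector, str]], k: int = 2
) -> List[str]:
    """Episode level (assumed form): return the k stored trajectories whose
    instruction embeddings are most similar to the query embedding."""
    ranked = sorted(bank, key=lambda item: cosine(query, item[0]), reverse=True)
    return [trajectory for _, trajectory in ranked[:k]]


def prune_candidates(
    candidates: List[str], scores: List[float], keep: int = 3
) -> List[str]:
    """Step level (assumed form): keep the `keep` highest-scoring candidates,
    preserving their original order. `scores` stands in for the output of a
    learned candidate retriever."""
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    top = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)[:keep]
    return [candidates[i] for i in sorted(top)]


def build_prompt(instruction: str, exemplars: List[str], candidates: List[str]) -> str:
    """Assemble a hypothetical prompt from exemplars and pruned candidates."""
    lines = ["Examples of successful navigation:"]
    lines += [f"- {e}" for e in exemplars]
    lines.append(f"Instruction: {instruction}")
    lines.append("Navigable candidates:")
    lines += [f"{i}. {c}" for i, c in enumerate(candidates)]
    return "\n".join(lines)
```

In this sketch the language model itself is untouched: retrieval only changes what goes into the prompt, which matches the abstract's claim that both modules operate without modifying or fine-tuning the LLM.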

Related papers