Fugu-MT paper translation (abstract): DV-VLN: Dual Verification for Reliable LLM-Based Vision-and-Language Navigation

Paper overview: DV-VLN: Dual Verification for Reliable LLM-Based Vision-and-Language Navigation

arxiv url: http://arxiv.org/abs/2601.18492v1
Date: Mon, 26 Jan 2026 13:47:55 GMT
Status: translation complete
Last updated in system: 2026-03-23 08:17:40.949407
Title: DV-VLN: Dual Verification for Reliable LLM-Based Vision-and-Language Navigation
Authors: Zijun Li, Shijie Li, Zhenxi Zhang, Bin Li, Shoujun Zhou
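Abstract summary: Vision-and-Language Navigation (VLN) requires an embodied agent to navigate a complex 3D environment according to natural language instructions. Recent progress in large language models (LLMs) has enabled language-driven navigation with improved interpretability. DV-VLN is a new VLN framework that follows a generate-then-verify paradigm.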
Reference score (independently computed attention metric): 18.493700097379186
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-and-Language Navigation (VLN) requires an embodied agent to navigate in a complex 3D environment according to natural language instructions. Recent progress in large language models (LLMs) has enabled language-driven navigation with improved interpretability. However, most LLM-based agents still rely on single-shot action decisions, where the model must choose one option from noisy, textualized multi-perspective observations. Due to local mismatches and imperfect intermediate reasoning, such decisions can easily deviate from the correct path, leading to error accumulation and reduced reliability in unseen environments. In this paper, we propose DV-VLN, a new VLN framework that follows a generate-then-verify paradigm. DV-VLN first performs parameter-efficient in-domain adaptation of an open-source LLaMA-2 backbone to produce a structured navigational chain-of-thought, and then verifies candidate actions with two complementary channels: True-False Verification (TFV) and Masked-Entity Verification (MEV). DV-VLN selects actions by aggregating verification successes across multiple samples, yielding interpretable scores for reranking. Experiments on R2R, RxR (English subset), and REVERIE show that DV-VLN consistently improves over direct prediction and sampling-only baselines, achieving competitive performance among language-only VLN agents and promising results compared with several cross-modal systems. Code is available at https://github.com/PlumJun/DV-VLN.
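The abstract says that DV-VLN selects actions by aggregating verification successes across multiple samples. A minimal sketch of that aggregation step follows. It is not taken from the authors' code: the `VerifiedSample` type, the action names, the boolean outcomes for TFV and MEV, and the equal weighting of the two channels are all assumptions made for illustration, since the abstract does not state how the channels are combined.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class VerifiedSample:
    """One sampled reasoning chain plus the outcome of its two checks.

    Hypothetical structure: the abstract does not define a data format.
    """
    action: str        # candidate action proposed by this sample
    tfv_passed: bool   # True-False Verification outcome (assumed boolean)
    mev_passed: bool   # Masked-Entity Verification outcome (assumed boolean)


def rerank_actions(samples):
    """Return (action, score) pairs, highest score first.

    The score is the number of verification successes an action collects
    over all samples; equal weighting of TFV and MEV is an assumption.
    """
    scores = defaultdict(int)
    for sample in samples:
        scores[sample.action] += int(sample.tfv_passed) + int(sample.mev_passed)
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def select_action(samples):
    """Pick the top-ranked action, or raise if there is nothing to rank."""
    ranked = rerank_actions(samples)
    if not ranked:
        raise ValueError("at least one verified sample is required")
    return ranked[0][0]


# Illustrative samples with made-up action names.
samples = [
    VerifiedSample("forward", tfv_passed=True, mev_passed=True),
    VerifiedSample("forward", tfv_passed=True, mev_passed=False),
    VerifiedSample("turn_left", tfv_passed=False, mev_passed=True),
    VerifiedSample("turn_left", tfv_passed=False, mev_passed=False),
    VerifiedSample("stop", tfv_passed=False, mev_passed=False),
]
ranking = rerank_actions(samples)
chosen = select_action(samples)
```

Because each score is a plain count of passed checks, it can be read directly as evidence for an action, which is one way to get the interpretable reranking scores the abstract mentions.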


Related papers