Fugu-MT paper translation (abstract): Run, Ruminate, and Regulate: A Dual-process Thinking System for Vision-and-Language Navigation

Paper overview: Run, Ruminate, and Regulate: A Dual-process Thinking System for Vision-and-Language Navigation

arxiv url: http://arxiv.org/abs/2511.14131v1
Date: Tue, 18 Nov 2025 04:32:00 GMT
Status: translation complete
System update date: 2025-11-19 16:23:52.934743
Title: Run, Ruminate, and Regulate: A Dual-process Thinking System for Vision-and-Language Navigation
Title (reference translation): Run, Ruminate, and Regulate: A Dual-process Thinking System for Vision-and-Language Navigation
Authors: Yu Zhong, Zihao Zhang, Rui Zhang, Lingdong Huang, Haihan Gao, Shuo Wang, Da Li, Ruijian Han, Jiaming Guo, Shaohui Peng, Di Huang, Yunji Chen
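Abstract summary: Vision-and-Language Navigation (VLN) requires an agent to dynamically explore complex 3D environments following human instructions. Recent research highlights the potential of applying large language models (LLMs) to VLN, given their commonsense knowledge and general reasoning capabilities. This paper proposes a novel dual-process thinking framework called R3 that integrates LLMs' generalization capabilities with VLN-specific expertise in a zero-shot manner.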
Reference score (site-computed attention score): 52.11339614452127
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-and-Language Navigation (VLN) requires an agent to dynamically explore complex 3D environments following human instructions. Recent research underscores the potential of harnessing large language models (LLMs) for VLN, given their commonsense knowledge and general reasoning capabilities. Despite their strengths, a substantial gap in task completion performance persists between LLM-based approaches and domain experts, as LLMs inherently struggle to comprehend real-world spatial correlations precisely. Additionally, introducing LLMs is accompanied by substantial computational cost and inference latency. To address these issues, we propose a novel dual-process thinking framework dubbed R3, integrating LLMs' generalization capabilities with VLN-specific expertise in a zero-shot manner. The framework comprises three core modules: Runner, Ruminator, and Regulator. The Runner is a lightweight transformer-based expert model that ensures efficient and accurate navigation under regular circumstances. The Ruminator employs a powerful multimodal LLM as the backbone and adopts chain-of-thought (CoT) prompting to elicit structured reasoning. The Regulator monitors the navigation progress and controls the appropriate thinking mode according to three criteria, integrating Runner and Ruminator harmoniously. Experimental results illustrate that R3 significantly outperforms other state-of-the-art methods, exceeding them by 3.28% and 3.30% in SPL and RGSPL, respectively, on the REVERIE benchmark. This pronounced enhancement highlights the effectiveness of our method in handling challenging VLN tasks.
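Abstract (reference translation): Vision-and-Language Navigation (VLN) requires an agent to dynamically explore complex 3D environments according to human instructions. Recent work highlights the potential of applying large language models (LLMs) to VLN because of their commonsense knowledge and general reasoning ability. Despite these strengths, a considerable gap in task-completion performance remains between LLM-based approaches and domain experts, since LLMs inherently struggle to understand real-world spatial correlations precisely. In addition, introducing LLMs incurs substantial computational cost and inference latency. To address these problems, we propose a new dual-process thinking framework called R3, which integrates the generalization capabilities of LLMs with VLN-specific expertise in a zero-shot manner. The framework consists of three core modules: Runner, Ruminator, and Regulator. The Runner is a lightweight transformer-based expert model that provides efficient, accurate navigation under normal conditions. The Ruminator uses a powerful multimodal LLM as its backbone and applies chain-of-thought (CoT) prompting to elicit structured reasoning. The Regulator monitors navigation progress and selects the appropriate thinking mode according to three criteria. Experiments show that R3 substantially outperforms other state-of-the-art methods on the REVERIE benchmark, by 3.28% in SPL and 3.30% in RGSPL. This improvement underscores the effectiveness of the method on challenging VLN tasks.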
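
The division of labor among the three modules can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract says the Regulator switches thinking modes according to three criteria but does not name them, so the three predicates below are placeholders, and the rule "use the Ruminator if any criterion fires" is likewise an assumption. `NavState`, its fields, `regulate`, and the stand-in `runner` and `ruminator` functions are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class NavState:
    """Hypothetical snapshot of navigation progress (fields are illustrative)."""
    step: int                 # number of actions taken so far
    runner_confidence: float  # assumed: Runner's probability for its top action
    revisits: int             # assumed: times the current viewpoint was visited


# A criterion returns True when slow, deliberate reasoning seems warranted.
Criterion = Callable[[NavState], bool]
Policy = Callable[[NavState], str]


def regulate(state: NavState,
             criteria: Sequence[Criterion],
             runner: Policy,
             ruminator: Policy) -> Tuple[str, str]:
    """Pick a thinking mode and return (mode, action).

    Assumption: the slow path is taken if ANY criterion fires; the
    abstract does not say how the three criteria are combined.
    """
    if any(criterion(state) for criterion in criteria):
        return "ruminate", ruminator(state)
    return "run", runner(state)


# Placeholder criteria: the abstract mentions three but does not define them.
criteria = [
    lambda s: s.runner_confidence < 0.5,  # fast model is unsure
    lambda s: s.revisits >= 2,            # agent appears to be looping
    lambda s: s.step >= 10,               # episode is running long
]


def runner(state: NavState) -> str:
    """Stand-in for the lightweight transformer-based expert."""
    return "forward"


def ruminator(state: NavState) -> str:
    """Stand-in for the multimodal LLM queried with CoT prompting."""
    return "turn_left"


print(regulate(NavState(step=1, runner_confidence=0.9, revisits=0),
               criteria, runner, ruminator))  # ('run', 'forward')
print(regulate(NavState(step=2, runner_confidence=0.3, revisits=0),
               criteria, runner, ruminator))  # ('ruminate', 'turn_left')
```

Passing the criteria in as plain predicates keeps the expensive LLM call off the common path, which matches the abstract's stated motivation of limiting computational cost and inference latency.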
