Fugu-MT Paper Translation (Abstract): CLASH: Collaborative Large-Small Hierarchical Framework for Continuous Vision-and-Language Navigation

Paper overview: CLASH: Collaborative Large-Small Hierarchical Framework for Continuous Vision-and-Language Navigation

arxiv url: http://arxiv.org/abs/2512.10360v2
Date: Fri, 23 Jan 2026 08:54:08 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:40.277241
Title: CLASH: Collaborative Large-Small Hierarchical Framework for Continuous Vision-and-Language Navigation
Authors: Liuyi Wang, Zongtao He, Jinlong Li, Ruihao Xia, Mengxian Hu, Chenpeng Yao, Chengju Liu, Yang Tang, Qijun Chen
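Abstract summary: Vision-and-Language Navigation (VLN) requires robots to follow natural language instructions and navigate complex environments without prior maps. We propose CLASH, a VLN-CE framework that integrates a reactive small-model planner (RSMP) with a reflective large-model reasoner (RLMR). In simulation, we replace the rule-based controller with a fully learnable point-goal policy, and in real-world deployment, we design a LiDAR-based clustering module for generating navigable waypoints.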
Reference score (independently computed attention score): 38.37757288365413
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-and-Language Navigation (VLN) requires robots to follow natural language instructions and navigate complex environments without prior maps. While recent vision-language large models demonstrate strong reasoning abilities, they often underperform task-specific panoramic small models in VLN tasks. To address this, we propose CLASH (Collaborative Large-Small Hierarchy), a VLN-CE framework that integrates a reactive small-model planner (RSMP) with a reflective large-model reasoner (RLMR). RSMP adopts a causal-learning-based dual-branch architecture to enhance generalization, while RLMR leverages panoramic visual prompting with chain-of-thought reasoning to support interpretable spatial understanding and navigation. We further introduce an uncertainty-aware collaboration mechanism (UCM) that adaptively fuses decisions from both models. For obstacle avoidance, in simulation, we replace the rule-based controller with a fully learnable point-goal policy, and in real-world deployment, we design a LiDAR-based clustering module for generating navigable waypoints and pair it with an online SLAM-based local controller. CLASH achieves state-of-the-art (SoTA) results (ranking 1st) on the VLN-CE leaderboard, significantly improving SR and SPL on the test-unseen set over the previous SoTA methods. Real-world experiments demonstrate CLASH's strong robustness, validating its effectiveness in both simulation and deployment scenarios.

Related papers