Fugu-MT Paper Translation (Abstract): HiMemVLN: Enhancing Reliability of Open-Source Zero-Shot Vision-and-Language Navigation with Hierarchical Memory System

Paper overview: HiMemVLN: Enhancing Reliability of Open-Source Zero-Shot Vision-and-Language Navigation with Hierarchical Memory System

arxiv url: http://arxiv.org/abs/2603.14807v1
Date: Mon, 16 Mar 2026 04:23:29 GMT
Status: Translation complete
Last updated in system: 2026-03-17 16:19:36.048162
Title: HiMemVLN: Enhancing Reliability of Open-Source Zero-Shot Vision-and-Language Navigation with Hierarchical Memory System
Title (reference translation): HiMemVLN: Improving the Reliability of Open-Source Zero-Shot Vision-and-Language Navigation with a Hierarchical Memory System
Authors: Kailin Lyu, Kangyi Wu, Pengna Li, Xiuyu Hu, Qingyi Si, Cui Miao, Ning Yang, Zihang Wang, Long Xiao, Lianyu Hu, Jingyuan Sun, Ce Hao
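Abstract summary: We propose HiMemVLN, which incorporates a Hierarchical Memory System into a multimodal large model. Experiments in both real-world and simulated environments show that HiMemVLN achieves nearly twice the performance of the open-source state-of-the-art method.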
Reference score (independently computed attention score): 12.907741491900731
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLM-based agents have demonstrated impressive zero-shot performance in vision-language navigation (VLN) tasks. However, most zero-shot methods primarily rely on closed-source LLMs as navigators, which face challenges related to high token costs and potential data leakage risks. Recent efforts have attempted to address this by using open-source LLMs combined with a spatiotemporal CoT framework, but they still fall far short compared to closed-source models. In this work, we identify a critical issue, Navigation Amnesia, through a detailed analysis of the navigation process. This issue leads to navigation failures and amplifies the gap between open-source and closed-source methods. To address this, we propose HiMemVLN, which incorporates a Hierarchical Memory System into a multimodal large model to enhance visual perception recall and long-term localization, mitigating the amnesia issue and improving the agent's navigation performance. Extensive experiments in both simulated and real-world environments demonstrate that HiMemVLN achieves nearly twice the performance of the open-source state-of-the-art method. The code is available at https://github.com/lvkailin0118/HiMemVLN.
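Abstract (reference translation): LLM-based agents have shown impressive zero-shot performance on vision-language navigation (VLN) tasks. However, most zero-shot methods rely on closed-source LLMs as navigators, which face challenges of high token costs and potential data leakage. Recent efforts address this by combining open-source LLMs with a spatiotemporal CoT framework. In this work, a detailed analysis of the navigation process identifies a critical issue, Navigation Amnesia. This issue causes navigation failures and widens the gap between open-source and closed-source methods. To address it, we propose HiMemVLN, which incorporates a Hierarchical Memory System into a multimodal large model. Extensive experiments in both simulated and real-world environments show that HiMemVLN achieves nearly twice the performance of the open-source state-of-the-art method. The code is available at https://github.com/lvkailin0118/HiMemVLN.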

Related papers