Related papers: MSNav: Zero-Shot Vision-and-Language Navigation with Dynamic Memory and LLM Spatial Reasoning

MSNav: Zero-Shot Vision-and-Language Navigation with Dynamic Memory and LLM Spatial Reasoning

URL: http://arxiv.org/abs/2508.16654v3
Date: Wed, 10 Sep 2025 11:47:41 GMT
Title: MSNav: Zero-Shot Vision-and-Language Navigation with Dynamic Memory and LLM Spatial Reasoning
Authors: Chenghao Liu, Zhimu Zhou, Jiachen Zhang, Minghao Zhang, Songfang Huang, Huiling Duan,
Abstract summary: Vision-and-Language Navigation (VLN) requires an agent to interpret natural language instructions and navigate complex environments.<n>Current approaches often adopt a "black-box" paradigm, where a single Large Language Model (LLM) makes end-to-end decisions.<n>We propose Memory Spatial Navigation(MSNav), a framework that fuses three modules into a synergistic architecture.
Score: 34.08837799906451
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-and-Language Navigation (VLN) requires an agent to interpret natural language instructions and navigate complex environments. Current approaches often adopt a "black-box" paradigm, where a single Large Language Model (LLM) makes end-to-end decisions. However, it is plagued by critical vulnerabilities, including poor spatial reasoning, weak cross-modal grounding, and memory overload in long-horizon tasks. To systematically address these issues, we propose Memory Spatial Navigation(MSNav), a framework that fuses three modules into a synergistic architecture, which transforms fragile inference into a robust, integrated intelligence. MSNav integrates three modules: Memory Module, a dynamic map memory module that tackles memory overload through selective node pruning, enhancing long-range exploration; Spatial Module, a module for spatial reasoning and object relationship inference that improves endpoint recognition; and Decision Module, a module using LLM-based path planning to execute robust actions. Powering Spatial Module, we also introduce an Instruction-Object-Space (I-O-S) dataset and fine-tune the Qwen3-4B model into Qwen-Spatial (Qwen-Sp), which outperforms leading commercial LLMs in object list extraction, achieving higher F1 and NDCG scores on the I-O-S test set. Extensive experiments on the Room-to-Room (R2R) and REVERIE datasets demonstrate MSNav's state-of-the-art performance with significant improvements in Success Rate (SR) and Success weighted by Path Length (SPL).

Related papers

TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation [70.23578202012048]
Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch.<n>We propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone.<n>To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment.<n>With the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction.
arXiv Detail & Related papers (2026-03-03T13:28:07Z)
APEX: A Decoupled Memory-based Explorer for Asynchronous Aerial Object Goal Navigation [26.546610806602803]
Aerial Object Goal Navigation, a challenging frontier in Embodied AI, requires an Unmanned Aerial Vehicle (UAV) agent to autonomously explore, reason, and identify a specific target using only visual perception and language description.<n>Existing methods struggle with the memorization of complex spatial representations in aerial environments, reliable and interpretable action decision-making, and inefficient exploration and information gathering.<n>We introduce textAPEX, a novel hierarchical agent designed for efficient exploration and target acquisition in complex aerial settings.
arXiv Detail & Related papers (2026-01-31T06:27:57Z)
CAM: A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension [55.29309306566238]
Current Large Language Models (LLMs) are confronted with overwhelming information volume when comprehending long-form documents.<n>This challenge raises the imperative of a cohesive memory module, which can elevate vanilla LLMs into autonomous reading agents.<n>We draw inspiration from Jean Piaget's Constructivist Theory, illuminating three traits of the agentic memory -- structured schemata, flexible assimilation, and dynamic accommodation.
arXiv Detail & Related papers (2025-10-07T02:16:30Z)
OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment [79.98946571424607]
We present OmniBridge, a unified framework that supports vision-language understanding, generation, and retrieval within a unified architecture.<n>To address the challenge of task interference, we propose a two-stage decoupled training strategy.<n>Experiments demonstrate that OmniBridge achieves competitive or state-of-the-art performance in all three tasks.
arXiv Detail & Related papers (2025-09-23T13:57:55Z)
MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents [84.62985963113245]
We introduce MEM1, an end-to-end reinforcement learning framework that enables agents to operate with constant memory across long multi-turn tasks.<n>At each turn, MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning.<n>We show that MEM1-7B improves performance by 3.5x while reducing memory usage by 3.7x compared to Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task.
arXiv Detail & Related papers (2025-06-18T19:44:46Z)
Memory-Centric Embodied Question Answer [39.3863762723862]
Embodied Question Answering (EQA) requires agents to autonomously explore and understand the environment to answer context-dependent questions.<n>We propose a memory-centric EQA framework named MemoryEQA.<n>Unlike planner-centric EQA models where the memory module cannot fully interact with other modules, MemoryEQA flexible feeds memory information into all modules.
arXiv Detail & Related papers (2025-05-20T05:27:57Z)
LM2: Large Memory Models [11.320069795732058]
This paper introduces the Large Memory Model (LM2), a decoder-only Transformer architecture enhanced with an auxiliary memory module.<n> Experimental results on the BABILong benchmark demonstrate that the LM2model outperforms both the memory-augmented RMT model by 37.1% and the baseline Llama-3.2 model by 86.3% on average across tasks.
arXiv Detail & Related papers (2025-02-09T22:11:42Z)
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions [104.90258030688256]
This project introduces disentangled streaming perception, reasoning, and memory mechanisms, enabling real-time interaction with streaming video and audio input.<n>This project simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time.
arXiv Detail & Related papers (2024-12-12T18:58:30Z)
MemLLM: Finetuning LLMs to Use An Explicit Read-Write Memory [49.96019697955383]
We introduce MemLLM, a novel method of enhancing large language models (LLMs) by integrating a structured and explicit read-and-write memory module.<n>Our experiments indicate that MemLLM enhances the LLM's performance and interpretability, in language modeling in general and knowledge-intensive tasks in particular.
arXiv Detail & Related papers (2024-04-17T18:13:16Z)
ModuleFormer: Modularity Emerges from Mixture-of-Experts [60.6148988099284]
This paper proposes a new neural network architecture, ModuleFormer, to improve the efficiency and flexibility of large language models. Unlike the previous SMoE-based modular language model, ModuleFormer can induce modularity from uncurated data.
arXiv Detail & Related papers (2023-06-07T17:59:57Z)
End-to-End Egospheric Spatial Memory [32.42361470456194]
We propose a parameter-free module, Egospheric Spatial Memory (ESM), which encodes the memory in an ego-sphere around the agent. ESM can be trained end-to-end via either imitation or reinforcement learning. We show applications to semantic segmentation on the ScanNet dataset, where ESM naturally combines image-level and map-level inference modalities.
arXiv Detail & Related papers (2021-02-15T18:59:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.