Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System
- URL: http://arxiv.org/abs/2506.19433v1
- Date: Tue, 24 Jun 2025 09:00:43 GMT
- Title: Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System
- Authors: Lixuan He, Haoyu Dong, Zhenxing Chen, Yangcheng Yu, Jie Feng, Yong Li
- Abstract summary: Vision-and-Language Navigation (VLN) in large-scale urban environments requires embodied agents to ground linguistic instructions in complex scenes. We introduce Mem4Nav, a hierarchical spatial-cognition long-short memory system that can augment any VLN backbone.
- Score: 8.629870144131248
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-and-Language Navigation (VLN) in large-scale urban environments requires embodied agents to ground linguistic instructions in complex scenes and recall relevant experiences over extended time horizons. Prior modular pipelines offer interpretability but lack unified memory, while end-to-end (M)LLM agents excel at fusing vision and language yet remain constrained by fixed context windows and implicit spatial reasoning. We introduce Mem4Nav, a hierarchical spatial-cognition long-short memory system that can augment any VLN backbone. Mem4Nav fuses a sparse octree for fine-grained voxel indexing with a semantic topology graph for high-level landmark connectivity, storing both in trainable memory tokens embedded via a reversible Transformer. Long-term memory (LTM) compresses and retains historical observations at both octree and graph nodes, while short-term memory (STM) caches recent multimodal entries in relative coordinates for real-time obstacle avoidance and local planning. At each step, STM retrieval sharply prunes dynamic context, and, when deeper history is needed, LTM tokens are decoded losslessly to reconstruct past embeddings. Evaluated on Touchdown and Map2Seq across three backbones (modular, state-of-the-art VLN with prompt-based LLM, and state-of-the-art VLN with strided-attention MLLM), Mem4Nav yields 7-13 pp gains in Task Completion, sufficient SPD reduction, and >10 pp nDTW improvement. Ablations confirm the indispensability of both the hierarchical map and dual memory modules. Our code is open-sourced at https://github.com/tsinghua-fib-lab/Mem4Nav.
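To make the dual-memory layout concrete, below is a minimal Python sketch of the structure the abstract describes: a Morton-coded sparse octree plus a landmark graph for long-term memory, and a fixed-capacity, relative-coordinate cache for short-term memory. All class and method names (morton_key, LongTermMemory, ShortTermMemory) are illustrative assumptions, and the reversible-Transformer token compression is replaced by a placeholder running average; this is not the authors' implementation (see the linked repository for that).

```python
import math
from collections import deque

import numpy as np


def morton_key(x: int, y: int, z: int, depth: int = 16) -> int:
    """Interleave the bits of integer voxel coordinates into a Morton code,
    so a sparse octree can be indexed with a flat dictionary."""
    mask = (1 << depth) - 1
    x, y, z = x & mask, y & mask, z & mask
    key = 0
    for i in range(depth):
        key |= ((x >> i) & 1) << (3 * i)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i + 2)
    return key


class LongTermMemory:
    """LTM sketch: memory tokens attached to sparse-octree voxels (fine-grained
    geometry) and to topology-graph landmark nodes (high-level connectivity).
    The paper compresses and decodes these tokens with a reversible Transformer;
    here that step is stubbed out as a running average."""

    def __init__(self, voxel_size: float = 0.5):
        self.voxel_size = voxel_size
        self.octree_tokens = {}   # Morton key -> token (np.ndarray)
        self.graph_tokens = {}    # landmark id -> token
        self.graph_edges = {}     # landmark id -> set of neighbouring landmark ids

    def _voxel_key(self, position) -> int:
        ix, iy, iz = (int(math.floor(c / self.voxel_size)) for c in position)
        return morton_key(ix, iy, iz)

    def write(self, position, embedding: np.ndarray, landmark_id=None) -> None:
        """Fuse a new observation embedding into the voxel (and landmark) token."""
        key = self._voxel_key(position)
        token = self.octree_tokens.get(key, np.zeros_like(embedding))
        self.octree_tokens[key] = 0.5 * (token + embedding)
        if landmark_id is not None:
            node = self.graph_tokens.get(landmark_id, np.zeros_like(embedding))
            self.graph_tokens[landmark_id] = 0.5 * (node + embedding)

    def read(self, position):
        """Return the stored token for the voxel containing `position`
        (decoded losslessly in the paper; returned as-is here)."""
        return self.octree_tokens.get(self._voxel_key(position))


class ShortTermMemory:
    """STM sketch: a small FIFO cache of recent multimodal entries stored in
    agent-relative coordinates for local planning and obstacle avoidance."""

    def __init__(self, capacity: int = 32, radius: float = 10.0):
        self.radius = radius
        self.entries = deque(maxlen=capacity)   # (relative xyz, embedding)

    def add(self, relative_xyz, embedding: np.ndarray) -> None:
        self.entries.append((np.asarray(relative_xyz, dtype=float), embedding))

    def retrieve(self):
        """Keep only entries still within the local radius -- the 'sharp
        pruning of dynamic context' at each step."""
        return [(p, e) for p, e in self.entries if np.linalg.norm(p) <= self.radius]
```

In this sketch an agent would query ShortTermMemory.retrieve() first and fall back to LongTermMemory.read(), keyed by voxel or landmark, only when deeper history is needed, mirroring the retrieval order described in the abstract.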
Related papers
- CoMemo: LVLMs Need Image Context with Image Memory [51.681858871027345]
CoMemo is a dual-path architecture that combines a Context image path with an image Memory path for visual processing. We introduce RoPE-DHR, a novel positional encoding mechanism that employs thumbnail-based positional aggregation to maintain 2D spatial awareness.
arXiv Detail & Related papers (2025-06-06T17:59:06Z) - Breaking Quadratic Barriers: A Non-Attention LLM for Ultra-Long Context Horizons [0.0]
We present a novel non-attention-based architecture for large language models (LLMs) that efficiently handles very long context windows. Unlike traditional Transformer designs, which suffer from quadratic memory and compute overhead due to the self-attention mechanism, our model avoids token-to-token attention entirely.
arXiv Detail & Related papers (2025-05-09T00:25:46Z) - ReWind: Understanding Long Videos with Instructed Learnable Memory [8.002949551539297]
Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding of textual and visual information. We introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity. We empirically demonstrate ReWind's superior performance in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks.
arXiv Detail & Related papers (2024-11-23T13:23:22Z) - MemoNav: Working Memory Model for Visual Navigation [47.011190883888446]
Image-goal navigation is a challenging task that requires an agent to navigate to a goal indicated by an image in unfamiliar environments.
Existing methods utilizing diverse scene memories suffer from inefficient exploration since they use all historical observations for decision-making.
We present MemoNav, a novel memory model for image-goal navigation, which utilizes a working memory-inspired pipeline to improve navigation performance.
arXiv Detail & Related papers (2024-02-29T13:45:13Z) - Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation [79.1669476932147]
Vision-and-Language Navigation (VLN) is a task in which an agent is required to follow a language instruction to navigate to the goal position.
Recent Transformer-based VLN methods have made great progress benefiting from the direct connections between visual observations and the language instruction.
We introduce Multimodal Transformer with Variable-length Memory (MTVM) for visually-grounded natural language navigation.
arXiv Detail & Related papers (2021-11-10T16:04:49Z) - History Aware Multimodal Transformer for Vision-and-Language Navigation [96.80655332881432]
Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes.
We introduce a History Aware Multimodal Transformer (HAMT) to incorporate a long-horizon history into multimodal decision making.
arXiv Detail & Related papers (2021-10-25T22:54:41Z) - Structured Scene Memory for Vision-Language Navigation [155.63025602722712]
We propose Structured Scene Memory (SSM), a crucial architecture for vision-language navigation (VLN).
It is compartmentalized enough to accurately memorize the percepts during navigation.
It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment.
arXiv Detail & Related papers (2021-03-05T03:41:00Z) - Sparse Graphical Memory for Robust Planning [93.39298821537197]
We introduce Sparse Graphical Memory (SGM), a new data structure that stores states and feasible transitions in a sparse memory.
SGM aggregates states according to a novel two-way consistency objective, adapting classic state aggregation criteria to goal-conditioned RL.
We show that SGM significantly outperforms current state-of-the-art methods on long-horizon, sparse-reward visual navigation tasks.
arXiv Detail & Related papers (2020-03-13T17:59:32Z)
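The two-way consistency idea behind SGM lends itself to a compact illustration. The sketch below assumes a learned goal-conditioned distance estimate q(s, g) and a single threshold tau; the function names and the greedy construction are illustrative simplifications, not the paper's exact algorithm.

```python
from typing import Callable, List, TypeVar

State = TypeVar("State")


def two_way_consistent(
    s1: State,
    s2: State,
    waypoints: List[State],
    q: Callable[[State, State], float],
    tau: float,
) -> bool:
    """Treat two states as interchangeable if they are close both as starting
    points (outgoing distances) and as goals (incoming distances) with respect
    to every stored waypoint, under a goal-conditioned distance estimate q."""
    return all(
        abs(q(s1, w) - q(s2, w)) <= tau and abs(q(w, s1) - q(w, s2)) <= tau
        for w in waypoints
    )


def build_sparse_memory(
    states: List[State],
    q: Callable[[State, State], float],
    tau: float,
) -> List[State]:
    """Greedily keep a state only if it is not two-way consistent with any
    node already stored in the sparse memory."""
    memory: List[State] = []
    for s in states:
        if not any(two_way_consistent(s, m, memory, q, tau) for m in memory):
            memory.append(s)
    return memory
```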