JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation
- URL: http://arxiv.org/abs/2509.22548v1
- Date: Fri, 26 Sep 2025 16:29:37 GMT
- Title: JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation
- Authors: Shuang Zeng, Dekang Qi, Xinyuan Chang, Feng Xiong, Shichao Xie, Xiaolong Wu, Shiyi Liang, Mu Xu, Xing Wei
- Abstract summary: Vision-and-Language Navigation requires an embodied agent to navigate through unseen environments, guided by natural language instructions and a continuous video stream. Recent advances in VLN have been driven by the powerful semantic understanding of Multimodal Large Language Models. We propose JanusVLN, a novel VLN framework featuring a dual implicit neural memory that models spatial-geometric and visual-semantic memory as separate, compact, and fixed-size neural representations.
- Score: 22.956416709470503
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-Language Navigation requires an embodied agent to navigate through unseen environments, guided by natural language instructions and a continuous video stream. Recent advances in VLN have been driven by the powerful semantic understanding of Multimodal Large Language Models. However, these methods typically rely on explicit semantic memory, such as building textual cognitive maps or storing historical visual frames. This type of method suffers from spatial information loss, computational redundancy, and memory bloat, which impede efficient navigation. Inspired by the implicit scene representation in human navigation, analogous to the left brain's semantic understanding and the right brain's spatial cognition, we propose JanusVLN, a novel VLN framework featuring a dual implicit neural memory that models spatial-geometric and visual-semantic memory as separate, compact, and fixed-size neural representations. This framework first extends the MLLM to incorporate 3D prior knowledge from the spatial-geometric encoder, thereby enhancing the spatial reasoning capabilities of models based solely on RGB input. Then, the historical key-value caches from the spatial-geometric and visual-semantic encoders are constructed into a dual implicit memory. By retaining only the KVs of tokens in the initial and sliding window, redundant computation is avoided, enabling efficient incremental updates. Extensive experiments demonstrate that JanusVLN outperforms over 20 recent methods to achieve SOTA performance. For example, the success rate improves by 10.5-35.5 compared to methods using multiple data types as input and by 3.6-10.8 compared to methods using more RGB training data. This indicates that the proposed dual implicit neural memory, as a novel paradigm, explores promising new directions for future VLN research. Our project page: https://miv-xjtu.github.io/JanusVLN.github.io/.
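To make the memory mechanism in the abstract concrete, the following is a minimal sketch, not the authors' implementation, of a dual implicit memory built from two fixed-size KV caches: one for the visual-semantic stream and one for the spatial-geometric stream, each retaining only the key/value tensors of an initial window of frames plus a sliding window of the most recent frames. Class names, window sizes, tensor shapes, and the PyTorch dependency are all illustrative assumptions; how the retained KVs are fed back into the MLLM's attention layers is omitted here.

```python
# Sketch of a dual implicit memory with initial + sliding-window KV retention.
# All names and sizes are illustrative; this is not the JanusVLN codebase.
from collections import deque
import torch


class FixedSizeKVCache:
    """Keeps KV pairs for the first `n_initial` frames and the last `n_window` frames."""

    def __init__(self, n_initial: int = 4, n_window: int = 16):
        self.n_initial = n_initial
        self.initial = []                      # KVs of the earliest frames (kept permanently)
        self.window = deque(maxlen=n_window)   # KVs of the most recent frames (oldest evicted)

    def update(self, k: torch.Tensor, v: torch.Tensor) -> None:
        """Incrementally add the KV of one new frame; no recomputation of old frames."""
        if len(self.initial) < self.n_initial:
            self.initial.append((k, v))
        else:
            self.window.append((k, v))

    def memory(self):
        """Concatenate retained keys and values for use as attention context."""
        entries = self.initial + list(self.window)
        keys = torch.cat([k for k, _ in entries], dim=0)
        values = torch.cat([v for _, v in entries], dim=0)
        return keys, values


class DualImplicitMemory:
    """Separate, fixed-size caches for the semantic and spatial-geometric streams."""

    def __init__(self):
        self.semantic = FixedSizeKVCache()
        self.geometric = FixedSizeKVCache()

    def observe(self, frame_kv_semantic, frame_kv_geometric):
        self.semantic.update(*frame_kv_semantic)
        self.geometric.update(*frame_kv_geometric)


# Usage: per-frame KVs of shape (num_tokens, dim) produced by the two encoders.
mem = DualImplicitMemory()
for _ in range(100):                            # a 100-frame video stream
    sem_kv = (torch.randn(64, 256), torch.randn(64, 256))
    geo_kv = (torch.randn(64, 256), torch.randn(64, 256))
    mem.observe(sem_kv, geo_kv)

sem_k, sem_v = mem.semantic.memory()            # bounded at (4 + 16) * 64 tokens
print(sem_k.shape)                              # torch.Size([1280, 256])
```

The point of the sketch is that memory size stays constant regardless of episode length, and each new frame triggers only an append/evict rather than re-encoding the whole history, which is the efficiency argument the abstract makes for incremental updates.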
Related papers
- From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents [78.30630000529133]
We propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, an Episodic Stream, and a Symbolic layer. Experiments confirm the effectiveness of MM-Mem on both offline and streaming tasks.
arXiv Detail & Related papers (2026-03-02T05:12:45Z) - Lang3D-XL: Language Embedded 3D Gaussians for Large-scale Scenes [23.445409551683213]
Embedding a language field in a 3D representation enables richer semantic understanding of spatial environments. We propose a novel approach to address challenges in semantic feature misalignment and inefficiency in memory and runtime. We evaluate our method on the in-the-wild HolyScenes dataset and demonstrate that it surpasses existing approaches in both performance and efficiency.
arXiv Detail & Related papers (2025-12-08T18:39:58Z) - Agentic Learner with Grow-and-Refine Multimodal Semantic Memory [50.81667005063605]
ViLoMem is a dual-stream memory framework that constructs compact, schema-based memory. It encodes visual distraction patterns and logical reasoning errors, enabling MLLMs to learn from their successful and failed experiences.
arXiv Detail & Related papers (2025-11-26T18:55:08Z) - Vision-Language Memory for Spatial Reasoning [4.486751990718678]
We present VLM^2, a vision-language model with persistent memory for spatial reasoning. We show that VLM^2 achieves state-of-the-art performance among video-only models.
arXiv Detail & Related papers (2025-11-25T18:59:02Z) - Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System [8.629870144131248]
Vision-and-Language Navigation (VLN) in large-scale urban environments requires embodied agents to ground linguistic instructions in complex scenes. We introduce Mem4Nav, a hierarchical spatial-cognition long-short memory system that can augment any VLN backbone.
arXiv Detail & Related papers (2025-06-24T09:00:43Z) - NeuraLoc: Visual Localization in Neural Implicit Map with Dual Complementary Features [50.212836834889146]
We propose an efficient and novel visual localization approach based on the neural implicit map with complementary features. Specifically, to enforce geometric constraints and reduce storage requirements, we implicitly learn a 3D keypoint descriptor field. To further address the semantic ambiguity of descriptors, we introduce additional semantic contextual feature fields.
arXiv Detail & Related papers (2025-03-08T08:04:27Z) - Language Driven Occupancy Prediction [13.35971455725581]
We introduce LOcc, an effective and generalizable framework for open-vocabulary occupancy prediction. Our pipeline presents a feasible way to mine the valuable semantic information of images, transferring text labels from images to LiDAR point clouds and ultimately to voxels. By replacing the original prediction head of supervised occupancy models with a geometry head for binary occupancy states and a language head for language features, LOcc effectively uses the generated language ground truth to guide the learning of 3D language volume.
arXiv Detail & Related papers (2024-11-25T03:47:10Z) - Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning [59.13366859237086]
Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm.
We consider visual prompts as additional knowledge that facilitates language models in addressing tasks associated with visual information.
We introduce a novel approach wherein visual prompts are concatenated with the weights of the FFN for visual knowledge injection.
arXiv Detail & Related papers (2024-05-09T08:23:20Z) - Simple linear attention language models balance the recall-throughput tradeoff [60.06020449520365]
We propose BASED, a simple architecture combining linear and sliding window attention. We train language models up to 1.3b parameters and show that BASED matches the strongest sub-quadratic models in perplexity and outperforms them on real-world recall-intensive tasks by 6.22 accuracy points.
arXiv Detail & Related papers (2024-02-28T19:28:27Z) - In-context Autoencoder for Context Compression in a Large Language Model [70.7621953091318]
We propose the In-context Autoencoder (ICAE) to compress a long context into short compact memory slots.
ICAE is first pretrained using both autoencoding and language modeling objectives on massive text data.
arXiv Detail & Related papers (2023-07-13T17:59:21Z) - Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation [79.1669476932147]
Vision-and-Language Navigation (VLN) is a task in which an agent is required to follow a language instruction to navigate to the goal position.
Recent Transformer-based VLN methods have made great progress benefiting from the direct connections between visual observations and the language instruction.
We introduce Multimodal Transformer with Variable-length Memory (MTVM) for visually-grounded natural language navigation.
arXiv Detail & Related papers (2021-11-10T16:04:49Z) - Robust High-dimensional Memory-augmented Neural Networks [13.82206983716435]
Memory-augmented neural networks enhance neural networks with an explicit memory to overcome the limitations of conventional neural networks.
Access to this explicit memory occurs via soft read and write operations involving every individual memory entry.
We propose a robust architecture that employs a computational memory unit as the explicit memory performing analog in-memory computation on high-dimensional (HD) vectors.
arXiv Detail & Related papers (2020-10-05T12:01:56Z) - MANTRA: Memory Augmented Networks for Multiple Trajectory Prediction [26.151761714896118]
We address the problem of multimodal trajectory prediction by exploiting a Memory Augmented Neural Network.
Our method learns past and future trajectory embeddings using recurrent neural networks and exploits an associative external memory to store and retrieve such embeddings.
Trajectory prediction is then performed by decoding in-memory future encodings conditioned with the observed past.
arXiv Detail & Related papers (2020-06-05T09:49:59Z) - Sparse Graphical Memory for Robust Planning [93.39298821537197]
We introduce Sparse Graphical Memory (SGM), a new data structure that stores states and feasible transitions in a sparse memory.
SGM aggregates states according to a novel two-way consistency objective, adapting classic state aggregation criteria to goal-conditioned RL.
We show that SGM significantly outperforms current state-of-the-art methods on long horizon, sparse-reward visual navigation tasks.
arXiv Detail & Related papers (2020-03-13T17:59:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.