Spatial-VLN: Zero-Shot Vision-and-Language Navigation With Explicit Spatial Perception and Exploration
- URL: http://arxiv.org/abs/2601.12766v1
- Date: Mon, 19 Jan 2026 06:53:02 GMT
- Title: Spatial-VLN: Zero-Shot Vision-and-Language Navigation With Explicit Spatial Perception and Exploration
- Authors: Lu Yue, Yue Fan, Shiwei Lian, Yu Zhao, Jiaxin Yu, Liang Xie, Feitian Zhang
- Abstract summary: Vision-and-Language Navigation (VLN) agents leveraging Large Language Models (LLMs) excel in generalization but suffer from insufficient spatial perception. We present Spatial-VLN, a perception-guided exploration framework designed to overcome these challenges.
- Score: 16.651645602449577
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zero-shot Vision-and-Language Navigation (VLN) agents leveraging Large Language Models (LLMs) excel in generalization but suffer from insufficient spatial perception. Focusing on complex continuous environments, we categorize key perceptual bottlenecks into three spatial challenges: door interaction, multi-room navigation, and ambiguous instruction execution, where existing methods consistently suffer high failure rates. We present Spatial-VLN, a perception-guided exploration framework designed to overcome these challenges. The framework consists of two main modules. The Spatial Perception Enhancement (SPE) module integrates panoramic filtering with specialized door and region experts to produce spatially coherent, cross-view consistent perceptual representations. Building on this foundation, our Explored Multi-expert Reasoning (EMR) module uses parallel LLM experts to address waypoint-level semantics and region-level spatial transitions. When discrepancies arise between expert predictions, a query-and-explore mechanism is activated, prompting the agent to actively probe critical areas and resolve perceptual ambiguities. Experiments on VLN-CE demonstrate that Spatial-VLN achieves state-of-the-art performance using only low-cost LLMs. Furthermore, to validate real-world applicability, we introduce a value-based waypoint sampling strategy that effectively bridges the Sim2Real gap. Extensive real-world evaluations confirm that our framework delivers superior generalization and robustness in complex environments. Our code and videos are available at https://yueluhhxx.github.io/Spatial-VLN-web/.
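The abstract walks through a discrepancy-triggered control flow (two parallel LLM experts, with a query-and-explore fallback) but gives no code. Below is a minimal illustrative sketch of that loop; every name in it (`emr_step`, `waypoint_expert`, `region_expert`, `probe_area`) is a hypothetical placeholder, not the authors' API:

```python
# Hypothetical sketch of the Explored Multi-expert Reasoning (EMR) loop the
# abstract describes: two parallel LLM "experts" vote on the next waypoint,
# and a disagreement triggers a query-and-explore probe. All names below are
# illustrative placeholders, not the paper's actual interface.
from typing import Callable, List

def emr_step(
    observation: str,
    candidates: List[str],
    waypoint_expert: Callable[[str, List[str]], str],  # waypoint-level semantics
    region_expert: Callable[[str, List[str]], str],    # region-level transitions
    probe_area: Callable[[str], str],                  # active exploration step
) -> str:
    """Pick the next waypoint; explore first if the experts disagree."""
    w_choice = waypoint_expert(observation, candidates)
    r_choice = region_expert(observation, candidates)
    if w_choice == r_choice:
        return w_choice
    # Discrepancy: actively probe the contested region, then re-query.
    extra_context = probe_area(r_choice)
    return waypoint_expert(observation + "\n[probe] " + extra_context, candidates)

# Toy usage with stub experts (real ones would be LLM calls):
pick_first = lambda obs, cands: cands[0]
pick_second = lambda obs, cands: cands[1]
probe = lambda region: f"explored {region}: it is a closed door"
print(emr_step("panorama summary", ["hallway", "bedroom"],
               pick_first, pick_second, probe))  # -> "hallway"
```

The point of the sketch is that exploration cost is only paid where the experts disagree, which is consistent with the abstract's claim that low-cost LLMs suffice.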
Related papers
- TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation [70.23578202012048]
Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to an inherent architectural mismatch. We propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. With the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction.
arXiv Detail & Related papers (2026-03-03T13:28:07Z)
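As a reading aid for the TagaVLM summary above, here is a minimal sketch of a topological graph memory serialized into text for a VLM backbone. The node/edge layout and prompt format are invented for illustration; the paper's actual Interleaved Navigation Prompt is not reproduced in this digest:

```python
# Toy topological navigation memory: nodes are visited viewpoints with scene
# captions, edges encode traversability, and the whole graph is flattened
# into text a VLM could consume. Format is assumed, not taken from the paper.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TopoGraph:
    captions: Dict[int, str] = field(default_factory=dict)    # node -> caption
    edges: Dict[int, List[int]] = field(default_factory=dict) # adjacency lists

    def add_node(self, node: int, caption: str) -> None:
        self.captions[node] = caption
        self.edges.setdefault(node, [])

    def add_edge(self, a: int, b: int) -> None:
        self.edges[a].append(b)
        self.edges[b].append(a)

    def to_prompt(self) -> str:
        """Flatten the graph into text for the model's context window."""
        lines = []
        for node, cap in self.captions.items():
            nbrs = ",".join(map(str, self.edges[node])) or "none"
            lines.append(f"node {node}: {cap} | reachable: {nbrs}")
        return "\n".join(lines)

g = TopoGraph()
g.add_node(0, "entry hallway with two doors")
g.add_node(1, "kitchen with an island counter")
g.add_edge(0, 1)
print(g.to_prompt())
```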
- APEX: A Decoupled Memory-based Explorer for Asynchronous Aerial Object Goal Navigation [26.546610806602803]
Aerial Object Goal Navigation, a challenging frontier in Embodied AI, requires an Unmanned Aerial Vehicle (UAV) agent to autonomously explore, reason, and identify a specific target using only visual perception and a language description. Existing methods struggle to memorize complex spatial representations in aerial environments, to make reliable and interpretable action decisions, and to explore and gather information efficiently. We introduce APEX, a novel hierarchical agent designed for efficient exploration and target acquisition in complex aerial settings.
arXiv Detail & Related papers (2026-01-31T06:27:57Z)
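The APEX summary emphasizes decoupling memory from decision-making. A toy sketch of that decoupling, assuming a shared spatial memory written by an asynchronous perception thread and read (as a snapshot) by the policy; all function names are placeholders, not APEX's API:

```python
# Illustrative-only sketch of a "decoupled memory-based explorer": perception
# folds observations into a persistent spatial memory while the decision
# policy only ever reads a snapshot of it, so deciding never blocks sensing.
from queue import Queue
from threading import Thread
import time

memory = {}  # grid cell -> last observed label

def perceive(frames: Queue) -> None:
    """Asynchronously fold incoming observations into the memory."""
    while True:
        item = frames.get()
        if item is None:
            return
        cell, label = item
        memory[cell] = label

def decide(goal: str) -> str:
    """Pick the next action from the current memory snapshot."""
    seen = dict(memory)  # snapshot: the policy never mutates shared state
    for cell, label in seen.items():
        if label == goal:
            return f"fly_to {cell}"
    return "explore_frontier"

frames = Queue()
Thread(target=perceive, args=(frames,), daemon=True).start()
frames.put(((3, 4), "rooftop antenna"))
time.sleep(0.1)  # let the perception thread drain the queue
print(decide("rooftop antenna"))  # -> fly_to (3, 4)
frames.put(None)  # stop the perception thread
```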
- SpatialNav: Leveraging Spatial Scene Graphs for Zero-Shot Vision-and-Language Navigation [48.17712857341527]
We introduce SpatialNav, a zero-shot vision-and-language navigation (VLN) agent that integrates an agent-centric spatial map, a compass-aligned visual representation, and a remote object localization strategy for efficient navigation. Experiments in both discrete and continuous environments demonstrate that SpatialNav significantly outperforms existing zero-shot agents and clearly narrows the gap with state-of-the-art learning-based methods.
arXiv Detail & Related papers (2026-01-11T08:39:19Z)
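The SpatialNav summary mentions a compass-aligned visual representation. A worked example of the underlying frame alignment, under the assumption that "compass-aligned" means storing observations by absolute bearing rather than by the agent's current heading:

```python
# Convert an egocentric bearing (relative to where the agent faces) into an
# absolute compass bearing, so map entries stay consistent as the agent turns.
# The paper's actual representation may differ; this is only the geometry.
import math

def compass_bearing(agent_heading_deg: float, egocentric_deg: float) -> float:
    """Absolute bearing of an observation, in [0, 360)."""
    return (agent_heading_deg + egocentric_deg) % 360.0

def to_map_offset(bearing_deg: float, distance_m: float) -> tuple:
    """North-aligned (dx, dy) offset for an agent-centric map."""
    rad = math.radians(bearing_deg)
    return (distance_m * math.sin(rad), distance_m * math.cos(rad))

# A door seen 30 degrees to the agent's left while it faces east (90 deg):
b = compass_bearing(90.0, -30.0)   # 60 deg, i.e. roughly north-east
print(b, to_map_offset(b, 2.5))
```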
- City Navigation in the Wild: Exploring Emergent Navigation from Web-Scale Knowledge in MLLMs [13.863236619171174]
The task is designed to evaluate the sequential decision-making abilities of MLLMs in challenging, knowledge-intensive real-world environments. We operationalize this task with CityNav, a benchmark encompassing four diverse global cities. Agents are required to rely solely on visual inputs and internal multimodal reasoning to sequentially navigate 50+ decision points. We propose Verbalization of Path (VoP), which explicitly grounds the agent's internal reasoning by probing an explicit cognitive map.
arXiv Detail & Related papers (2025-12-17T19:59:31Z)
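One plausible reading of Verbalization of Path (VoP) is that the agent restates its route in words before each decision, and that restatement is prepended to the action prompt. The sketch below implements only that reading; the prompt wording is invented, not quoted from the paper:

```python
# Hypothetical VoP-style prompting: the path so far is verbalized and fed
# back into the decision prompt, probing the agent's cognitive map in text.
def verbalize_path(visited: list[str]) -> str:
    if not visited:
        return "Route so far: no movement yet."
    return "Route so far: " + " -> ".join(visited) + "."

def build_decision_prompt(visited: list[str], observation: str,
                          options: list[str]) -> str:
    return "\n".join([
        verbalize_path(visited),  # probe the cognitive map before acting
        f"Current view: {observation}",
        "Options: " + ", ".join(options),
        "Which option best continues toward the goal? Answer with one option.",
    ])

print(build_decision_prompt(
    ["plaza", "bridge", "riverside street"],
    "a tram stop with a station sign",
    ["turn left", "go straight", "turn right"],
))
```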
- SkyMoE: A Vision-Language Foundation Model for Enhancing Geospatial Interpretation with Mixture of Experts [15.606672242024423]
We present SkyMoE, a vision-language model tailored for multimodal, multi-task remote sensing interpretation. SkyMoE employs an adaptive router that generates task- and granularity-aware routing instructions. Experiments on 21 public datasets demonstrate that SkyMoE achieves state-of-the-art performance across tasks.
arXiv Detail & Related papers (2025-12-02T08:17:16Z)
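The SkyMoE summary describes an adaptive router producing task- and granularity-aware routing decisions. Below is a generic top-k softmax gate of the kind such routers are commonly built from, with toy weights; nothing here comes from the SkyMoE model itself:

```python
# Generic mixture-of-experts gate: score every expert from a feature vector
# that encodes the task and spatial granularity, keep the top-k, and mix
# them with renormalized softmax weights. Weights/features are toy values.
import numpy as np

def route(features: np.ndarray, gate_w: np.ndarray, top_k: int = 2):
    """Return (expert indices, mixture weights) for one token/task."""
    logits = gate_w @ features               # one logit per expert
    idx = np.argsort(logits)[-top_k:]        # indices of the top-k experts
    w = np.exp(logits[idx] - logits[idx].max())
    return idx, w / w.sum()                  # renormalized softmax weights

rng = np.random.default_rng(0)
gate_w = rng.normal(size=(4, 6))   # 4 experts, 6-dim task/granularity features
feat = np.array([1, 0, 0, 0, 1, 0.5])  # e.g. [detection task, ..., fine-grained]
print(route(feat, gate_w))
```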
- ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models [68.46716645478661]
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content. Current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for evaluating multi-viewpoint spatial localization.
arXiv Detail & Related papers (2025-05-27T17:59:26Z)
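The egocentric/allocentric gap that ViewSpatial-Bench probes reduces to a change of reference frame. A worked 2-D version of that computation (the benchmark itself is image-based; this only shows the geometry behind the question type, with +x forward):

```python
# Re-express a point seen from the camera in another viewer's reference
# frame: camera frame -> world frame -> other frame, via planar rotations.
import math

def change_frame(point_cam, cam_pose, other_pose):
    """point_cam: (x, y) in the camera frame; poses: (x, y, heading_rad)."""
    cx, cy, ct = cam_pose
    ox, oy, ot = other_pose
    # Camera frame -> world frame.
    wx = cx + point_cam[0] * math.cos(ct) - point_cam[1] * math.sin(ct)
    wy = cy + point_cam[0] * math.sin(ct) + point_cam[1] * math.cos(ct)
    # World frame -> other viewer's frame (rotate by the negative heading).
    dx, dy = wx - ox, wy - oy
    return (dx * math.cos(-ot) - dy * math.sin(-ot),
            dx * math.sin(-ot) + dy * math.cos(-ot))

# An object 1 m ahead of the camera sits roughly 1 m ahead of a person
# standing past it at (2, 0) and facing back toward the camera:
print(change_frame((1.0, 0.0), (0, 0, 0.0), (2, 0, math.pi)))
```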
- SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding [64.15606979785355]
Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities?
arXiv Detail & Related papers (2025-05-22T17:59:03Z)
- Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains [92.36624674516553]
Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing the mathematical reasoning and coding performance of large language models (LLMs). We investigate the effectiveness and scalability of RLVR across diverse real-world domains including medicine, chemistry, psychology, economics, and education. We utilize a generative scoring technique that yields soft, model-based reward signals to overcome the limitations posed by binary verification.
arXiv Detail & Related papers (2025-03-31T08:22:49Z)
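The key move in the RLVR summary is replacing binary verification with a soft, generative reward. A hedged sketch of that interface, where `grade_with_llm` stands in for any chat-completion call and the 0-10 rubric is an assumption, not the paper's protocol:

```python
# Soft, model-based reward: prompt a generative grader for a 0-10 score,
# parse the first number in its reply, and squash it into [0, 1].
import re

def parse_soft_reward(grader_output: str) -> float:
    """Extract the first number in the grader's reply and map it to [0, 1]."""
    m = re.search(r"\d+(?:\.\d+)?", grader_output)
    if m is None:
        return 0.0  # unparseable reply -> no reward
    return min(max(float(m.group()) / 10.0, 0.0), 1.0)

def soft_reward(question: str, reference: str, answer: str, grade_with_llm) -> float:
    prompt = (f"Question: {question}\nReference answer: {reference}\n"
              f"Candidate answer: {answer}\nScore the candidate 0-10:")
    return parse_soft_reward(grade_with_llm(prompt))

# Stub grader for demonstration (a real one would call an LLM):
print(soft_reward("2+2?", "4", "four", lambda p: "Score: 8/10"))  # -> 0.8
```

The squash to [0, 1] is what lets free-form answers in domains like medicine or economics feed a standard RL objective where an exact-match verifier would return nothing.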
- Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Spatial Reasoning [36.588008658084895]
Vision language models (VLMs) perform well on many tasks but often fail at spatial reasoning. Our evaluation shows that state-of-the-art VLMs give implausible or incorrect answers to composite spatial problems. We enhance 2D spatial reasoning in VLMs by training them only on basic spatial capabilities.
arXiv Detail & Related papers (2024-10-21T16:26:09Z)
- SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models [68.13636352687257]
We introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities.
During inference, when provided with user-specified region proposals, SpatialRGPT can accurately perceive their relative directions and distances.
Our results demonstrate that SpatialRGPT significantly enhances performance in spatial reasoning tasks, both with and without local region prompts.
arXiv Detail & Related papers (2024-06-03T17:59:06Z)
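The SpatialRGPT summary says the model answers relative direction and distance queries for user-specified regions. The geometry such answers reduce to can be written down directly; this toy version substitutes known per-region depths and a pinhole camera for the model's learned perception:

```python
# Relative direction and metric distance between two image regions, given
# their boxes and (assumed known) depths, via pinhole back-projection.
# Camera intrinsics fx/cx are illustrative defaults, not from the paper.
import math

def region_relation(box_a, box_b, depth_a, depth_b, fx=500.0, cx=320.0):
    """Boxes are (x1, y1, x2, y2) in pixels; depths in meters."""
    ax = (box_a[0] + box_a[2]) / 2  # horizontal box centers
    bx = (box_b[0] + box_b[2]) / 2
    # Back-project horizontal pixel offsets to metric X coordinates.
    xa = (ax - cx) * depth_a / fx
    xb = (bx - cx) * depth_b / fx
    direction = "left of" if xb < xa else "right of"
    dist = math.hypot(xb - xa, depth_b - depth_a)
    return f"B is {direction} A, about {dist:.1f} m away"

print(region_relation((100, 80, 200, 300), (400, 90, 520, 310), 2.0, 3.5))
```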