VLN-Zero: Rapid Exploration and Cache-Enabled Neurosymbolic Vision-Language Planning for Zero-Shot Transfer in Robot Navigation
- URL: http://arxiv.org/abs/2509.18592v1
- Date: Tue, 23 Sep 2025 03:23:03 GMT
- Title: VLN-Zero: Rapid Exploration and Cache-Enabled Neurosymbolic Vision-Language Planning for Zero-Shot Transfer in Robot Navigation
- Authors: Neel P. Bhatt, Yunhao Yang, Rohan Siva, Pranay Samineni, Daniel Milan, Zhangyang Wang, Ufuk Topcu
- Abstract summary: We present VLN-Zero, a vision-language navigation framework for unseen environments. We use vision-language models to efficiently construct symbolic scene graphs and enable zero-shot neurosymbolic navigation. VLN-Zero achieves 2x higher success rate compared to state-of-the-art zero-shot models, outperforms most fine-tuned baselines, and reaches goal locations in half the time.
- Score: 52.00474922315126
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Rapid adaptation in unseen environments is essential for scalable real-world autonomy, yet existing approaches rely on exhaustive exploration or rigid navigation policies that fail to generalize. We present VLN-Zero, a two-phase vision-language navigation framework that leverages vision-language models to efficiently construct symbolic scene graphs and enable zero-shot neurosymbolic navigation. In the exploration phase, structured prompts guide VLM-based search toward informative and diverse trajectories, yielding compact scene graph representations. In the deployment phase, a neurosymbolic planner reasons over the scene graph and environmental observations to generate executable plans, while a cache-enabled execution module accelerates adaptation by reusing previously computed task-location trajectories. By combining rapid exploration, symbolic reasoning, and cache-enabled execution, the proposed framework overcomes the computational inefficiency and poor generalization of prior vision-language navigation methods, enabling robust and scalable decision-making in unseen environments. VLN-Zero achieves 2x higher success rate compared to state-of-the-art zero-shot models, outperforms most fine-tuned baselines, and reaches goal locations in half the time with 55% fewer VLM calls on average compared to state-of-the-art models across diverse environments. Codebase, datasets, and videos for VLN-Zero are available at: https://vln-zero.github.io/.
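The abstract describes a concrete architecture: a symbolic scene graph built during exploration, a neurosymbolic planner that reasons over it at deployment, and a cache that reuses previously computed task-location trajectories. The sketch below is a minimal illustration of that idea only; the class names, the graph layout, the breadth-first search standing in for the planner, and the (goal, start-location) cache key are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of the scene-graph + cache-enabled-execution idea from the abstract.
# Data structures, cache key, and planning routine are illustrative assumptions.
from collections import deque

class SceneGraph:
    """Symbolic scene graph: nodes are labeled locations, edges are traversable links."""
    def __init__(self):
        self.labels = {}   # node_id -> semantic label, e.g. "kitchen"
        self.edges = {}    # node_id -> set of adjacent node_ids

    def add_node(self, node_id, label):
        self.labels[node_id] = label
        self.edges.setdefault(node_id, set())

    def add_edge(self, a, b):
        self.edges.setdefault(a, set()).add(b)
        self.edges.setdefault(b, set()).add(a)

    def shortest_path(self, start, goal_label):
        """Breadth-first search to a node whose label matches the goal
        (a stand-in for the neurosymbolic planner's reasoning step)."""
        frontier = deque([[start]])
        visited = {start}
        while frontier:
            path = frontier.popleft()
            node = path[-1]
            if self.labels.get(node) == goal_label:
                return path
            for nxt in self.edges.get(node, ()):
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append(path + [nxt])
        return None

class CachedPlanner:
    """Cache-enabled execution: reuse a plan previously computed for the same
    (goal, start location) pair instead of re-invoking the planner/VLM."""
    def __init__(self, graph):
        self.graph = graph
        self.cache = {}    # (goal_label, start_node) -> plan

    def plan(self, start_node, goal_label):
        key = (goal_label, start_node)
        if key in self.cache:          # cache hit: no planner call needed
            return self.cache[key]
        plan = self.graph.shortest_path(start_node, goal_label)
        if plan is not None:
            self.cache[key] = plan
        return plan

# Usage: a toy graph built during "exploration", then two deployment-time queries;
# the second query is served from the cache.
g = SceneGraph()
for nid, lbl in [(0, "hallway"), (1, "kitchen"), (2, "office")]:
    g.add_node(nid, lbl)
g.add_edge(0, 1)
g.add_edge(0, 2)

planner = CachedPlanner(g)
print(planner.plan(0, "kitchen"))   # planned: [0, 1]
print(planner.plan(0, "kitchen"))   # reused from cache
```

The cache is the mechanism the abstract credits for reduced VLM usage: a repeated task from the same location returns immediately instead of triggering another round of planner or VLM queries.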
Related papers
- SpatialNav: Leveraging Spatial Scene Graphs for Zero-Shot Vision-and-Language Navigation [48.17712857341527]
We introduce a zero-shot vision-and-language navigation (VLN) agent that integrates an agent-centric spatial map, a compass-aligned visual representation, and a remote object localization strategy for efficient navigation. Experiments in both discrete and continuous environments demonstrate that SpatialNav significantly outperforms existing zero-shot agents and clearly narrows the gap with state-of-the-art learning-based methods.
arXiv Detail & Related papers (2026-01-11T08:39:19Z)
- Fast-SmartWay: Panoramic-Free End-to-End Zero-Shot Vision-and-Language Navigation [16.632191523127865]
Fast-SmartWay is an end-to-end zero-shot VLN-CE framework that eliminates the need for panoramic views and waypoint predictors. Our approach uses only three frontal RGB-D images combined with natural language instructions, enabling MLLMs to directly predict actions.
arXiv Detail & Related papers (2025-11-02T13:21:54Z)
- Boosting Zero-Shot VLN via Abstract Obstacle Map-Based Waypoint Prediction with TopoGraph-and-VisitInfo-Aware Prompting [18.325003967982827]
Vision-language navigation (VLN) has emerged as a key task for embodied agents with broad practical applications. We propose a zero-shot framework that integrates a simplified yet effective waypoint predictor with a multimodal large language model (MLLM). Experiments on R2R-CE and RxR-CE show that our method achieves state-of-the-art zero-shot performance, with success rates of 41% and 36%, respectively.
arXiv Detail & Related papers (2025-09-24T19:21:39Z)
- GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation [61.34589819350429]
We propose a training-free framework for vision-and-language navigation (VLN). Our framework formulates navigation guidance as graph constraint optimization by decomposing instructions into explicit spatial constraints. Our framework can effectively generalize to new environments and instruction sets, paving the way for a more robust and autonomous navigation framework.
arXiv Detail & Related papers (2025-09-12T17:59:58Z)
- VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning [77.34267241692706]
Vision-Language Navigation (VLN) is a core challenge in embodied AI, requiring agents to navigate real-world environments using natural language instructions. We propose VLN-R1, an end-to-end framework that leverages Large Vision-Language Models (LVLM) to directly translate egocentric video streams into continuous navigation actions.
arXiv Detail & Related papers (2025-06-20T17:59:59Z)
- UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation [71.97405667493477]
We introduce a novel, generalizable 3DGS-based pre-training paradigm, called UnitedVLN. It enables agents to better explore future environments by unitedly rendering high-fidelity 360° visual images and semantic features. UnitedVLN outperforms state-of-the-art methods on existing VLN-CE benchmarks.
arXiv Detail & Related papers (2024-11-25T02:44:59Z)
- Structured Scene Memory for Vision-Language Navigation [155.63025602722712]
We propose a crucial architecture for vision-language navigation (VLN): a structured scene memory that is compartmentalized enough to accurately memorize the percepts during navigation, and that also serves as a structured scene representation, capturing and disentangling visual and geometric cues in the environment.
arXiv Detail & Related papers (2021-03-05T03:41:00Z)