Fast-SmartWay: Panoramic-Free End-to-End Zero-Shot Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2511.00933v1
- Date: Sun, 02 Nov 2025 13:21:54 GMT
- Title: Fast-SmartWay: Panoramic-Free End-to-End Zero-Shot Vision-and-Language Navigation
- Authors: Xiangyu Shi, Zerui Li, Yanyuan Qiao, Qi Wu
- Abstract summary: Fast-SmartWay is an end-to-end zero-shot VLN-CE framework that eliminates the need for panoramic views and waypoint predictors. Our approach uses only three frontal RGB-D images combined with natural language instructions, enabling MLLMs to directly predict actions.
- Score: 16.632191523127865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in Vision-and-Language Navigation in Continuous Environments (VLN-CE) have leveraged multimodal large language models (MLLMs) to achieve zero-shot navigation. However, existing methods often rely on panoramic observations and two-stage pipelines involving waypoint predictors, which introduce significant latency and limit real-world applicability. In this work, we propose Fast-SmartWay, an end-to-end zero-shot VLN-CE framework that eliminates the need for panoramic views and waypoint predictors. Our approach uses only three frontal RGB-D images combined with natural language instructions, enabling MLLMs to directly predict actions. To enhance decision robustness, we introduce an Uncertainty-Aware Reasoning module that integrates (i) a Disambiguation Module for avoiding local optima, and (ii) a Future-Past Bidirectional Reasoning mechanism for globally coherent planning. Experiments on both simulated and real-robot environments demonstrate that our method significantly reduces per-step latency while achieving competitive or superior performance compared to panoramic-view baselines. These results demonstrate the practicality and effectiveness of Fast-SmartWay for real-world zero-shot embodied navigation.
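Below is a minimal Python sketch of the panoramic-free decision loop described in the abstract: three frontal RGB-D frames and the instruction are handed to an MLLM, which directly outputs the next action with no waypoint predictor in between. The prompt format, action names, Observation fields, and the callable MLLM interface are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of Fast-SmartWay's panoramic-free action loop.
# The MLLM interface, prompt wording, and discrete action set are assumptions.
from dataclasses import dataclass
from typing import Callable, List, Tuple

Action = str  # assumed action vocabulary, e.g. "FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP"

@dataclass
class Observation:
    rgb_d_frames: List[Tuple[bytes, bytes]]  # three frontal RGB-D views (left, center, right)
    instruction: str                          # natural-language navigation instruction

def build_prompt(obs: Observation, history: List[Action]) -> str:
    """Compose one prompt from the instruction, the action history, and the frontal views."""
    return (
        f"Instruction: {obs.instruction}\n"
        f"Past actions: {', '.join(history) or 'none'}\n"
        "Given the three frontal RGB-D views, choose the next action."
    )

def navigate(mllm: Callable[[str, Observation], Action],
             get_observation: Callable[[], Observation],
             max_steps: int = 50) -> List[Action]:
    """End-to-end loop: no panoramic observation, no waypoint predictor.
    The MLLM directly predicts the next low-level action each step; in the paper
    this prediction is further filtered by an Uncertainty-Aware Reasoning module
    (disambiguation + future-past bidirectional reasoning), omitted here."""
    history: List[Action] = []
    for _ in range(max_steps):
        obs = get_observation()
        action = mllm(build_prompt(obs, history), obs)
        history.append(action)
        if action == "STOP":
            break
    return history
```

In this sketch the environment and model are passed in as callables, so the loop itself stays agnostic to the particular MLLM backend or simulator used.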
Related papers
- Unified World Models: Memory-Augmented Planning and Foresight for Visual Navigation [69.94565127141483]
Current approaches separate navigation planning from visual world modeling, leading to state-action misalignment and limited adaptability. We propose UniWM, a unified, memory-augmented world model integrating egocentric visual foresight and planning within a single multimodal autoregressive backbone. We show that UniWM substantially improves navigation success rates by up to 30%, significantly reduces trajectory errors compared to strong baselines, and exhibits impressive zero-shot generalization on the unseen TartanDrive dataset.
arXiv Detail & Related papers (2025-10-09T18:18:11Z) - VLN-Zero: Rapid Exploration and Cache-Enabled Neurosymbolic Vision-Language Planning for Zero-Shot Transfer in Robot Navigation [52.00474922315126]
We present VLN-Zero, a vision-language navigation framework for unseen environments. We use vision-language models to efficiently construct symbolic scene graphs and enable zero-shot neurosymbolic navigation. VLN-Zero achieves 2x higher success rate compared to state-of-the-art zero-shot models, outperforms most fine-tuned baselines, and reaches goal locations in half the time.
arXiv Detail & Related papers (2025-09-23T03:23:03Z) - DAgger Diffusion Navigation: DAgger Boosted Diffusion Policy for Vision-Language Navigation [73.80968452950854]
Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural language instructions through free-form 3D spaces. Existing VLN-CE approaches typically use a two-stage waypoint planning framework. We propose DAgger Diffusion Navigation (DifNav) as an end-to-end optimized VLN-CE policy.
arXiv Detail & Related papers (2025-08-13T02:51:43Z) - VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning [77.34267241692706]
Vision-Language Navigation (VLN) is a core challenge in embodied AI, requiring agents to navigate real-world environments using natural language instructions. We propose VLN-R1, an end-to-end framework that leverages Large Vision-Language Models (LVLM) to directly translate egocentric video streams into continuous navigation actions.
arXiv Detail & Related papers (2025-06-20T17:59:59Z) - A Navigation Framework Utilizing Vision-Language Models [0.0]
Vision-and-Language Navigation (VLN) presents a complex challenge in embodied AI. Recent advances in large vision-language models (LVLMs), such as CLIP and Flamingo, have significantly improved multimodal understanding. We propose a modular, plug-and-play navigation framework that decouples vision-language understanding from action planning.
arXiv Detail & Related papers (2025-06-11T20:51:58Z) - SmartWay: Enhanced Waypoint Prediction and Backtracking for Zero-Shot Vision-and-Language Navigation [12.152477445938759]
Vision-and-Language Navigation (VLN) in continuous environments requires agents to interpret natural language instructions while navigating unconstrained 3D spaces. Existing VLN-CE frameworks rely on a two-stage approach: a waypoint predictor to generate waypoints and a navigator to execute movements. We propose a zero-shot VLN-CE framework integrating an enhanced waypoint predictor with a Multi-modal Large Language Model (MLLM)-based navigator.
arXiv Detail & Related papers (2025-03-13T05:32:57Z) - Mem2Ego: Empowering Vision-Language Models with Global-to-Ego Memory for Long-Horizon Embodied Navigation [35.71602601385161]
We present a novel vision-language model (VLM)-based navigation framework. Our approach enhances spatial reasoning and decision-making in long-horizon tasks. Experimental results demonstrate that the proposed method surpasses previous state-of-the-art approaches in object navigation tasks.
arXiv Detail & Related papers (2025-02-20T04:41:40Z) - UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation [71.97405667493477]
We introduce a novel, generalizable 3DGS-based pre-training paradigm, called UnitedVLN. It enables agents to better explore future environments by unitedly rendering high-fidelity 360° visual images and semantic features. UnitedVLN outperforms state-of-the-art methods on existing VLN-CE benchmarks.
arXiv Detail & Related papers (2024-11-25T02:44:59Z) - NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning [97.88246428240872]
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions. Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability. This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), where we fulfill parameter-efficient in-domain training to enable self-guided navigational decision-making.
arXiv Detail & Related papers (2024-03-12T07:27:02Z)