STRIDER: Navigation via Instruction-Aligned Structural Decision Space Optimization
- URL: http://arxiv.org/abs/2511.00033v1
- Date: Mon, 27 Oct 2025 04:37:21 GMT
- Title: STRIDER: Navigation via Instruction-Aligned Structural Decision Space Optimization
- Authors: Diqi He, Xuehao Gao, Hao Li, Junwei Han, Dingwen Zhang
- Abstract summary: The VLN-CE task requires agents to navigate 3D environments using natural language instructions, without any scene-specific training. Existing methods often fail to achieve robust navigation due to a lack of structured decision-making and insufficient integration of feedback from previous actions. We propose STRIDER, a novel framework that systematically optimizes the agent's decision space by integrating spatial layout priors and dynamic task feedback. Our approach introduces two key innovations: 1) a Structured Waypoint Generator that constrains the action space through spatial structure, and 2) a Task-Alignment Regulator that adjusts behavior based on task progress, ensuring semantic alignment throughout navigation.
- Score: 73.98141357780032
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Zero-shot Vision-and-Language Navigation in Continuous Environments (VLN-CE) task requires agents to navigate previously unseen 3D environments using natural language instructions, without any scene-specific training. A critical challenge in this setting lies in ensuring agents' actions align with both spatial structure and task intent over long-horizon execution. Existing methods often fail to achieve robust navigation due to a lack of structured decision-making and insufficient integration of feedback from previous actions. To address these challenges, we propose STRIDER (Instruction-Aligned Structural Decision Space Optimization), a novel framework that systematically optimizes the agent's decision space by integrating spatial layout priors and dynamic task feedback. Our approach introduces two key innovations: 1) a Structured Waypoint Generator that constrains the action space through spatial structure, and 2) a Task-Alignment Regulator that adjusts behavior based on task progress, ensuring semantic alignment throughout navigation. Extensive experiments on the R2R-CE and RxR-CE benchmarks demonstrate that STRIDER significantly outperforms strong SOTA methods across key metrics; in particular, it improves Success Rate (SR) from 29% to 35%, a relative gain of 20.7%. These results highlight the importance of spatially constrained decision-making and feedback-guided execution in improving navigation fidelity for zero-shot VLN-CE.
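The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the function names, the clearance-based pruning rule, and the progress-weighted goal-distance score are all assumptions standing in for the paper's Structured Waypoint Generator and Task-Alignment Regulator.

```python
import math

def structured_waypoint_generator(candidates, obstacles, clearance=0.5):
    """Constrain the action space with a spatial prior: keep only candidate
    waypoints whose distance to the nearest obstacle exceeds a clearance
    threshold. (Hypothetical stand-in for the paper's module.)"""
    return [wp for wp in candidates
            if min(math.dist(wp, obs) for obs in obstacles) > clearance]

def task_alignment_regulator(waypoints, goal, progress):
    """Re-score the surviving waypoints using task feedback: the further
    along the instruction the agent is (progress in [0, 1]), the more
    strongly goal-directed moves are favored. (Also a hypothetical rule.)"""
    def score(wp):
        return -(1.0 + progress) * math.dist(wp, goal)
    return max(waypoints, key=score)

# Toy step in 2D: three candidate waypoints, one obstacle near (1.1, 0.1).
candidates = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
obstacles = [(1.1, 0.1)]
goal = (2.0, 2.0)

free = structured_waypoint_generator(candidates, obstacles)
chosen = task_alignment_regulator(free, goal, progress=0.6)
print(free)    # (1.0, 0.0) is pruned: too close to the obstacle
print(chosen)  # the remaining waypoint closest to the goal
```

The point of the sketch is the ordering: spatial structure first prunes the decision space, and only then does task-progress feedback select among the waypoints that survive.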
Related papers
- RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning [61.84363374647606]
Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions. These descriptions often rely heavily on positional cues, posing unique challenges for Multimodal Large Language Models (MLLMs) in spatial reasoning. We propose a reasoning-guided, position-aware post-training framework, dubbed RSGround-R1, to progressively enhance spatial understanding.
arXiv Detail & Related papers (2026-01-29T12:35:57Z) - Nav-R1: Reasoning and Navigation in Embodied Scenes [16.10022718760368]
Embodied navigation requires agents to integrate perception, reasoning, and action. Existing approaches often suffer from incoherent and unstable reasoning traces. We propose Nav-R1, an embodied foundation model that unifies reasoning in embodied environments.
arXiv Detail & Related papers (2025-09-13T16:31:03Z) - GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation [61.34589819350429]
We propose a training-free framework for vision-and-language navigation (VLN). Our framework formulates navigation guidance as graph constraint optimization by decomposing instructions into explicit spatial constraints. It can effectively generalize to new environments and instruction sets, paving the way for a more robust and autonomous navigation framework.
arXiv Detail & Related papers (2025-09-12T17:59:58Z) - DAgger Diffusion Navigation: DAgger Boosted Diffusion Policy for Vision-Language Navigation [73.80968452950854]
Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural language instructions through free-form 3D spaces. Existing VLN-CE approaches typically use a two-stage waypoint planning framework. We propose DAgger Diffusion Navigation (DifNav) as an end-to-end optimized VLN-CE policy.
arXiv Detail & Related papers (2025-08-13T02:51:43Z) - Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents [43.5771856761934]
Vision-and-Language Navigation (VLN) poses significant challenges for agents, which must interpret natural language instructions and navigate complex 3D environments. We propose SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN agents.
arXiv Detail & Related papers (2025-08-11T05:50:30Z) - Generating Vision-Language Navigation Instructions Incorporated Fine-Grained Alignment Annotations [4.483463511271561]
Vision-Language Navigation (VLN) enables intelligent agents to navigate environments by integrating visual perception and natural language instructions. Existing datasets primarily focus on global instruction-trajectory matching, neglecting sub-instruction-level and entity-level alignments. We propose FCA-NIG, a generative framework that automatically constructs navigation instructions with dual-level fine-grained cross-modal annotations.
arXiv Detail & Related papers (2025-06-10T08:36:51Z) - Cog-GA: A Large Language Models-based Generative Agent for Vision-Language Navigation in Continuous Environments [19.818370526976974]
Vision Language Navigation in Continuous Environments (VLN-CE) represents a frontier in embodied AI.
We introduce Cog-GA, a generative agent founded on large language models (LLMs) tailored for VLN-CE tasks.
Cog-GA employs a dual-pronged strategy to emulate human-like cognitive processes.
arXiv Detail & Related papers (2024-09-04T08:30:03Z) - NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning [97.88246428240872]
Vision-and-Language Navigation (VLN), a crucial research problem in Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions. Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability. This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), which performs parameter-efficient in-domain training to enable self-guided navigational decisions.
arXiv Detail & Related papers (2024-03-12T07:27:02Z) - CorNav: Autonomous Agent with Self-Corrected Planning for Zero-Shot Vision-and-Language Navigation [73.78984332354636]
CorNav is a novel zero-shot framework for vision-and-language navigation.
It incorporates environmental feedback for refining future plans and adjusting its actions.
It consistently outperforms all baselines in a zero-shot multi-task setting.
arXiv Detail & Related papers (2023-06-17T11:44:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.