Related papers: DAgger Diffusion Navigation: DAgger Boosted Diffusion Policy for Vision-Language Navigation

URL: http://arxiv.org/abs/2508.09444v1
Date: Wed, 13 Aug 2025 02:51:43 GMT
Title: DAgger Diffusion Navigation: DAgger Boosted Diffusion Policy for Vision-Language Navigation
Authors: Haoxiang Shi, Xiang Deng, Zaijing Li, Gongwei Chen, Yaowei Wang, Liqiang Nie
Abstract summary: Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural language instructions through free-form 3D spaces. Existing VLN-CE approaches typically use a two-stage waypoint planning framework. We propose DAgger Diffusion Navigation (DifNav) as an end-to-end optimized VLN-CE policy.
Score: 73.80968452950854
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural language instructions through free-form 3D spaces. Existing VLN-CE approaches typically use a two-stage waypoint planning framework, where a high-level waypoint predictor generates the navigable waypoints, and then a navigation planner suggests the intermediate goals in the high-level action space. However, this two-stage decomposition framework suffers from: (1) global sub-optimization due to the proxy objective in each stage, and (2) a performance bottleneck caused by the strong reliance on the quality of the first-stage predicted waypoints. To address these limitations, we propose DAgger Diffusion Navigation (DifNav), an end-to-end optimized VLN-CE policy that unifies the traditional two stages, i.e. waypoint generation and planning, into a single diffusion policy. Notably, DifNav employs a conditional diffusion policy to directly model multi-modal action distributions over future actions in continuous navigation space, eliminating the need for a waypoint predictor while enabling the agent to capture multiple possible instruction-following behaviors. To address the issues of compounding error in imitation learning and enhance spatial reasoning in long-horizon navigation tasks, we employ DAgger for online policy training and expert trajectory augmentation, and use the aggregated data to further fine-tune the policy. This approach significantly improves the policy's robustness and its ability to recover from error states. Extensive experiments on benchmark datasets demonstrate that, even without a waypoint predictor, the proposed method substantially outperforms previous state-of-the-art two-stage waypoint-based models in terms of navigation performance. Our code is available at: https://github.com/Tokishx/DifNav.
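
The DAgger step described in the abstract (roll out the current policy, have an expert relabel the visited states, aggregate the data, retrain) can be sketched on a toy problem. This is a minimal illustration, not DifNav's implementation: the 1-D line world, the tabular majority-vote learner standing in for the diffusion policy, the expert-first mixing schedule, and all names (`expert_action`, `fit_policy`, `rollout`, `dagger`) are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict

GOAL = 0  # hypothetical goal position on a 1-D line


def expert_action(state):
    """Hypothetical expert: step toward the goal, stay put once there."""
    if state == GOAL:
        return 0
    return -1 if state > GOAL else 1


def fit_policy(dataset):
    """Fit a tabular policy by majority vote over the labels seen per state."""
    votes = defaultdict(Counter)
    for state, action in dataset:
        votes[state][action] += 1
    table = {s: c.most_common(1)[0][0] for s, c in votes.items()}
    # States never seen in training fall back to a fixed action (+1).
    return lambda s: table.get(s, 1)


def rollout(policy, start, beta, rng, horizon=20):
    """Act with the expert with probability beta, else the learner.

    Returns the list of visited states.
    """
    state, visited = start, []
    for _ in range(horizon):
        visited.append(state)
        act = expert_action(state) if rng.random() < beta else policy(state)
        state += act
    return visited


def dagger(starts, iterations=5, seed=0):
    """Toy DAgger loop: collect, relabel with the expert, aggregate, refit."""
    rng = random.Random(seed)
    dataset = []
    policy = fit_policy(dataset)
    for i in range(iterations):
        beta = 1.0 if i == 0 else 0.0  # expert drives first, learner afterwards
        for start in starts:
            for s in rollout(policy, start, beta, rng):
                dataset.append((s, expert_action(s)))  # expert relabels states
        policy = fit_policy(dataset)  # retrain on the aggregated dataset
    return policy
```

Calling `dagger(starts=[-5, 5])` returns a policy that agrees with the expert on every state it visited. In a realistic setting the later iterations matter because the learner drifts into states the expert demonstrations never covered, and the relabeled data from those states is what teaches recovery.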

Related papers

ReasonNavi: Human-Inspired Global Map Reasoning for Zero-Shot Embodied Navigation [53.95797153529148]
Embodied agents often struggle with efficient navigation because they rely primarily on partial egocentric observations. We introduce ReasonNavi, a human-inspired framework that operationalizes a reason-then-act paradigm by coupling Multimodal Large Language Models (MLLMs) with deterministic planners.
arXiv Detail & Related papers (2026-01-26T19:09:20Z)
VLD: Visual Language Goal Distance for Reinforcement Learning Navigation [5.225089020389076]
We introduce Vision-Language Distance (VLD) learning, a framework for goal-conditioned navigation. We first train a self-supervised distance-to-goal predictor on internet-scale video data. This predictor generalizes across both image- and text-based goals, providing a distance signal that can be minimized by a reinforcement learning policy.
arXiv Detail & Related papers (2025-12-08T19:05:51Z)
Nav-$R^2$ Dual-Relation Reasoning for Generalizable Open-Vocabulary Object-Goal Navigation [67.68165784193556]
Nav-$R^2$ is a framework that explicitly models two types of relationships, target-environment modeling and environment-action planning. Our SA-Mem preserves the most target-relevant and current observation-relevant features from both temporal and semantic perspectives. Nav-$R^2$ achieves state-of-the-art performance in localizing unseen objects through a streamlined and efficient pipeline.
arXiv Detail & Related papers (2025-12-02T04:21:02Z)
SmartWay: Enhanced Waypoint Prediction and Backtracking for Zero-Shot Vision-and-Language Navigation [12.152477445938759]
Vision-and-Language Navigation (VLN) in continuous environments requires agents to interpret natural language instructions while navigating unconstrained 3D spaces. Existing VLN-CE frameworks rely on a two-stage approach: a waypoint predictor to generate waypoints and a navigator to execute movements. We propose a zero-shot VLN-CE framework integrating an enhanced waypoint predictor with a Multi-modal Large Language Model (MLLM)-based navigator.
arXiv Detail & Related papers (2025-03-13T05:32:57Z)
PRET: Planning with Directed Fidelity Trajectory for Vision and Language Navigation [30.710806048991923]
Vision and language navigation is a task that requires an agent to navigate according to a natural language instruction. Recent methods predict sub-goals on a constructed topological map at each step to enable long-term action planning. We propose an alternative method that facilitates navigation planning by considering the alignment between instructions and directed fidelity trajectories.
arXiv Detail & Related papers (2024-07-16T08:22:18Z)
Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation [64.84996994779443]
We propose a novel Affordances-Oriented Planner for the continuous vision-language navigation (VLN) task. Our AO-Planner integrates various foundation models to achieve affordances-oriented low-level motion planning and high-level decision-making. Experiments on the challenging R2R-CE and RxR-CE datasets show that AO-Planner achieves state-of-the-art zero-shot performance.
arXiv Detail & Related papers (2024-07-08T12:52:46Z)
Versatile Navigation under Partial Observability via Value-guided Diffusion Policy [14.967107015417943]
We propose a versatile diffusion-based approach for both 2D and 3D route planning under partial observability. Specifically, our value-guided diffusion policy first generates plans to predict actions across various timesteps. We then employ a differentiable planner with state estimations to derive a value function, directing the agent's exploration and goal-seeking behaviors.
arXiv Detail & Related papers (2024-04-01T19:52:08Z)
NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning [97.88246428240872]
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions. Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability. This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), where we fulfill parameter-efficient in-domain training to enable self-guided navigational decision.
arXiv Detail & Related papers (2024-03-12T07:27:02Z)
NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration [57.15811390835294]
This paper describes how we can train a single unified diffusion policy to handle both goal-directed navigation and goal-agnostic exploration. We show that this unified policy results in better overall performance when navigating to visually indicated goals in novel environments. Our experiments, conducted on a real-world mobile robot platform, show effective navigation in unseen environments in comparison with five alternative methods.
arXiv Detail & Related papers (2023-10-11T21:07:14Z)
ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments [56.194988818341976]
Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments. We propose ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability of obstacle-avoiding control in continuous environments. ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on R2R-CE and RxR-CE datasets.
arXiv Detail & Related papers (2023-04-06T13:07:17Z)
Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation [41.334731014665316]
Most existing works in vision-and-language navigation (VLN) focus on either discrete or continuous environments. We propose a predictor to generate a set of candidate waypoints during navigation. We show that agents navigating in continuous environments with predicted waypoints perform significantly better than agents using low-level actions.
arXiv Detail & Related papers (2022-03-05T14:56:14Z)

This list is automatically generated from the titles and abstracts of the papers on this site.