VLD: Visual Language Goal Distance for Reinforcement Learning Navigation
- URL: http://arxiv.org/abs/2512.07976v1
- Date: Mon, 08 Dec 2025 19:05:51 GMT
- Title: VLD: Visual Language Goal Distance for Reinforcement Learning Navigation
- Authors: Lazar Milikic, Manthan Patel, Jonas Frey,
- Abstract summary: We introduce Vision-Language Distance (VLD) learning, a framework for goal-conditioned navigation. We first train a self-supervised distance-to-goal predictor on internet-scale video data. This predictor generalizes across both image- and text-based goals, providing a distance signal that can be minimized by a reinforcement learning policy.
- Score: 5.225089020389076
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training end-to-end policies from image data to directly predict navigation actions for robotic systems has proven inherently difficult. Existing approaches often suffer from either the sim-to-real gap during policy transfer or a limited amount of training data with action labels. To address this problem, we introduce Vision-Language Distance (VLD) learning, a scalable framework for goal-conditioned navigation that decouples perception learning from policy learning. Instead of relying on raw sensory inputs during policy training, we first train a self-supervised distance-to-goal predictor on internet-scale video data. This predictor generalizes across both image- and text-based goals, providing a distance signal that can be minimized by a reinforcement learning (RL) policy. The RL policy can be trained entirely in simulation using privileged geometric distance signals, with injected noise to mimic the uncertainty of the trained distance predictor. At deployment, the policy consumes VLD predictions, inheriting semantic goal information ("where to go") from large-scale visual training while retaining the robust low-level navigation behaviors learned in simulation. We propose using ordinal consistency to assess distance functions directly and demonstrate that VLD outperforms prior temporal distance approaches, such as ViNT and VIP. Experiments show that our decoupled design achieves competitive navigation performance in simulation while supporting flexible goal modalities, providing an alternative and, most importantly, scalable path toward reliable, multimodal navigation policies.
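The ordinal-consistency idea from the abstract can be illustrated with a minimal sketch: along a trajectory that reaches the goal, the true distance-to-goal decreases at every step, so a predictor is ordinally consistent to the extent that its predicted distances preserve that ordering. The pairwise-accuracy formulation and the function name below are illustrative assumptions, not the paper's exact definition.

```python
import itertools

def ordinal_consistency(pred_dists):
    """Fraction of frame pairs whose predicted distances-to-goal
    preserve the ground-truth temporal order.

    pred_dists[t] is the predicted distance at step t of a trajectory
    that ends at the goal, so the true ordering is
    d(0) > d(1) > ... > d(T). A perfect predictor scores 1.0; a
    predictor that inverts every pair scores 0.0.
    """
    pairs = list(itertools.combinations(range(len(pred_dists)), 2))
    correct = sum(1 for i, j in pairs if pred_dists[i] > pred_dists[j])
    return correct / len(pairs)

# A strictly decreasing prediction is perfectly ordered:
print(ordinal_consistency([5.0, 4.0, 3.0, 2.0, 1.0]))  # 1.0
# A prediction that increases toward the goal is fully inverted:
print(ordinal_consistency([1.0, 2.0, 3.0]))  # 0.0
```

A rank-based metric like this sidesteps the scale ambiguity of learned temporal distances: the RL policy only needs the predictor to get the ordering right, not the absolute magnitudes.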
Related papers
- VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory [43.2995099083993]
VLA models have shown promising potential in embodied navigation by unifying perception and planning. Most existing VLA models rely on reactive mappings directly from observations to actions. We propose VLingNav, a VLA model for embodied navigation grounded in linguistic-driven cognition.
arXiv Detail & Related papers (2026-01-13T15:43:43Z) - DAgger Diffusion Navigation: DAgger Boosted Diffusion Policy for Vision-Language Navigation [73.80968452950854]
Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural language instructions through free-form 3D spaces. Existing VLN-CE approaches typically use a two-stage waypoint planning framework. We propose DAgger Diffusion Navigation (DifNav) as an end-to-end optimized VLN-CE policy.
arXiv Detail & Related papers (2025-08-13T02:51:43Z) - VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning [77.34267241692706]
Vision-Language Navigation (VLN) is a core challenge in embodied AI, requiring agents to navigate real-world environments using natural language instructions. We propose VLN-R1, an end-to-end framework that leverages Large Vision-Language Models (LVLM) to directly translate egocentric video streams into continuous navigation actions.
arXiv Detail & Related papers (2025-06-20T17:59:59Z) - RSRNav: Reasoning Spatial Relationship for Image-Goal Navigation [57.197881161006904]
Recent image-goal navigation (ImageNav) methods learn a perception-action policy by separately capturing semantic features of the goal and egocentric images. We propose RSRNav, a simple yet effective method that reasons spatial relationships between the goal and current observations as navigation guidance.
arXiv Detail & Related papers (2025-04-25T00:22:17Z) - RAPID: Robust and Agile Planner Using Inverse Reinforcement Learning for Vision-Based Drone Navigation [9.25068777307471]
This paper introduces a learning-based visual planner for agile drone flight in cluttered environments. The proposed planner generates collision-free waypoints in milliseconds, enabling drones to perform agile maneuvers in complex environments without building separate perception, mapping, and planning modules.
arXiv Detail & Related papers (2025-02-04T06:42:08Z) - MPVO: Motion-Prior based Visual Odometry for PointGoal Navigation [3.9974562667271507]
Visual odometry (VO) is essential for enabling accurate point-goal navigation of embodied agents in indoor environments.
Recent deep-learned VO methods show robust performance but suffer from sample inefficiency during training.
We propose a robust and sample-efficient VO pipeline based on motion priors available while an agent is navigating an environment.
arXiv Detail & Related papers (2024-11-07T15:36:49Z) - Towards Scale Consistent Monocular Visual Odometry by Learning from the Virtual World [83.36195426897768]
We propose VRVO, a novel framework for retrieving the absolute scale from virtual data.
We first train a scale-aware disparity network using both monocular real images and stereo virtual data.
The resulting scale-consistent disparities are then integrated with a direct VO system.
arXiv Detail & Related papers (2022-03-11T01:51:54Z) - Learning for Visual Navigation by Imagining the Success [66.99810227193196]
We propose to learn to imagine a latent representation of the successful (sub-)goal state.
ForeSIT is trained to imagine the recurrent latent representation of a future state that leads to success.
We develop an efficient learning algorithm to train ForeSIT in an on-policy manner and integrate it into our RL objective.
arXiv Detail & Related papers (2021-02-28T10:25:46Z) - An A* Curriculum Approach to Reinforcement Learning for RGBD Indoor Robot Navigation [6.660458629649825]
Recently released photo-realistic simulators such as Habitat allow for the training of networks that output control actions directly from perception.
Our paper tries to overcome this problem by separating the training of the perception and control neural nets and increasing the path complexity gradually.
arXiv Detail & Related papers (2021-01-05T20:35:14Z) - Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling [65.99956848461915]
Vision-and-Language Navigation (VLN) is a task where agents must decide how to move through a 3D environment to reach a goal. One of the problems of the VLN task is data scarcity, since it is difficult to collect enough navigation paths with human-annotated instructions for interactive environments. We propose an adversarial-driven counterfactual reasoning model that can consider effective conditions instead of low-quality augmented data.
arXiv Detail & Related papers (2019-11-17T18:02:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.