OpenFrontier: General Navigation with Visual-Language Grounded Frontiers
- URL: http://arxiv.org/abs/2603.05377v1
- Date: Thu, 05 Mar 2026 17:02:22 GMT
- Title: OpenFrontier: General Navigation with Visual-Language Grounded Frontiers
- Authors: Esteban Padilla, Boyang Sun, Marc Pollefeys, Hermann Blum
- Abstract summary: Open-world navigation requires robots to make decisions in complex everyday environments. Recent advances in vision--language navigation (VLN) and vision--language--action (VLA) models enable end-to-end policies conditioned on natural language. We propose OpenFrontier, a training-free navigation framework that seamlessly integrates diverse vision--language prior models.
- Score: 54.661157616245966
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-world navigation requires robots to make decisions in complex everyday environments while adapting to flexible task requirements. Conventional navigation approaches often rely on dense 3D reconstruction and hand-crafted goal metrics, which limits their generalization across tasks and environments. Recent advances in vision--language navigation (VLN) and vision--language--action (VLA) models enable end-to-end policies conditioned on natural language, but typically require interactive training, large-scale data collection, or task-specific fine-tuning with a mobile agent. We formulate navigation as a sparse subgoal identification and reaching problem and observe that providing visual anchoring targets for high-level semantic priors enables highly efficient goal-conditioned navigation. Based on this insight, we select navigation frontiers as semantic anchors and propose OpenFrontier, a training-free navigation framework that seamlessly integrates diverse vision--language prior models. OpenFrontier enables efficient navigation with a lightweight system design, without dense 3D mapping, policy training, or model fine-tuning. We evaluate OpenFrontier across multiple navigation benchmarks and demonstrate strong zero-shot performance, as well as effective real-world deployment on a mobile robot.
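The abstract describes the approach only at a high level. As a rough illustration, under assumptions not stated in the paper, the sketch below shows generic frontier detection on a 2D occupancy grid and subgoal selection by a vision-language score. All names are hypothetical; in particular, score_frontier is only a placeholder for querying a pretrained vision-language prior (e.g., image-text similarity against the instruction).

```python
import numpy as np

FREE, OCCUPIED, UNKNOWN = 0, 1, -1


def find_frontier_cells(grid: np.ndarray) -> np.ndarray:
    """Return (row, col) indices of free cells adjacent to unknown space."""
    free = grid == FREE
    unknown = grid == UNKNOWN
    # A free cell is a frontier if any 4-neighbour is still unknown.
    neighbour_unknown = np.zeros_like(free)
    neighbour_unknown[1:, :] |= unknown[:-1, :]
    neighbour_unknown[:-1, :] |= unknown[1:, :]
    neighbour_unknown[:, 1:] |= unknown[:, :-1]
    neighbour_unknown[:, :-1] |= unknown[:, 1:]
    return np.argwhere(free & neighbour_unknown)


def score_frontier(frontier_view, instruction: str) -> float:
    """Hypothetical stand-in for a vision-language prior.

    A real system would score how well the camera view toward this
    frontier matches the language goal; here we return a random value
    so the sketch stays self-contained.
    """
    return float(np.random.rand())


def select_subgoal(grid, frontier_views, instruction):
    """Pick the frontier whose view best matches the instruction."""
    frontiers = find_frontier_cells(grid)
    if len(frontiers) == 0:
        return None  # nothing left to explore
    scores = [score_frontier(frontier_views.get(tuple(f)), instruction)
              for f in frontiers]
    return tuple(frontiers[int(np.argmax(scores))])


# Toy 5x5 map: left three columns explored free, the rest unknown.
grid = np.full((5, 5), UNKNOWN)
grid[:, :3] = FREE
print(select_subgoal(grid, {}, "find the kitchen"))
```

The selected frontier cell would then be handed to an ordinary local planner as a sparse subgoal, which is the sense in which the paper describes navigation as subgoal identification and reaching.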
Related papers
- TANGO: Traversability-Aware Navigation with Local Metric Control for Topological Goals [10.69725316052444]
We present a novel RGB-only, object-level topometric navigation pipeline that enables zero-shot, long-horizon robot navigation. Our approach integrates global topological path planning with local metric trajectory control, allowing the robot to navigate towards object-level sub-goals while avoiding obstacles. We demonstrate the effectiveness of our method in both simulated environments and real-world tests, highlighting its robustness and deployability.
arXiv Detail & Related papers (2025-09-10T15:43:32Z) - NavigScene: Bridging Local Perception and Global Navigation for Beyond-Visual-Range Autonomous Driving [10.597463021650382]
NavigScene is an auxiliary navigation-guided natural language dataset that simulates a human-like driving environment within autonomous driving systems. We develop three paradigms to leverage NavigScene: (1) Navigation-guided Reasoning, which enhances vision-language models by incorporating navigation context into the prompting approach; (2) Navigation-guided Preference Optimization, a reinforcement learning method that extends Direct Preference Optimization to improve vision-language model responses; and (3) Navigation-guided Vision-Language-Action model, which integrates navigation guidance and vision-language models with conventional driving models through feature fusion.
arXiv Detail & Related papers (2025-07-07T17:37:01Z) - Grounded Vision-Language Navigation for UAVs with Open-Vocabulary Goal Understanding [1.280979348722635]
Vision-and-language navigation (VLN) is a long-standing challenge in autonomous robotics, aiming to empower agents with the ability to follow human instructions while navigating complex environments. We propose Vision-Language Fly (VLFly), a framework tailored for Unmanned Aerial Vehicles (UAVs) to execute language-guided flight.
arXiv Detail & Related papers (2025-06-12T14:40:50Z) - Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks [24.690910258151693]
Existing models for embodied navigation fall short of serving as practical generalists in the real world. We present Uni-NaVid, the first video-based vision-language-action model designed to unify diverse embodied navigation tasks. Uni-NaVid achieves this by unifying the input and output data configurations for all commonly used embodied navigation tasks.
arXiv Detail & Related papers (2024-12-09T05:55:55Z) - Co-NavGPT: Multi-Robot Cooperative Visual Semantic Navigation Using Vision Language Models [8.668211481067457]
Co-NavGPT is a novel framework that integrates a Vision Language Model (VLM) as a global planner. Co-NavGPT aggregates sub-maps from multiple robots with diverse viewpoints into a unified global map. The VLM uses this information to assign frontiers across the robots, facilitating coordinated and efficient exploration.
arXiv Detail & Related papers (2023-10-11T23:17:43Z) - ViNT: A Foundation Model for Visual Navigation [52.2571739391896]
Visual Navigation Transformer (ViNT) is a foundation model for vision-based robotic navigation.
ViNT is trained with a general goal-reaching objective that can be used with any navigation dataset.
It exhibits positive transfer, outperforming specialist models trained on singular datasets.
arXiv Detail & Related papers (2023-06-26T16:57:03Z) - ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments [56.194988818341976]
Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments.
We propose ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability to perform obstacle-avoiding control in continuous environments.
ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on R2R-CE and RxR-CE datasets.
arXiv Detail & Related papers (2023-04-06T13:07:17Z) - ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object Navigation [75.13546386761153]
We present a novel zero-shot object navigation method, Exploration with Soft Commonsense Constraints (ESC).
ESC transfers commonsense knowledge in pre-trained models to open-world object navigation without any navigation experience.
Experiments on MP3D, HM3D, and RoboTHOR benchmarks show that our ESC method improves significantly over baselines.
arXiv Detail & Related papers (2023-01-30T18:37:32Z) - LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action [76.71101507291473]
We present a system, LM-Nav, for robotic navigation that enjoys the benefits of training on unannotated large datasets of trajectories.
We show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data.
arXiv Detail & Related papers (2022-07-10T10:41:50Z) - Improving Target-driven Visual Navigation with Attention on 3D Spatial Relationships [52.72020203771489]
We investigate target-driven visual navigation using deep reinforcement learning (DRL) in 3D indoor scenes.
Our proposed method combines visual features and 3D spatial representations to learn a navigation policy.
Our experiments, performed in the AI2-THOR environment, show that our model outperforms the baselines on both success rate (SR) and success weighted by path length (SPL).
arXiv Detail & Related papers (2020-04-29T08:46:38Z)