UrbanNav: Learning Language-Guided Urban Navigation from Web-Scale Human Trajectories
- URL: http://arxiv.org/abs/2512.09607v1
- Date: Wed, 10 Dec 2025 12:54:04 GMT
- Title: UrbanNav: Learning Language-Guided Urban Navigation from Web-Scale Human Trajectories
- Authors: Yanghong Mei, Yirong Yang, Longteng Guo, Qunbo Wang, Ming-Ming Yu, Xingjian He, Wenjun Wu, Jing Liu
- Abstract summary: UrbanNav is a framework that trains embodied agents to follow free-form language instructions in diverse urban settings. Our model learns robust navigation policies to tackle complex urban scenarios. Results show that UrbanNav significantly outperforms existing methods.
- Score: 17.380146582395145
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Navigating complex urban environments using natural language instructions poses significant challenges for embodied agents, including noisy language instructions, ambiguous spatial references, diverse landmarks, and dynamic street scenes. Current visual navigation methods are typically limited to simulated or off-street environments, and often rely on precise goal formats, such as specific coordinates or images. This limits their effectiveness for autonomous agents like last-mile delivery robots navigating unfamiliar cities. To address these limitations, we introduce UrbanNav, a scalable framework that trains embodied agents to follow free-form language instructions in diverse urban settings. Leveraging web-scale city walking videos, we develop a scalable annotation pipeline that aligns human navigation trajectories with language instructions grounded in real-world landmarks. UrbanNav encompasses over 1,500 hours of navigation data and 3 million instruction-trajectory-landmark triplets, capturing a wide range of urban scenarios. Our model learns robust navigation policies to tackle complex urban scenarios, demonstrating superior spatial reasoning, robustness to noisy instructions, and generalization to unseen urban settings. Experimental results show that UrbanNav significantly outperforms existing methods, highlighting the potential of large-scale web video data to enable language-guided, real-world urban navigation for embodied agents.
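As a rough illustration of the kind of instruction-trajectory-landmark record the abstract describes, the following Python sketch defines a minimal data structure plus a naive fixed-window trajectory segmentation helper. The field names, units, and segmentation window are illustrative assumptions, not UrbanNav's actual data schema or annotation pipeline.

```python
# Minimal sketch of an instruction-trajectory-landmark record, as suggested by
# the abstract. Field names and units are hypothetical, not UrbanNav's schema.
from dataclasses import dataclass
from typing import List


@dataclass
class Pose:
    """A single trajectory sample: time (s), planar position (m), heading (rad)."""
    t: float
    x: float
    y: float
    heading: float


@dataclass
class NavTriplet:
    """One instruction-trajectory-landmark training example."""
    instruction: str        # free-form language, e.g. "turn left after the bakery"
    landmarks: List[str]    # real-world landmarks referenced by the instruction
    trajectory: List[Pose]  # human walking trajectory extracted from video
    video_id: str           # source city-walking video (hypothetical identifier)


def segment(traj: List[Pose], window_s: float = 5.0) -> List[List[Pose]]:
    """Split a trajectory into fixed-duration chunks, e.g. to pair sub-instructions
    with local motion. The 5-second window is an arbitrary illustrative choice."""
    if not traj:
        return []
    chunks, current, start = [], [], traj[0].t
    for p in traj:
        if p.t - start >= window_s and current:
            chunks.append(current)
            current, start = [], p.t
        current.append(p)
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    example = NavTriplet(
        instruction="Walk past the fountain and turn right at the red cafe.",
        landmarks=["fountain", "red cafe"],
        trajectory=[Pose(t=i * 0.5, x=0.7 * i, y=0.1 * i, heading=0.0) for i in range(30)],
        video_id="city_walk_000123",
    )
    print(len(segment(example.trajectory)), "segments")
```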
Related papers
- UrbanVLA: A Vision-Language-Action Model for Urban Micromobility [29.195408718461845]
Urban micromobility applications demand reliable navigation across large-scale urban environments. We propose UrbanVLA, a framework designed for scalable urban navigation. We show that UrbanVLA surpasses strong baselines by more than 55% in the SocialNav task on MetaUrban.
arXiv Detail & Related papers (2025-10-27T17:46:43Z)
- UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos [64.22243628420799]
We introduce UrbanVerse, a data-driven real-to-sim system that converts crowd-sourced city-tour videos into physics-aware, interactive simulation scenes. Running in IsaacSim, UrbanVerse offers 160 high-quality constructed scenes from 24 countries, along with a curated benchmark of 10 artist-designed test scenes. Experiments show that UrbanVerse scenes preserve real-world semantics and layouts, achieving human-evaluated realism comparable to manually crafted scenes.
arXiv Detail & Related papers (2025-10-16T17:42:34Z)
- CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory [39.76840258489023]
Aerial vision-and-language navigation (VLN) requires drones to interpret natural language instructions and navigate complex urban environments. We propose CityNavAgent, a large language model (LLM)-empowered agent that significantly reduces the navigation complexity for urban aerial VLN.
arXiv Detail & Related papers (2025-05-08T20:01:35Z)
- Learning to Drive Anywhere with Model-Based Reannotation [49.80796496905606]
We develop a framework for generalizable visual navigation policies for robots. We leverage passively collected data, including crowd-sourced teleoperation data and unlabeled YouTube videos. This data is relabeled with a learned model (model-based reannotation) and then distilled into LogoNav, a long-horizon navigation policy conditioned on visual goals or GPS waypoints.
arXiv Detail & Related papers (2025-05-08T18:43:39Z)
- NavRAG: Generating User Demand Instructions for Embodied Navigation through Retrieval-Augmented LLM [55.79954652783797]
Vision-and-Language Navigation (VLN) is an essential skill for embodied agents, allowing them to navigate in 3D environments following natural language instructions. Previous methods translate trajectory videos into step-by-step instructions for expanding data, but such instructions do not match well with users' communication styles. We propose NavRAG, a retrieval-augmented generation framework that generates user demand instructions for VLN.
arXiv Detail & Related papers (2025-02-16T14:17:36Z)
- CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos [11.912608309403359]
We propose a scalable, data-driven approach for human-like urban navigation. We train agents on thousands of hours of in-the-wild city walking and driving videos sourced from the web. Our model learns sophisticated navigation policies to handle diverse challenges and critical scenarios.
arXiv Detail & Related papers (2024-11-26T19:02:20Z)
- NavAgent: Multi-scale Urban Street View Fusion For UAV Embodied Vision-and-Language Navigation [15.628308089720269]
Vision-and-Language Navigation (VLN) aims to enable embodied agents to navigate in complicated visual environments through natural language commands.
We propose NavAgent, the first urban UAV embodied navigation model driven by a large Vision-Language Model.
We build a landmark visual recognizer capable of identifying fine-grained landmarks and describing them in language.
To train this landmark recognizer, we develop NavAgent-Landmark2K, the first fine-grained landmark dataset of real urban street scenes.
arXiv Detail & Related papers (2024-11-13T12:51:49Z)
- CityNav: A Large-Scale Dataset for Real-World Aerial Navigation [25.51740922661166]
We introduce CityNav, the first large-scale real-world dataset for aerial VLN. Our dataset consists of 32,637 human demonstration trajectories, each paired with a natural language description. We provide a methodology for creating geographic semantic maps that can be used as an auxiliary modality input during navigation.
arXiv Detail & Related papers (2024-06-20T12:08:27Z)
- A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning [70.14372215250535]
Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments.
Given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial language understanding.
We take 500+ indoor environments captured in densely-sampled 360-degree panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory.
The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets.
arXiv Detail & Related papers (2022-10-06T17:59:08Z)
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action [76.71101507291473]
We present a system, LM-Nav, for robotic navigation that enjoys the benefits of training on unannotated large datasets of trajectories.
We show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data (a minimal sketch of this composition pattern appears after this list).
arXiv Detail & Related papers (2022-07-10T10:41:50Z)
- Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation [71.67507925788577]
This paper introduces a Multimodal Text Style Transfer (MTST) learning approach for outdoor navigation tasks.
We first enrich the navigation data by transferring the style of the instructions generated by the Google Maps API, then pre-train the navigator with the augmented external navigation dataset.
Experimental results show that our MTST learning approach is model-agnostic and significantly outperforms the baseline models on the outdoor VLN task.
arXiv Detail & Related papers (2020-07-01T04:29:07Z)
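The LM-Nav entry above describes composing frozen pre-trained models: a language model parses landmarks from the instruction, an image-text model grounds them in the agent's observations, and a planner strings them into a route over a topological graph. The Python sketch below is a heavily simplified, hypothetical illustration of that composition pattern; the landmark parser and image-text scorer are stubs standing in for GPT-3 and CLIP, the greedy hop-by-hop search replaces LM-Nav's actual graph search, and none of the names correspond to LM-Nav's real code or APIs.

```python
# Hypothetical sketch of an LM-Nav-style composition of frozen pre-trained models:
# an LLM parses landmarks from the instruction, an image-text model scores each
# graph node's observation against each landmark, and a search visits them in order.
# extract_landmarks() and image_text_score() are stubs, not real GPT-3/CLIP calls.
from typing import Dict, List, Tuple


def extract_landmarks(instruction: str) -> List[str]:
    # Stub for the language model: in practice an LLM would parse the ordered
    # landmark list; here we hard-code one for illustration.
    return ["stop sign", "blue building"]


def image_text_score(node_image: str, landmark: str) -> float:
    # Stub for CLIP-style image-text similarity over node observations.
    fake = {("img_a", "stop sign"): 0.9, ("img_c", "blue building"): 0.8}
    return fake.get((node_image, landmark), 0.1)


def plan(nodes: Dict[str, str],
         edges: Dict[Tuple[str, str], float],
         instruction: str,
         start: str) -> List[str]:
    """Greedy route: for each parsed landmark in order, hop to the adjacent node
    whose observation best matches it (a stand-in for a proper graph search)."""
    route, current = [start], start
    for landmark in extract_landmarks(instruction):
        neighbors = [b for (a, b) in edges if a == current]
        if not neighbors:
            break
        current = max(neighbors, key=lambda n: image_text_score(nodes[n], landmark))
        route.append(current)
    return route


if __name__ == "__main__":
    nodes = {"n0": "img_start", "n1": "img_a", "n2": "img_b", "n3": "img_c"}
    edges = {("n0", "n1"): 1.0, ("n0", "n2"): 1.0, ("n1", "n3"): 1.0, ("n1", "n2"): 1.0}
    print(plan(nodes, edges, "Go past the stop sign to the blue building.", "n0"))
```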