UrbanVLA: A Vision-Language-Action Model for Urban Micromobility
- URL: http://arxiv.org/abs/2510.23576v1
- Date: Mon, 27 Oct 2025 17:46:43 GMT
- Title: UrbanVLA: A Vision-Language-Action Model for Urban Micromobility
- Authors: Anqi Li, Zhiyong Wang, Jiazhao Zhang, Minghan Li, Yunpeng Qi, Zhibo Chen, Zhizheng Zhang, He Wang
- Abstract summary: Urban micromobility applications demand reliable navigation across large-scale urban environments. We propose UrbanVLA, a framework designed for scalable urban navigation. We show that UrbanVLA surpasses strong baselines by more than 55% in the SocialNav task on MetaUrban.
- Score: 29.195408718461845
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Urban micromobility applications, such as delivery robots, demand reliable navigation across large-scale urban environments while following long-horizon route instructions. This task is particularly challenging due to the dynamic and unstructured nature of real-world city areas, yet most existing navigation methods remain tailored to short-scale and controllable scenarios. Effective urban micromobility requires two complementary levels of navigation skills: low-level capabilities such as point-goal reaching and obstacle avoidance, and high-level capabilities, such as route-visual alignment. To this end, we propose UrbanVLA, a route-conditioned Vision-Language-Action (VLA) framework designed for scalable urban navigation. Our method explicitly aligns noisy route waypoints with visual observations during execution, and subsequently plans trajectories to drive the robot. To enable UrbanVLA to master both levels of navigation, we employ a two-stage training pipeline. The process begins with Supervised Fine-Tuning (SFT) using simulated environments and trajectories parsed from web videos. This is followed by Reinforcement Fine-Tuning (RFT) on a mixture of simulation and real-world data, which enhances the model's safety and adaptability in real-world settings. Experiments demonstrate that UrbanVLA surpasses strong baselines by more than 55% in the SocialNav task on MetaUrban. Furthermore, UrbanVLA achieves reliable real-world navigation, showcasing both scalability to large-scale urban environments and robustness against real-world uncertainties.
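The abstract describes a route-conditioned control loop: noisy route waypoints are aligned with the robot's visual observations during execution, and a short trajectory is then planned from the aligned estimate. The sketch below illustrates that loop in schematic Python; the class and function names (Observation, align_waypoints, plan_trajectory) and the distance-based alignment rule are illustrative assumptions, not the interfaces or method from the paper.

```python
# Schematic sketch of a route-conditioned navigation loop (illustrative only).
# All names below are hypothetical placeholders, not UrbanVLA's actual API.

from dataclasses import dataclass
from typing import List, Optional, Tuple

import numpy as np


@dataclass
class Observation:
    rgb: np.ndarray            # current camera frame, H x W x 3
    pose: Tuple[float, float]  # rough robot position from odometry/GPS


def align_waypoints(route: List[Tuple[float, float]],
                    obs: Observation,
                    radius: float = 5.0) -> List[Tuple[float, float]]:
    """Keep only upcoming waypoints near the robot. This is a stand-in for
    the paper's route-visual alignment, which relies on visual evidence
    rather than a simple distance gate."""
    px, py = obs.pose
    return [(x, y) for (x, y) in route
            if (x - px) ** 2 + (y - py) ** 2 <= radius ** 2]


def plan_trajectory(obs: Observation,
                    local_goal: Tuple[float, float],
                    horizon: int = 8) -> np.ndarray:
    """Hypothetical low-level planner: interpolate a short trajectory toward
    the aligned local goal. The actual model predicts this with a VLA policy."""
    start = np.array(obs.pose, dtype=float)
    goal = np.array(local_goal, dtype=float)
    steps = np.linspace(0.0, 1.0, horizon)[:, None]
    return start + steps * (goal - start)


def control_step(route: List[Tuple[float, float]],
                 obs: Observation) -> Optional[np.ndarray]:
    """One tick of the route-conditioned loop: align waypoints, then plan."""
    local = align_waypoints(route, obs)
    if not local:
        return None  # route consumed or alignment failed
    return plan_trajectory(obs, local[0])
```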
Related papers
- History-Enhanced Two-Stage Transformer for Aerial Vision-and-Language Navigation [64.51891404034164]
Aerial Vision-and-Language Navigation (AVLN) requires Unmanned Aerial Vehicle (UAV) agents to localize targets in large-scale urban environments. Existing UAV agents typically adopt mono-granularity frameworks that struggle to balance these two aspects. This work proposes a History-Enhanced Two-Stage Transformer (HETT) framework, which integrates the two aspects through a coarse-to-fine navigation pipeline.
arXiv Detail & Related papers (2025-12-16T09:16:07Z) - UrbanNav: Learning Language-Guided Urban Navigation from Web-Scale Human Trajectories [17.380146582395145]
UrbanNav is a framework that trains embodied agents to follow free-form language instructions in diverse urban settings. Our model learns robust navigation policies to tackle complex urban scenarios. Results show that UrbanNav significantly outperforms existing methods.
arXiv Detail & Related papers (2025-12-10T12:54:04Z) - CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory [39.76840258489023]
Aerial vision-and-language navigation (VLN) requires drones to interpret natural language instructions and navigate complex urban environments. We propose CityNavAgent, a large language model (LLM)-empowered agent that significantly reduces the navigation complexity for urban aerial VLN.
arXiv Detail & Related papers (2025-05-08T20:01:35Z) - Towards Autonomous Micromobility through Scalable Urban Simulation [52.749987132021324]
Current micromobility depends mostly on human manual operation (in-person or remote control). In this work, we present a scalable urban simulation solution to advance autonomous micromobility.
arXiv Detail & Related papers (2025-05-01T17:52:29Z) - CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos [11.912608309403359]
We propose a scalable, data-driven approach for human-like urban navigation. We train agents on thousands of hours of in-the-wild city walking and driving videos sourced from the web. Our model learns sophisticated navigation policies to handle diverse challenges and critical scenarios.
arXiv Detail & Related papers (2024-11-26T19:02:20Z) - MetaUrban: An Embodied AI Simulation Platform for Urban Micromobility [52.0930915607703]
Recent advances in Robotics and Embodied AI make public urban spaces no longer exclusive to humans.
Micromobility enabled by AI for short-distance travel in public urban spaces is a crucial component of the future transportation system.
We present MetaUrban, a compositional simulation platform for AI-driven urban micromobility research.
arXiv Detail & Related papers (2024-07-11T17:56:49Z) - Learning Robust Autonomous Navigation and Locomotion for Wheeled-Legged Robots [50.02055068660255]
Navigating urban environments poses unique challenges for robots, necessitating innovative solutions for locomotion and navigation.
This work introduces a fully integrated system comprising adaptive locomotion control, mobility-aware local navigation planning, and large-scale path planning within the city.
Using model-free reinforcement learning (RL) techniques and privileged learning, we develop a versatile locomotion controller.
Our controllers are integrated into a large-scale urban navigation system and validated by autonomous, kilometer-scale navigation missions conducted in Zurich, Switzerland, and Seville, Spain.
arXiv Detail & Related papers (2024-05-03T00:29:20Z) - ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments [56.194988818341976]
Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments.
We propose ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability to perform obstacle-avoiding control in continuous environments.
ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on R2R-CE and RxR-CE datasets.
arXiv Detail & Related papers (2023-04-06T13:07:17Z) - Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation [87.03299519917019]
We propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding.
We build a topological map on-the-fly to enable efficient exploration in global action space.
The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation benchmarks.
arXiv Detail & Related papers (2022-02-23T19:06:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.