CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos
- URL: http://arxiv.org/abs/2411.17820v3
- Date: Tue, 22 Apr 2025 01:16:08 GMT
- Title: CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos
- Authors: Xinhao Liu, Jintong Li, Yicheng Jiang, Niranjan Sujay, Zhicheng Yang, Juexiao Zhang, John Abanes, Jing Zhang, Chen Feng
- Abstract summary: We propose a scalable, data-driven approach for human-like urban navigation. We train agents on thousands of hours of in-the-wild city walking and driving videos sourced from the web. Our model learns sophisticated navigation policies to handle diverse challenges and critical scenarios.
- Score: 11.912608309403359
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Navigating dynamic urban environments presents significant challenges for embodied agents, requiring advanced spatial reasoning and adherence to common-sense norms. Despite progress, existing visual navigation methods struggle in map-free or off-street settings, limiting the deployment of autonomous agents like last-mile delivery robots. To overcome these obstacles, we propose a scalable, data-driven approach for human-like urban navigation by training agents on thousands of hours of in-the-wild city walking and driving videos sourced from the web. We introduce a simple and scalable data processing pipeline that extracts action supervision from these videos, enabling large-scale imitation learning without costly annotations. Our model learns sophisticated navigation policies to handle diverse challenges and critical scenarios. Experimental results show that training on large-scale, diverse datasets significantly enhances navigation performance, surpassing current methods. This work shows the potential of using abundant online video data to develop robust navigation policies for embodied agents in dynamic urban settings. Project homepage is at https://ai4ce.github.io/CityWalker/.
Related papers
- CREStE: Scalable Mapless Navigation with Internet Scale Priors and Counterfactual Guidance [13.922655150502365]
CREStE learns representations and rewards for addressing the full mapless navigation problem.
We evaluate CREStE in kilometer-scale navigation tasks across six distinct urban environments.
arXiv Detail & Related papers (2025-03-05T21:42:46Z) - NavigateDiff: Visual Predictors are Zero-Shot Navigation Assistants [24.689242976554482]
Navigating unfamiliar environments presents significant challenges for household robots.
Existing reinforcement learning methods cannot be directly transferred to new environments.
We try to transfer the logical knowledge and the generalization ability of pre-trained foundation models to zero-shot navigation.
arXiv Detail & Related papers (2025-02-19T17:27:47Z) - Hyp2Nav: Hyperbolic Planning and Curiosity for Crowd Navigation [58.574464340559466]
We advocate for hyperbolic learning to enable crowd navigation and we introduce Hyp2Nav.
Hyp2Nav leverages the intrinsic properties of hyperbolic geometry to better encode the hierarchical nature of decision-making processes in navigation tasks.
We propose a hyperbolic policy model and a hyperbolic curiosity module that result in effective social navigation, with the best success rates and returns across multiple simulation settings.
arXiv Detail & Related papers (2024-07-18T14:40:33Z) - A Role of Environmental Complexity on Representation Learning in Deep Reinforcement Learning Agents [3.7314353481448337]
We developed a simulated navigation environment to train deep reinforcement learning agents.
We modulated the frequency of exposure to a shortcut and navigation cue, leading to the development of artificial agents with differing abilities.
We examined the encoded representations in artificial neural networks driving these agents, revealing intricate dynamics in representation learning.
arXiv Detail & Related papers (2024-07-03T18:27:26Z) - CityNav: Language-Goal Aerial Navigation Dataset with Geographic Information [25.51740922661166]
Vision-and-language navigation (VLN) aims to guide autonomous agents through real-world environments by integrating visual and linguistic cues.
We introduce CityNav, a novel dataset explicitly designed for language-guided aerial navigation in 3D environments of real cities.
CityNav comprises 32k natural language descriptions paired with human demonstration trajectories, collected via a newly developed web-based 3D simulator.
arXiv Detail & Related papers (2024-06-20T12:08:27Z) - Learning Robust Autonomous Navigation and Locomotion for Wheeled-Legged Robots [50.02055068660255]
Navigating urban environments poses unique challenges for robots, necessitating innovative solutions for locomotion and navigation.
This work introduces a fully integrated system comprising adaptive locomotion control, mobility-aware local navigation planning, and large-scale path planning within the city.
Using model-free reinforcement learning (RL) techniques and privileged learning, we develop a versatile locomotion controller.
Our controllers are integrated into a large-scale urban navigation system and validated by autonomous, kilometer-scale navigation missions conducted in Zurich, Switzerland, and Seville, Spain.
arXiv Detail & Related papers (2024-05-03T00:29:20Z) - ViNT: A Foundation Model for Visual Navigation [52.2571739391896]
Visual Navigation Transformer (ViNT) is a foundation model for vision-based robotic navigation.
ViNT is trained with a general goal-reaching objective that can be used with any navigation dataset.
It exhibits positive transfer, outperforming specialist models trained on singular datasets.
arXiv Detail & Related papers (2023-06-26T16:57:03Z) - GNM: A General Navigation Model to Drive Any Robot [67.40225397212717]
A general goal-conditioned model for vision-based navigation can be trained on data obtained from many distinct but structurally similar robots.
We analyze the necessary design decisions for effective data sharing across robots.
We deploy the trained GNM on a range of new robots, including an underactuated quadrotor.
arXiv Detail & Related papers (2022-10-07T07:26:41Z) - Learning to Walk by Steering: Perceptive Quadrupedal Locomotion in Dynamic Environments [25.366480092589022]
A quadrupedal robot must exhibit robust and agile walking behaviors in response to environmental clutter and moving obstacles.
We present a hierarchical learning framework, named PRELUDE, which decomposes the problem of perceptive locomotion into high-level decision-making and low-level gait generation.
We demonstrate the effectiveness of our approach in simulation and with hardware experiments.
arXiv Detail & Related papers (2022-09-19T17:55:07Z) - Human-Aware Robot Navigation via Reinforcement Learning with Hindsight Experience Replay and Curriculum Learning [28.045441768064215]
Reinforcement learning approaches have shown superior ability in solving sequential decision making problems.
In this work, we consider the task of training an RL agent without using demonstration data.
We propose to combine hindsight experience replay (HER) and curriculum learning (CL) with RL to efficiently learn the optimal navigation policy in dense crowds.
arXiv Detail & Related papers (2021-10-09T13:18:11Z) - Augmented reality navigation system for visual prosthesis [67.09251544230744]
We propose an augmented reality navigation system for visual prosthesis that incorporates software for reactive navigation and path planning.
It consists of four steps: locating the subject on a map, planning the subject's trajectory, showing it to the subject, and re-planning without obstacles.
Results show that our augmented navigation system helps navigation performance by reducing the time and distance to reach the goals, and even significantly reduces the number of obstacle collisions.
arXiv Detail & Related papers (2021-09-30T09:41:40Z) - Deep Learning for Embodied Vision Navigation: A Survey [108.13766213265069]
The "embodied visual navigation" problem requires an agent to navigate in a 3D environment, relying mainly on its first-person observation.
This paper attempts to establish an outline of the current works in the field of embodied visual navigation by providing a comprehensive literature survey.
arXiv Detail & Related papers (2021-07-07T12:09:04Z) - Adversarial Environment Generation for Learning to Navigate the Web [107.99759923626242]
One of the bottlenecks of training web navigation agents is providing a learnable curriculum of training environments.
We propose using Adversarial Environment Generation (AEG) to generate challenging web environments in which to train reinforcement learning (RL) agents.
We show that the navigator agent trained with our proposed Flexible b-PAIRED technique significantly outperforms competitive automatic curriculum generation baselines.
arXiv Detail & Related papers (2021-03-02T19:19:30Z) - On Embodied Visual Navigation in Real Environments Through Habitat [20.630139085937586]
Visual navigation models based on deep learning can learn effective policies when trained on large amounts of visual observations.
To deal with this limitation, several simulation platforms have been proposed in order to train visual navigation policies on virtual environments efficiently.
We show that our tool can effectively help to train and evaluate navigation policies on real-world observations without running navigation episodes in the real world.
arXiv Detail & Related papers (2020-10-26T09:19:07Z) - Active Visual Information Gathering for Vision-Language Navigation [115.40768457718325]
Vision-language navigation (VLN) is the task in which an agent must carry out navigational instructions inside photo-realistic environments.
One of the key challenges in VLN is how to conduct a robust navigation by mitigating the uncertainty caused by ambiguous instructions and insufficient observation of the environment.
This work draws inspiration from human navigation behavior and endows an agent with an active information gathering ability for a more intelligent VLN policy.
arXiv Detail & Related papers (2020-07-15T23:54:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.