Related papers: CityNav: Language-Goal Aerial Navigation Dataset with Geographic Information

CityNav: Language-Goal Aerial Navigation Dataset with Geographic Information

URL: http://arxiv.org/abs/2406.14240v1
Date: Thu, 20 Jun 2024 12:08:27 GMT
Title: CityNav: Language-Goal Aerial Navigation Dataset with Geographic Information
Authors: Jungdae Lee, Taiki Miyanishi, Shuhei Kurita, Koya Sakamoto, Daichi Azuma, Yutaka Matsuo, Nakamasa Inoue,
Abstract summary: We introduce CityNav, a new dataset for language-goal aerial navigation using a 3D point cloud representation from real-world cities. CityNav includes 32,637 natural language descriptions paired with human demonstration trajectories. We benchmark the latest aerial navigation baselines and our proposed model on the CityNav dataset.
Score: 25.51740922661166
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-and-language navigation (VLN) aims to guide autonomous agents through real-world environments by integrating visual and linguistic cues. While substantial progress has been made in understanding these interactive modalities in ground-level navigation, aerial navigation remains largely underexplored. This is primarily due to the scarcity of resources suitable for real-world, city-scale aerial navigation studies. To bridge this gap, we introduce CityNav, a new dataset for language-goal aerial navigation using a 3D point cloud representation from real-world cities. CityNav includes 32,637 natural language descriptions paired with human demonstration trajectories, collected from participants via a new web-based 3D simulator developed for this research. Each description specifies a navigation goal, leveraging the names and locations of landmarks within real-world cities. We also provide baseline models of navigation agents that incorporate an internal 2D spatial map representing landmarks referenced in the descriptions. We benchmark the latest aerial navigation baselines and our proposed model on the CityNav dataset. The results using this dataset reveal the following key findings: (i) Our aerial agent models trained on human demonstration trajectories outperform those trained on shortest path trajectories, highlighting the importance of human-driven navigation strategies; (ii) The integration of a 2D spatial map significantly enhances navigation efficiency at city scale. Our dataset and code are available at https://water-cookie.github.io/city-nav-proj/

Related papers

SD-OVON: A Semantics-aware Dataset and Benchmark Generation Pipeline for Open-Vocabulary Object Navigation in Dynamic Scenes [15.178229677519063]
We present the Semantics-aware dataset and Benchmark Generation Pipeline for Open-vocabulary Object Navigation in Dynamic Scenes (SD-OVON)<n>It utilizes pretraining multimodal foundation models to generate infinite unique photo-realistic scene variants that adhere to real-world semantics and daily commonsense for the training and the evaluation of navigation agents.<n>We offer two pre-generated object navigation task datasets, SD-OVON-3k and SD-OVON-10k, comprising respectively about 3k and 10k episodes of the open-vocabulary object navigation task.
arXiv Detail & Related papers (2025-05-24T21:37:06Z)
CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory [39.76840258489023]
Aerial vision-and-language navigation (VLN) requires drones to interpret natural language instructions and navigate complex urban environments.<n>We propose textbfCityNavAgent, a large language model (LLM)-empowered agent that significantly reduces the navigation complexity for urban aerial VLN.
arXiv Detail & Related papers (2025-05-08T20:01:35Z)
Learning to Drive Anywhere with Model-Based Reannotation [49.80796496905606]
We develop a framework for generalizable visual navigation policies for robots.<n>We leverage passively collected data, including crowd-sourced teleoperation data and unlabeled YouTube videos.<n>This relabeled data is then distilled into LogoNav, a long-horizon navigation policy conditioned on visual goals or GPS waypoints.
arXiv Detail & Related papers (2025-05-08T18:43:39Z)
Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks [24.690910258151693]
Existing models for embodied navigation fall short of serving as practical generalists in the real world.<n>We present Uni-NaVid, the first video-based vision-language-action model designed to unify diverse embodied navigation tasks.<n>Uni-NaVid achieves this by the input and output data configurations for all commonly used embodied navigation tasks.
arXiv Detail & Related papers (2024-12-09T05:55:55Z)
UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation [71.97405667493477]
We introduce a novel, generalizable 3DGS-based pre-training paradigm, called UnitedVLN.<n>It enables agents to better explore future environments by unitedly rendering high-fidelity 360 visual images and semantic features.<n>UnitedVLN outperforms state-of-the-art methods on existing VLN-CE benchmarks.
arXiv Detail & Related papers (2024-11-25T02:44:59Z)
NavAgent: Multi-scale Urban Street View Fusion For UAV Embodied Vision-and-Language Navigation [15.628308089720269]
Vision-and-Language Navigation (VLN) aims to enable embodied agents to navigate in complicated visual environments through natural language commands. We propose NavAgent, the first urban UAV embodied navigation model driven by a large Vision-Language Model. We build a visual recognizer for landmark capable of identifying and linguisticizing fine-grained landmarks. To train the visual recognizer for landmark, we develop NavAgent-Landmark2K, the first fine-grained landmark dataset for real urban street scenes.
arXiv Detail & Related papers (2024-11-13T12:51:49Z)
GaussNav: Gaussian Splatting for Visual Navigation [92.13664084464514]
Instance ImageGoal Navigation (IIN) requires an agent to locate a specific object depicted in a goal image within an unexplored environment. Our framework constructs a novel map representation based on 3D Gaussian Splatting (3DGS) Our framework demonstrates a significant leap in performance, evidenced by an increase in Success weighted by Path Length (SPL) from 0.252 to 0.578 on the challenging Habitat-Matterport 3D (HM3D) dataset.
arXiv Detail & Related papers (2024-03-18T09:56:48Z)
VoroNav: Voronoi-based Zero-shot Object Navigation with Large Language Model [28.79971953667143]
VoroNav is a semantic exploration framework to extract exploratory paths and planning nodes from a semantic map constructed in real time. By harnessing topological and semantic information, VoroNav designs text-based descriptions of paths and images that are readily interpretable by a large language model.
arXiv Detail & Related papers (2024-01-05T08:05:07Z)
Object Goal Navigation with Recursive Implicit Maps [92.6347010295396]
We propose an implicit spatial map for object goal navigation. Our method significantly outperforms the state of the art on the challenging MP3D dataset. We deploy our model on a real robot and achieve encouraging object goal navigation results in real scenes.
arXiv Detail & Related papers (2023-08-10T14:21:33Z)
Scaling Data Generation in Vision-and-Language Navigation [116.95534559103788]
We propose an effective paradigm for generating large-scale data for learning. We apply 1200+ photo-realistic environments from HM3D and Gibson datasets and synthesizes 4.9 million instruction trajectory pairs. Thanks to our large-scale dataset, the performance of an existing agent can be pushed up (+11% absolute with regard to previous SoTA) to a significantly new best of 80% single-run success rate on the R2R test split by simple imitation learning.
arXiv Detail & Related papers (2023-07-28T16:03:28Z)
Learning Navigational Visual Representations with Semantic Map Supervision [85.91625020847358]
We propose a navigational-specific visual representation learning method by contrasting the agent's egocentric views and semantic maps. Ego$2$-Map learning transfers the compact and rich information from a map, such as objects, structure and transition, to the agent's egocentric representations for navigation.
arXiv Detail & Related papers (2023-07-23T14:01:05Z)
Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation [87.03299519917019]
We propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding. We build a topological map on-the-fly to enable efficient exploration in global action space. The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation benchmarks.
arXiv Detail & Related papers (2022-02-23T19:06:53Z)
ViKiNG: Vision-Based Kilometer-Scale Navigation with Geographic Hints [94.60414567852536]
Long-range navigation requires both planning and reasoning about local traversability. We propose a learning-based approach that integrates learning and planning. ViKiNG can leverage its image-based learned controller and goal-directed to navigate to goals up to 3 kilometers away.
arXiv Detail & Related papers (2022-02-23T02:14:23Z)
SOON: Scenario Oriented Object Navigation with Graph-based Exploration [102.74649829684617]
The ability to navigate like a human towards a language-guided target from anywhere in a 3D embodied environment is one of the 'holy grail' goals of intelligent robots. Most visual navigation benchmarks focus on navigating toward a target from a fixed starting point, guided by an elaborate set of instructions that depicts step-by-step. This approach deviates from real-world problems in which human-only describes what the object and its surrounding look like and asks the robot to start navigation from anywhere.
arXiv Detail & Related papers (2021-03-31T15:01:04Z)
Occupancy Anticipation for Efficient Exploration and Navigation [97.17517060585875]
We propose occupancy anticipation, where the agent uses its egocentric RGB-D observations to infer the occupancy state beyond the visible regions. By exploiting context in both the egocentric views and top-down maps our model successfully anticipates a broader map of the environment. Our approach is the winning entry in the 2020 Habitat PointNav Challenge.
arXiv Detail & Related papers (2020-08-21T03:16:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.