CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory
- URL: http://arxiv.org/abs/2505.05622v1
- Date: Thu, 08 May 2025 20:01:35 GMT
- Title: CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory
- Authors: Weichen Zhang, Chen Gao, Shiquan Yu, Ruiying Peng, Baining Zhao, Qian Zhang, Jinqiang Cui, Xinlei Chen, Yong Li,
- Abstract summary: Aerial vision-and-language navigation (VLN) requires drones to interpret natural language instructions and navigate complex urban environments. We propose CityNavAgent, a large language model (LLM)-empowered agent that significantly reduces the navigation complexity for urban aerial VLN.
- Score: 39.76840258489023
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aerial vision-and-language navigation (VLN), requiring drones to interpret natural language instructions and navigate complex urban environments, emerges as a critical embodied AI challenge that bridges human-robot interaction, 3D spatial reasoning, and real-world deployment. Although existing ground VLN agents have achieved notable results in indoor and outdoor settings, they struggle in aerial VLN due to the absence of predefined navigation graphs and the exponentially expanding action space in long-horizon exploration. In this work, we propose CityNavAgent, a large language model (LLM)-empowered agent that significantly reduces the navigation complexity for urban aerial VLN. Specifically, we design a hierarchical semantic planning module (HSPM) that decomposes the long-horizon task into sub-goals at different semantic levels. The agent reaches the target progressively by achieving these sub-goals with different capabilities of the LLM. Additionally, a global memory module that stores historical trajectories in a topological graph is developed to simplify navigation to previously visited targets. Extensive benchmark experiments show that our method achieves state-of-the-art performance with significant improvements. Further experiments demonstrate the effectiveness of the different modules of CityNavAgent for aerial VLN in continuous city environments. The code is available at https://github.com/VinceOuti/CityNavAgent.
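The abstract names two mechanisms: LLM-driven decomposition of the instruction into sub-goals, and a topological memory of visited waypoints. The sketch below is not the released CityNavAgent code; it only illustrates those two ideas under assumed interfaces (`llm` is a hypothetical text-in/text-out callable, and the graph layout is deliberately simplified).

```python
from dataclasses import dataclass

@dataclass
class MemoryNode:
    """A visited waypoint stored in the global topological memory."""
    node_id: int
    position: tuple   # (x, y, z) in world coordinates
    landmark: str     # semantic label observed at this waypoint

class TopologicalMemory:
    """Stores historical trajectories as a graph of visited waypoints so a
    previously reached target can be revisited by graph search instead of
    fresh exploration (an illustration, not the paper's data structure)."""

    def __init__(self):
        self.nodes = {}   # node_id -> MemoryNode
        self.edges = {}   # node_id -> set of neighbouring node_ids

    def add_waypoint(self, node, prev_id=None):
        self.nodes[node.node_id] = node
        self.edges.setdefault(node.node_id, set())
        if prev_id is not None:
            self.edges.setdefault(prev_id, set()).add(node.node_id)
            self.edges[node.node_id].add(prev_id)

    def shortest_path(self, start, goal):
        """Plain BFS over the memory graph."""
        frontier, parents = [start], {start: None}
        while frontier:
            cur = frontier.pop(0)
            if cur == goal:
                path = []
                while cur is not None:
                    path.append(cur)
                    cur = parents[cur]
                return path[::-1]
            for nxt in self.edges.get(cur, ()):
                if nxt not in parents:
                    parents[nxt] = cur
                    frontier.append(nxt)
        return []

def decompose_instruction(llm, instruction):
    """Hypothetical HSPM-style step: ask an LLM for an ordered list of
    landmark sub-goals, one per line (the prompt format is an assumption)."""
    prompt = ("Split this aerial navigation instruction into an ordered "
              "list of landmark sub-goals, one per line:\n" + instruction)
    return [s.strip() for s in llm(prompt).splitlines() if s.strip()]
```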
Related papers
- UAV-ON: A Benchmark for Open-World Object Goal Navigation with Aerial Agents [5.414995940540323]
UAV-ON is a benchmark for large-scale Object Goal Navigation (ObjectNav) by aerial agents in open-world environments. It comprises 14 high-fidelity Unreal Engine environments with diverse semantic regions and complex spatial layouts. It defines 1270 annotated target objects, each characterized by an instance-level instruction that encodes category, physical footprint, and visual descriptors.
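Concretely, an instance-level instruction of this kind can be pictured as a small record; the field names below are assumptions for illustration, not the released UAV-ON schema.

```python
from dataclasses import dataclass

@dataclass
class ObjectGoalInstruction:
    """Illustrative container for an instance-level instruction encoding
    category, physical footprint, and visual descriptors. Field names are
    assumptions, not the released UAV-ON schema."""
    category: str                  # e.g. "delivery truck"
    footprint_m: tuple             # approximate (length, width, height) in metres
    visual_descriptors: list       # e.g. ["white cab", "blue container"]
    environment: str               # which simulator scene the object sits in

example = ObjectGoalInstruction(
    category="delivery truck",
    footprint_m=(7.0, 2.5, 3.0),
    visual_descriptors=["white cab", "blue container"],
    environment="urban_block_03",
)
print(example.category, example.visual_descriptors)
```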
arXiv Detail & Related papers (2025-08-01T03:23:06Z)
- NavAgent: Multi-scale Urban Street View Fusion For UAV Embodied Vision-and-Language Navigation [15.628308089720269]
Vision-and-Language Navigation (VLN) aims to enable embodied agents to navigate in complicated visual environments through natural language commands.
We propose NavAgent, the first urban UAV embodied navigation model driven by a large Vision-Language Model.
We build a landmark visual recognizer capable of identifying fine-grained landmarks and expressing them in language.
To train this landmark recognizer, we develop NavAgent-Landmark2K, the first fine-grained landmark dataset for real urban street scenes.
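As a rough illustration of what expressing a detected landmark in language might mean in practice, a toy function could render detections as short phrases; the dict layout is an assumption, and the actual recognizer is a learned vision-language model.

```python
def linguisticize_landmarks(detections):
    """Toy version of turning fine-grained landmark detections into text that a
    navigation policy can ground against. The dict fields are assumptions; the
    real recognizer is a learned vision-language model."""
    phrases = []
    for det in detections:
        attrs = " ".join(det.get("attributes", []))
        phrases.append(f"{attrs} {det['label']}".strip())
    return "; ".join(phrases)

print(linguisticize_landmarks([
    {"label": "storefront", "attributes": ["red", "awning"]},
    {"label": "bus stop"},
]))
```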
arXiv Detail & Related papers (2024-11-13T12:51:49Z)
- Aerial Vision-and-Language Navigation via Semantic-Topo-Metric Representation Guided LLM Reasoning [48.33405770713208]
We propose an end-to-end framework for aerial VLN tasks, where a large language model (LLM) is introduced as our agent for action prediction.
We develop a novel Semantic-Topo-Metric Representation (STMR) to enhance the spatial reasoning ability of LLMs.
Experiments conducted in both real and simulated environments demonstrate the effectiveness and robustness of our method.
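The summary does not spell out the representation itself; the sketch below only conveys the general idea of serializing nearby landmarks into an agent-centred top-down grid inside an LLM prompt. Grid size, cell size, and prompt wording are assumptions rather than the paper's STMR.

```python
def build_stmr_prompt(agent_xy, landmarks, cell_size=10.0, grid=8):
    """Serialize landmarks into an agent-centred top-down grid as text for an
    LLM prompt. Grid size, cell size, and wording are assumptions; this is
    only a cartoon of a semantic-topo-metric style representation."""
    rows = [["." for _ in range(grid)] for _ in range(grid)]
    cx = cy = grid // 2
    for name, (x, y) in landmarks.items():
        dx, dy = x - agent_xy[0], y - agent_xy[1]
        col = min(max(cx + int(round(dx / cell_size)), 0), grid - 1)
        row = min(max(cy - int(round(dy / cell_size)), 0), grid - 1)
        rows[row][col] = name[0].upper()           # one-letter marker per landmark
    grid_text = "\n".join(" ".join(r) for r in rows)
    legend = ", ".join(f"{n[0].upper()}={n}" for n in landmarks)
    return (f"Top-down map (agent at centre, {cell_size} m cells):\n"
            f"{grid_text}\nLegend: {legend}\n"
            "Which direction should the drone move next?")

print(build_stmr_prompt((0.0, 0.0),
                        {"park": (30.0, 10.0), "tower": (-20.0, 40.0)}))
```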
arXiv Detail & Related papers (2024-10-11T03:54:48Z)
- CityNav: Language-Goal Aerial Navigation Dataset with Geographic Information [25.51740922661166]
Vision-and-language navigation (VLN) aims to guide autonomous agents through real-world environments by integrating visual and linguistic cues.
We introduce CityNav, a novel dataset explicitly designed for language-guided aerial navigation in 3D environments of real cities.
CityNav comprises 32k natural language descriptions paired with human demonstration trajectories, collected via a newly developed web-based 3D simulator.
arXiv Detail & Related papers (2024-06-20T12:08:27Z)
- SayNav: Grounding Large Language Models for Dynamic Planning to Navigation in New Environments [14.179677726976056]
SayNav is a new approach that leverages human knowledge from Large Language Models (LLMs) for efficient generalization to complex navigation tasks.
SayNav achieves state-of-the-art results and even outperforms an oracle-based baseline with strong ground-truth assumptions by more than 8% in terms of success rate.
arXiv Detail & Related papers (2023-09-08T02:24:37Z)
- AerialVLN: Vision-and-Language Navigation for UAVs [23.40363176320464]
We propose a new task named AerialVLN, which is UAV-based and targets outdoor environments.
We develop a 3D simulator rendered with near-realistic imagery of 25 city-level scenarios.
We find that there is still a significant gap between the baseline model and human performance, which suggests AerialVLN is a new challenging task.
arXiv Detail & Related papers (2023-08-13T09:55:04Z)
- ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments [56.194988818341976]
Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments.
We propose ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability of obstacle-avoiding control in continuous environments.
ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on R2R-CE and RxR-CE datasets.
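To make the two skills concrete, here is a deliberately simple two-stage sketch: BFS over a hand-given topological graph for the long-range plan, then straight-line stepping with a crude repulsion term for local obstacle avoidance. ETPNav's actual planner and controller are learned modules; this is only an illustration.

```python
import math

def plan_then_control(topo_nodes, topo_edges, start, goal, obstacles,
                      step=1.0, clearance=2.0):
    """Two-stage sketch in the spirit of 'plan on a topological map, then do
    local obstacle-avoiding control'. Not the ETPNav implementation.
    topo_nodes: {name: (x, y)}; topo_edges: {name: [neighbor names]}."""
    # 1) Long-range plan: BFS over the topological graph.
    frontier, parents = [start], {start: None}
    while frontier:
        cur = frontier.pop(0)
        if cur == goal:
            break
        for nxt in topo_edges.get(cur, []):
            if nxt not in parents:
                parents[nxt] = cur
                frontier.append(nxt)
    if goal not in parents:
        return []                                  # goal unreachable in the map
    plan, cur = [], goal
    while cur is not None:
        plan.append(cur)
        cur = parents[cur]
    plan.reverse()

    # 2) Local control: step toward each waypoint, nudging away from any
    #    obstacle that comes closer than `clearance`.
    pos, trace = list(topo_nodes[start]), []
    for wp in plan[1:]:
        tx, ty = topo_nodes[wp]
        for _ in range(500):                       # cap steps per waypoint
            dx, dy = tx - pos[0], ty - pos[1]
            dist = math.hypot(dx, dy)
            if dist <= step:
                break
            pos[0] += step * dx / dist
            pos[1] += step * dy / dist
            for ox, oy in obstacles:
                if math.hypot(ox - pos[0], oy - pos[1]) < clearance:
                    pos[0] += (pos[0] - ox) * 0.5  # simple repulsion term
                    pos[1] += (pos[1] - oy) * 0.5
            trace.append(tuple(pos))
    return trace
```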
arXiv Detail & Related papers (2023-04-06T13:07:17Z)
- Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation [87.03299519917019]
We propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding.
We build a topological map on-the-fly to enable efficient exploration in the global action space.
The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation benchmarks.
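A toy reading of the "think global, act local" decision rule: fuse a coarse score over every mapped node (the global action space) with a fine-grained score over candidates visible from the current node. The fixed weighting below is an assumption; DUET learns this fusion with a dual-scale graph transformer.

```python
def fuse_action_scores(global_scores, local_scores, sigma=0.5):
    """Fuse a coarse score over all map nodes with a fine-grained score over
    currently visible candidates. The fixed weighting is an assumption, not
    DUET's learned fusion."""
    fused = {}
    for node, g in global_scores.items():
        l = local_scores.get(node)          # None if not currently visible
        fused[node] = g if l is None else sigma * g + (1 - sigma) * l
    return max(fused, key=fused.get)

# Example: the agent may back-track to any mapped node, but nodes visible
# from its current position also receive a fine-grained local score.
global_scores = {"nodeA": 0.2, "nodeB": 0.7, "nodeC": 0.4}
local_scores = {"nodeC": 0.9}
print(fuse_action_scores(global_scores, local_scores))
```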
arXiv Detail & Related papers (2022-02-23T19:06:53Z)
- Structured Scene Memory for Vision-Language Navigation [155.63025602722712]
We propose Structured Scene Memory (SSM), a crucial architecture for vision-language navigation (VLN).
It is compartmentalized enough to accurately memorize the percepts during navigation.
It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment.
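A minimal sketch of the disentangling idea: keep per-location visual and geometric cues in separate slots keyed by graph node. Class and field names are assumptions, not the paper's Structured Scene Memory.

```python
from dataclasses import dataclass, field

@dataclass
class SceneNodeMemory:
    """Illustrative per-location memory slot that keeps visual and geometric
    cues in separate fields (a sketch of the disentangling idea, not the
    paper's actual representation)."""
    visual: list = field(default_factory=list)     # e.g. object labels seen here
    geometric: list = field(default_factory=list)  # e.g. relative poses / headings

class StructuredSceneMemory:
    def __init__(self):
        self.nodes = {}   # node_id -> SceneNodeMemory

    def write(self, node_id, objects, poses):
        slot = self.nodes.setdefault(node_id, SceneNodeMemory())
        slot.visual.extend(objects)
        slot.geometric.extend(poses)

    def read(self, node_id):
        return self.nodes.get(node_id, SceneNodeMemory())

mem = StructuredSceneMemory()
mem.write("corridor_1", ["red door", "plant"], [(1.0, 0.0, 90.0)])
print(mem.read("corridor_1").visual)
```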
arXiv Detail & Related papers (2021-03-05T03:41:00Z)
- Occupancy Anticipation for Efficient Exploration and Navigation [97.17517060585875]
We propose occupancy anticipation, where the agent uses its egocentric RGB-D observations to infer the occupancy state beyond the visible regions.
By exploiting context in both the egocentric views and top-down maps, our model successfully anticipates a broader map of the environment.
Our approach is the winning entry in the 2020 Habitat PointNav Challenge.
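The interface can be pictured as "partial top-down grid in, completed grid out". The placeholder below fills unobserved cells with the nearest observed value, standing in for the learned anticipation model described in the paper.

```python
import numpy as np

def anticipate_occupancy(partial, unknown=-1):
    """Sketch of the occupancy-anticipation interface: given a local top-down
    grid where -1 marks cells outside the visible region, predict their
    occupancy. A learned model does this in the paper; a nearest-known-cell
    fill stands in here as a placeholder."""
    known = np.argwhere(partial != unknown)
    out = partial.astype(float).copy()
    for r, c in np.argwhere(partial == unknown):
        d = np.abs(known - [r, c]).sum(axis=1)     # Manhattan distance to known cells
        nr, nc = known[d.argmin()]
        out[r, c] = partial[nr, nc]
    return out

# 0 = free, 1 = occupied, -1 = not yet observed
partial = np.array([[0, 0, -1],
                    [0, 1, -1],
                    [-1, -1, -1]])
print(anticipate_occupancy(partial))
```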
arXiv Detail & Related papers (2020-08-21T03:16:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.