Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology
- URL: http://arxiv.org/abs/2410.07087v2
- Date: Thu, 10 Oct 2024 05:02:04 GMT
- Title: Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology
- Authors: Xiangyu Wang, Donglin Yang, Ziqin Wang, Hohin Kwan, Jinyu Chen, Wenjun Wu, Hongsheng Li, Yue Liao, Si Liu,
- Abstract summary: Recent efforts in UAV vision-language navigation predominantly adopt ground-based VLN settings.
We propose solutions from three perspectives: platform, benchmark, and methodology.
- Score: 38.2096731046639
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Developing agents capable of navigating to a target location based on language instructions and visual information, known as vision-language navigation (VLN), has attracted widespread interest. Most research has focused on ground-based agents, while UAV-based VLN remains relatively underexplored. Recent efforts in UAV vision-language navigation predominantly adopt ground-based VLN settings, relying on predefined discrete action spaces and neglecting the inherent disparities in agent movement dynamics and the complexity of navigation tasks between ground and aerial environments. To address these disparities and challenges, we propose solutions from three perspectives: platform, benchmark, and methodology. To enable realistic UAV trajectory simulation in VLN tasks, we propose the OpenUAV platform, which features diverse environments, realistic flight control, and extensive algorithmic support. We further construct a target-oriented VLN dataset consisting of approximately 12k trajectories on this platform, serving as the first dataset specifically designed for realistic UAV VLN tasks. To tackle the challenges posed by complex aerial environments, we propose an assistant-guided UAV object search benchmark called UAV-Need-Help, which provides varying levels of guidance information to help UAVs better accomplish realistic VLN tasks. We also propose a UAV navigation LLM that, given multi-view images, task descriptions, and assistant instructions, leverages the multimodal understanding capabilities of the MLLM to jointly process visual and textual information, and performs hierarchical trajectory generation. The evaluation results of our method significantly outperform the baseline models, while there remains a considerable gap between our results and those achieved by human operators, underscoring the challenge presented by the UAV-Need-Help task.
Related papers
- Integrating Large Language Models for UAV Control in Simulated Environments: A Modular Interaction Approach [0.3495246564946556]
This study explores the application of Large Language Models in UAV control.
By enabling UAVs to interpret and respond to natural language commands, LLMs simplify the UAV control and usage.
The paper discusses several key areas where LLMs can impact UAV technology, including autonomous decision-making, dynamic mission planning, enhanced situational awareness, and improved safety protocols.
arXiv Detail & Related papers (2024-10-23T06:56:53Z) - Aerial Vision-and-Language Navigation via Semantic-Topo-Metric Representation Guided LLM Reasoning [48.33405770713208]
We propose an end-to-end framework for aerial VLN tasks, where the large language model (LLM) is introduced as our agent for action prediction.
We develop a novel Semantic-Topo-Metric Representation (STMR) to enhance the spatial reasoning ability of LLMs.
Experiments conducted in real and simulation environments have successfully proved the effectiveness and robustness of our method.
arXiv Detail & Related papers (2024-10-11T03:54:48Z) - Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs [95.8010627763483]
Mobility VLA is a hierarchical Vision-Language-Action (VLA) navigation policy that combines the environment understanding and common sense reasoning power of long-context VLMs.
We show that Mobility VLA has a high end-to-end success rates on previously unsolved multimodal instructions.
arXiv Detail & Related papers (2024-07-10T15:49:07Z) - OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation [96.46961207887722]
OVER-NAV aims to go over and beyond the current arts of IVLN techniques.
To fully exploit the interpreted navigation data, we introduce a structured representation, coded Omnigraph.
arXiv Detail & Related papers (2024-03-26T02:34:48Z) - Beyond Visual Cues: Synchronously Exploring Target-Centric Semantics for
Vision-Language Tracking [3.416427651955299]
Single object tracking aims to locate one specific target in video sequences, given its initial state. Vision-Language (VL) tracking has emerged as a promising approach.
We present a novel tracker that progressively explores target-centric semantics for VL tracking.
arXiv Detail & Related papers (2023-11-28T02:28:12Z) - AerialVLN: Vision-and-Language Navigation for UAVs [23.40363176320464]
We propose a new task named AerialVLN, which is UAV-based and towards outdoor environments.
We develop a 3D simulator rendered by near-realistic pictures of 25 city-level scenarios.
We find that there is still a significant gap between the baseline model and human performance, which suggests AerialVLN is a new challenging task.
arXiv Detail & Related papers (2023-08-13T09:55:04Z) - Vision-Based UAV Self-Positioning in Low-Altitude Urban Environments [20.69412701553767]
Unmanned Aerial Vehicles (UAVs) rely on satellite systems for stable positioning.
In such situations, vision-based techniques can serve as an alternative, ensuring the self-positioning capability of UAVs.
This paper presents a new dataset, DenseUAV, which is the first publicly available dataset designed for the UAV self-positioning task.
arXiv Detail & Related papers (2022-01-23T07:18:55Z) - Trajectory Design for UAV-Based Internet-of-Things Data Collection: A
Deep Reinforcement Learning Approach [93.67588414950656]
In this paper, we investigate an unmanned aerial vehicle (UAV)-assisted Internet-of-Things (IoT) system in a 3D environment.
We present a TD3-based trajectory design for completion time minimization (TD3-TDCTM) algorithm.
Our simulation results show the superiority of the proposed TD3-TDCTM algorithm over three conventional non-learning based baseline methods.
arXiv Detail & Related papers (2021-07-23T03:33:29Z) - A Multi-UAV System for Exploration and Target Finding in Cluttered and
GPS-Denied Environments [68.31522961125589]
We propose a framework for a team of UAVs to cooperatively explore and find a target in complex GPS-denied environments with obstacles.
The team of UAVs autonomously navigates, explores, detects, and finds the target in a cluttered environment with a known map.
Results indicate that the proposed multi-UAV system has improvements in terms of time-cost, the proportion of search area surveyed, as well as successful rates for search and rescue missions.
arXiv Detail & Related papers (2021-07-19T12:54:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.