Related papers: Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology

Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology

URL: http://arxiv.org/abs/2410.07087v2
Date: Thu, 10 Oct 2024 05:02:04 GMT
Title: Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology
Authors: Xiangyu Wang, Donglin Yang, Ziqin Wang, Hohin Kwan, Jinyu Chen, Wenjun Wu, Hongsheng Li, Yue Liao, Si Liu,
Abstract summary: Recent efforts in UAV vision-language navigation predominantly adopt ground-based VLN settings. We propose solutions from three perspectives: platform, benchmark, and methodology.
Score: 38.2096731046639
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Developing agents capable of navigating to a target location based on language instructions and visual information, known as vision-language navigation (VLN), has attracted widespread interest. Most research has focused on ground-based agents, while UAV-based VLN remains relatively underexplored. Recent efforts in UAV vision-language navigation predominantly adopt ground-based VLN settings, relying on predefined discrete action spaces and neglecting the inherent disparities in agent movement dynamics and the complexity of navigation tasks between ground and aerial environments. To address these disparities and challenges, we propose solutions from three perspectives: platform, benchmark, and methodology. To enable realistic UAV trajectory simulation in VLN tasks, we propose the OpenUAV platform, which features diverse environments, realistic flight control, and extensive algorithmic support. We further construct a target-oriented VLN dataset consisting of approximately 12k trajectories on this platform, serving as the first dataset specifically designed for realistic UAV VLN tasks. To tackle the challenges posed by complex aerial environments, we propose an assistant-guided UAV object search benchmark called UAV-Need-Help, which provides varying levels of guidance information to help UAVs better accomplish realistic VLN tasks. We also propose a UAV navigation LLM that, given multi-view images, task descriptions, and assistant instructions, leverages the multimodal understanding capabilities of the MLLM to jointly process visual and textual information, and performs hierarchical trajectory generation. The evaluation results of our method significantly outperform the baseline models, while there remains a considerable gap between our results and those achieved by human operators, underscoring the challenge presented by the UAV-Need-Help task.

Related papers

UAV-ON: A Benchmark for Open-World Object Goal Navigation with Aerial Agents [5.414995940540323]
UAV-ON is a benchmark for large-scale Object Goal Navigation (NavObject) by aerial agents in open-world environments.<n>It comprises 14 high-fidelity Unreal Engine environments with diverse semantic regions and complex spatial layouts.<n>It defines 1270 annotated target objects, each characterized by an instance-level instruction that encodes category, physical footprint, and visual descriptors.
arXiv Detail & Related papers (2025-08-01T03:23:06Z)
UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning [39.07541452390107]
Unmanned Aerial Vehicles (UAVs) are evolving into language-interactive platforms, enabling more intuitive forms of human-drone interaction.<n>We formalize this problem as the Flying-on-a-Word (Flow) task and introduce UAV imitation learning as an effective approach.<n>We present UAV-Flow, the first real-world benchmark for language-conditioned, fine-grained UAV control.
arXiv Detail & Related papers (2025-05-21T16:31:28Z)
UAV-VLN: End-to-End Vision Language guided Navigation for UAVs [0.0]
A core challenge in AI-guided autonomy is enabling agents to navigate realistically and effectively in previously unseen environments. We propose UAV-VLN, a novel end-to-end Vision-Language Navigation framework for Unmanned Aerial Vehicles (UAVs) Our system interprets free-form natural language instructions, grounds them into visual observations, and plans feasible aerial trajectories in diverse environments.
arXiv Detail & Related papers (2025-04-30T08:40:47Z)
UAV-VLA: Vision-Language-Action System for Large Scale Aerial Mission Generation [1.8742629471785477]
The UAV-VLA (Visual-Language-Action) system is a tool designed to facilitate communication with aerial robots. By integrating satellite imagery processing with the Visual Language Model (VLM) and the powerful capabilities of GPT, UAV-VLA enables users to generate general flight paths-and-action plans.
arXiv Detail & Related papers (2025-01-09T07:15:59Z)
UAVs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude Mobility [33.73170899086857]
Low-altitude mobility, exemplified by unmanned aerial vehicles (UAVs), has introduced transformative advancements across various domains. This paper explores the integration of large language models (LLMs) and UAVs. It categorizes and analyzes key tasks and application scenarios where UAVs and LLMs converge.
arXiv Detail & Related papers (2025-01-04T17:32:12Z)
UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation [71.97405667493477]
We introduce a novel, generalizable 3DGS-based pre-training paradigm, called UnitedVLN. It enables agents to better explore future environments by unitedly rendering high-fidelity 360 visual images and semantic features. UnitedVLN outperforms state-of-the-art methods on existing VLN-CE benchmarks.
arXiv Detail & Related papers (2024-11-25T02:44:59Z)
Integrating Large Language Models for UAV Control in Simulated Environments: A Modular Interaction Approach [0.3495246564946556]
This study explores the application of Large Language Models in UAV control. By enabling UAVs to interpret and respond to natural language commands, LLMs simplify the UAV control and usage. The paper discusses several key areas where LLMs can impact UAV technology, including autonomous decision-making, dynamic mission planning, enhanced situational awareness, and improved safety protocols.
arXiv Detail & Related papers (2024-10-23T06:56:53Z)
Aerial Vision-and-Language Navigation via Semantic-Topo-Metric Representation Guided LLM Reasoning [48.33405770713208]
We propose an end-to-end framework for aerial VLN tasks, where the large language model (LLM) is introduced as our agent for action prediction. We develop a novel Semantic-Topo-Metric Representation (STMR) to enhance the spatial reasoning ability of LLMs. Experiments conducted in real and simulation environments have successfully proved the effectiveness and robustness of our method.
arXiv Detail & Related papers (2024-10-11T03:54:48Z)
Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs [95.8010627763483]
Mobility VLA is a hierarchical Vision-Language-Action (VLA) navigation policy that combines the environment understanding and common sense reasoning power of long-context VLMs. We show that Mobility VLA has a high end-to-end success rates on previously unsolved multimodal instructions.
arXiv Detail & Related papers (2024-07-10T15:49:07Z)
OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation [96.46961207887722]
OVER-NAV aims to go over and beyond the current arts of IVLN techniques. To fully exploit the interpreted navigation data, we introduce a structured representation, coded Omnigraph.
arXiv Detail & Related papers (2024-03-26T02:34:48Z)
Beyond Visual Cues: Synchronously Exploring Target-Centric Semantics for Vision-Language Tracking [3.416427651955299]
Single object tracking aims to locate one specific target in video sequences, given its initial state. Vision-Language (VL) tracking has emerged as a promising approach. We present a novel tracker that progressively explores target-centric semantics for VL tracking.
arXiv Detail & Related papers (2023-11-28T02:28:12Z)
AerialVLN: Vision-and-Language Navigation for UAVs [23.40363176320464]
We propose a new task named AerialVLN, which is UAV-based and towards outdoor environments. We develop a 3D simulator rendered by near-realistic pictures of 25 city-level scenarios. We find that there is still a significant gap between the baseline model and human performance, which suggests AerialVLN is a new challenging task.
arXiv Detail & Related papers (2023-08-13T09:55:04Z)
Vision-Based UAV Self-Positioning in Low-Altitude Urban Environments [20.69412701553767]
Unmanned Aerial Vehicles (UAVs) rely on satellite systems for stable positioning. In such situations, vision-based techniques can serve as an alternative, ensuring the self-positioning capability of UAVs. This paper presents a new dataset, DenseUAV, which is the first publicly available dataset designed for the UAV self-positioning task.
arXiv Detail & Related papers (2022-01-23T07:18:55Z)
A Multi-UAV System for Exploration and Target Finding in Cluttered and GPS-Denied Environments [68.31522961125589]
We propose a framework for a team of UAVs to cooperatively explore and find a target in complex GPS-denied environments with obstacles. The team of UAVs autonomously navigates, explores, detects, and finds the target in a cluttered environment with a known map. Results indicate that the proposed multi-UAV system has improvements in terms of time-cost, the proportion of search area surveyed, as well as successful rates for search and rescue missions.
arXiv Detail & Related papers (2021-07-19T12:54:04Z)
Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling [65.99956848461915]
Vision-and-Language Navigation (VLN) is a task where agents must decide how to move through a 3D environment to reach a goal. One of the problems of the VLN task is data scarcity since it is difficult to collect enough navigation paths with human-annotated instructions for interactive environments. We propose an adversarial-driven counterfactual reasoning model that can consider effective conditions instead of low-quality augmented data.
arXiv Detail & Related papers (2019-11-17T18:02:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.