AirNav: A Large-Scale Real-World UAV Vision-and-Language Navigation Dataset with Natural and Diverse Instructions
- URL: http://arxiv.org/abs/2601.03707v1
- Date: Wed, 07 Jan 2026 08:46:09 GMT
- Title: AirNav: A Large-Scale Real-World UAV Vision-and-Language Navigation Dataset with Natural and Diverse Instructions
- Authors: Hengxing Cai, Yijie Rao, Ligang Huang, Zanyang Zhong, Jinhan Dong, Jingjun Tan, Wenhao Lu, Renxin Zhong,
- Abstract summary: We propose AirNav, a large-scale UAV VLN benchmark constructed from real urban aerial data. We also introduce AirVLN-R1, which combines Supervised Fine-Tuning and Reinforcement Fine-Tuning to enhance performance and generalization.
- Score: 6.369522034276603
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing Unmanned Aerial Vehicle (UAV) Vision-Language Navigation (VLN) datasets face issues such as dependence on virtual environments, lack of naturalness in instructions, and limited scale. To address these challenges, we propose AirNav, a large-scale UAV VLN benchmark constructed from real urban aerial data, rather than synthetic environments, with natural and diverse instructions. Additionally, we introduce AirVLN-R1, which combines Supervised Fine-Tuning and Reinforcement Fine-Tuning to enhance performance and generalization. The feasibility of the model is preliminarily evaluated through real-world tests. Our dataset and code are publicly available.
Related papers
- IndoorUAV: Benchmarking Vision-Language UAV Navigation in Continuous Indoor Environments [21.821075450697027]
Vision-and-Language Navigation (VLN) enables agents to navigate in complex environments by following natural language instructions grounded in visual observations. Indoor UAV-based VLN remains underexplored, despite its relevance to real-world applications such as inspection, delivery, and search-and-rescue in confined spaces. We introduce IndoorUAV, a novel benchmark and method specifically tailored for VLN with indoor UAVs.
arXiv Detail & Related papers (2025-12-22T04:42:35Z) - History-Enhanced Two-Stage Transformer for Aerial Vision-and-Language Navigation [64.51891404034164]
Aerial Vision-and-Language Navigation (AVLN) requires Unmanned Aerial Vehicle (UAV) agents to localize targets in large-scale urban environments. Existing UAV agents typically adopt mono-granularity frameworks that struggle to balance these two aspects. This work proposes a History-Enhanced Two-Stage Transformer (HETT) framework, which integrates the two aspects through a coarse-to-fine navigation pipeline.
arXiv Detail & Related papers (2025-12-16T09:16:07Z) - SkyVLN: Vision-and-Language Navigation and NMPC Control for UAVs in Urban Environments [7.251041314934871]
Unmanned Aerial Vehicles (UAVs) have emerged as versatile tools across various sectors, driven by their mobility and adaptability. This paper introduces SkyVLN, a novel framework integrating vision-and-language navigation (VLN) with Nonlinear Model Predictive Control (NMPC) to enhance UAV autonomy in complex urban environments.
arXiv Detail & Related papers (2025-07-09T05:38:32Z) - VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning [77.34267241692706]
Vision-Language Navigation (VLN) is a core challenge in embodied AI, requiring agents to navigate real-world environments using natural language instructions. We propose VLN-R1, an end-to-end framework that leverages Large Vision-Language Models (LVLMs) to directly translate egocentric video streams into continuous navigation actions.
arXiv Detail & Related papers (2025-06-20T17:59:59Z) - FlightGPT: Towards Generalizable and Interpretable UAV Vision-and-Language Navigation with Vision-Language Models [11.286340789648813]
Unmanned Aerial Vehicle (UAV) Vision-and-Language Navigation (VLN) is vital for applications such as disaster response, logistics delivery, and urban inspection. We propose FlightGPT, a novel UAV VLN framework built upon Vision-Language Models (VLMs) with powerful multimodal perception capabilities. We show that FlightGPT achieves state-of-the-art performance across all scenarios, with a 9.22% higher success rate than the strongest baseline in unseen environments.
arXiv Detail & Related papers (2025-05-19T08:21:20Z) - More Clear, More Flexible, More Precise: A Comprehensive Oriented Object Detection benchmark for UAV [58.89234732689013]
CODrone is a comprehensive oriented object detection dataset for UAVs that accurately reflects real-world conditions. It also serves as a new benchmark designed to align with downstream task requirements. We conduct a series of experiments based on 22 classical or SOTA methods to rigorously evaluate CODrone.
arXiv Detail & Related papers (2025-04-28T17:56:02Z) - OpenFly: A Comprehensive Platform for Aerial Vision-Language Navigation [49.697035403548966]
Vision-Language Navigation (VLN) aims to guide agents by leveraging language instructions and visual cues, playing a pivotal role in embodied AI. We propose OpenFly, a platform comprising various rendering engines, a versatile toolchain, and a large-scale benchmark for aerial VLN. We construct a large-scale aerial VLN dataset with 100k trajectories, covering diverse heights and lengths across 18 scenes.
arXiv Detail & Related papers (2025-02-25T09:57:18Z) - CityNav: A Large-Scale Dataset for Real-World Aerial Navigation [25.51740922661166]
We introduce CityNav, the first large-scale real-world dataset for aerial VLN. Our dataset consists of 32,637 human demonstration trajectories, each paired with a natural language description. We provide a methodology of creating geographic semantic maps that can be used as an auxiliary modality input during navigation.
arXiv Detail & Related papers (2024-06-20T12:08:27Z) - NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation [23.72290930234063]
NaVid is a video-based large vision language model (VLM) for vision-and-language navigation.
NaVid achieves state-of-the-art performance in simulation environments and the real world, demonstrating superior cross-dataset and Sim2Real transfer.
arXiv Detail & Related papers (2024-02-24T16:39:16Z) - AerialVLN: Vision-and-Language Navigation for UAVs [23.40363176320464]
We propose a new task named AerialVLN, which is UAV-based and towards outdoor environments.
We develop a 3D simulator rendered by near-realistic pictures of 25 city-level scenarios.
We find that there is still a significant gap between the baseline model and human performance, which suggests AerialVLN is a new challenging task.
arXiv Detail & Related papers (2023-08-13T09:55:04Z) - Airbert: In-domain Pretraining for Vision-and-Language Navigation [91.03849833486974]
Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions.
Recent methods explore pretraining to improve generalization of VLN agents.
We introduce BnB, a large-scale and diverse in-domain VLN dataset.
arXiv Detail & Related papers (2021-08-20T10:58:09Z) - Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling [65.99956848461915]
Vision-and-Language Navigation (VLN) is a task where agents must decide how to move through a 3D environment to reach a goal. One of the problems of the VLN task is data scarcity, since it is difficult to collect enough navigation paths with human-annotated instructions for interactive environments. We propose an adversarial-driven counterfactual reasoning model that can consider effective conditions instead of low-quality augmented data.
arXiv Detail & Related papers (2019-11-17T18:02:51Z)