Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation
- URL: http://arxiv.org/abs/2511.13269v1
- Date: Mon, 17 Nov 2025 11:39:20 GMT
- Title: Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation
- Authors: Lingfeng Zhang, Yuchen Zhang, Hongsheng Li, Haoxiang Fu, Yingbo Tang, Hangjun Ye, Long Chen, Xiaojun Liang, Xiaoshuai Hao, Wenbo Ding
- Abstract summary: Vision-Language Models (VLMs), leveraging their powerful visual perception and reasoning capabilities, have been widely applied in Unmanned Aerial Vehicle (UAV) tasks. However, the spatial intelligence capabilities of existing VLMs in UAV scenarios remain largely unexplored. We introduce SpatialSky-Bench, a benchmark designed to evaluate the spatial intelligence capabilities of VLMs in UAV navigation.
- Score: 38.19842131198389
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs), leveraging their powerful visual perception and reasoning capabilities, have been widely applied in Unmanned Aerial Vehicle (UAV) tasks. However, the spatial intelligence capabilities of existing VLMs in UAV scenarios remain largely unexplored, raising concerns about their effectiveness in navigating and interpreting dynamic environments. To bridge this gap, we introduce SpatialSky-Bench, a comprehensive benchmark specifically designed to evaluate the spatial intelligence capabilities of VLMs in UAV navigation. Our benchmark comprises two categories, Environmental Perception and Scene Understanding, divided into 13 subcategories, including bounding boxes, color, distance, height, and landing safety analysis, among others. Extensive evaluations of various mainstream open-source and closed-source VLMs reveal unsatisfactory performance in complex UAV navigation scenarios, highlighting significant gaps in their spatial capabilities. To address this challenge, we developed the SpatialSky-Dataset, a comprehensive dataset containing 1M samples with diverse annotations across various scenarios. Leveraging this dataset, we introduce Sky-VLM, a specialized VLM designed for UAV spatial reasoning across multiple granularities and contexts. Extensive experimental results demonstrate that Sky-VLM achieves state-of-the-art performance across all benchmark tasks, paving the way for the development of VLMs suitable for UAV scenarios. The source code is available at https://github.com/linglingxiansen/SpatialSKy.
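The abstract describes a benchmark organized into two categories and 13 subcategories (distance, height, landing safety analysis, and so on) with per-task evaluation of VLMs. As a rough illustration of how per-subcategory scoring over such a benchmark could be wired up, here is a minimal Python sketch. The JSONL layout, field names (`subcategory`, `image_path`, `question`, `answer`), and exact-match metric are illustrative assumptions, not the actual schema or metrics of the SpatialSKy repository.

```python
# Minimal sketch of a per-subcategory evaluation loop for a SpatialSky-Bench-style
# benchmark. All file names and field names are hypothetical assumptions.
import json
from pathlib import Path
from typing import Callable


def load_samples(path: Path) -> list[dict]:
    """Load benchmark samples from a JSON-lines file (assumed layout)."""
    with path.open() as f:
        return [json.loads(line) for line in f if line.strip()]


def evaluate(samples: list[dict], predict: Callable[[str, str], str]) -> dict[str, float]:
    """Score a model's answers per subcategory with exact-match accuracy (placeholder metric)."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for s in samples:
        sub = s["subcategory"]                      # e.g. "distance", "height", "landing_safety"
        pred = predict(s["image_path"], s["question"])
        total[sub] = total.get(sub, 0) + 1
        if pred.strip().lower() == s["answer"].strip().lower():
            correct[sub] = correct.get(sub, 0) + 1
    return {sub: correct.get(sub, 0) / n for sub, n in total.items()}


if __name__ == "__main__":
    # Dummy predictor standing in for an actual VLM call (API or local model).
    samples = load_samples(Path("spatialsky_bench.jsonl"))
    scores = evaluate(samples, predict=lambda image, question: "unknown")
    for sub, acc in sorted(scores.items()):
        print(f"{sub:>20s}: {acc:.3f}")
```

In practice a benchmark like this would likely use task-specific metrics (e.g. IoU for bounding boxes, numeric tolerance for distance and height), so the exact-match scoring above is only a stand-in.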
Related papers
- MM-UAVBench: How Well Do Multimodal Large Language Models See, Think, and Plan in Low-Altitude UAV Scenarios? [35.75859316774549]
We present MM-UAVBench, a comprehensive benchmark that systematically evaluates MLLMs across three core capability dimensions (perception, cognition, and planning) in low-altitude UAV scenarios. MM-UAVBench comprises 19 sub-tasks with over 5.7K manually annotated questions, all derived from real-world UAV data collected from public datasets. Our experiments reveal that current models struggle to adapt to the complex visual and cognitive demands of low-altitude scenarios.
arXiv Detail & Related papers (2025-12-29T05:49:54Z) - IndoorUAV: Benchmarking Vision-Language UAV Navigation in Continuous Indoor Environments [21.821075450697027]
Vision-Language Navigation (VLN) enables agents to navigate in complex environments by following natural language instructions grounded in visual observations. Indoor UAV-based VLN remains underexplored, despite its relevance to real-world applications such as inspection, delivery, and search-and-rescue in confined spaces. We introduce IndoorUAV, a novel benchmark and method specifically tailored for VLN with indoor UAVs.
arXiv Detail & Related papers (2025-12-22T04:42:35Z) - UAV-MM3D: A Large-Scale Synthetic Benchmark for 3D Perception of Unmanned Aerial Vehicles with Multi-Modal Data [47.317955428393134]
We introduce UAV-MM3D, a high-fidelity multimodal synthetic dataset for low-altitude UAV perception and motion understanding. It comprises 400K synchronized frames across diverse scenes (urban areas, suburbs, forests, coastal regions) and weather conditions. Each frame provides 2D/3D bounding boxes, 6-DoF poses, and instance-level annotations, enabling core tasks related to UAVs such as 3D detection, pose estimation, target tracking, and short-term trajectory forecasting.
arXiv Detail & Related papers (2025-11-27T12:30:28Z) - AerialMind: Towards Referring Multi-Object Tracking in UAV Scenarios [64.51320327698231]
We introduce AerialMind, the first large-scale RMOT benchmark in UAV scenarios. We develop an innovative semi-automated collaborative agent-based labeling assistant framework. We also propose HawkEyeTrack, a novel method that collaboratively enhances vision-language representation learning.
arXiv Detail & Related papers (2025-11-26T04:44:27Z) - A Tri-Modal Dataset and a Baseline System for Tracking Unmanned Aerial Vehicles [74.8162337823142]
MM-UAV is the first large-scale benchmark for Multi-Modal UAV Tracking. The dataset spans over 30 challenging scenarios, with 1,321 synchronised multi-modal sequences and more than 2.8 million annotated frames. Accompanying the dataset, we provide a novel multi-modal multi-UAV tracking framework.
arXiv Detail & Related papers (2025-11-23T08:42:17Z) - SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models [75.64836077468722]
Vision language models (VLMs) excel in 2D semantic visual understanding, but their ability to quantitatively reason about 3D spatial relationships remains under-explored. We propose SD-VLM, a novel framework that significantly enhances fundamental spatial perception abilities of VLMs. We have trained SD-VLM, a strong generalist VLM which shows superior quantitative spatial measuring and understanding capability.
arXiv Detail & Related papers (2025-09-22T12:08:12Z) - Multimodal Mathematical Reasoning Embedded in Aerial Vehicle Imagery: Benchmarking, Analysis, and Exploration [39.84712917520324]
We introduce AVI-Math, the first benchmark to rigorously evaluate multimodal mathematical reasoning in aerial vehicle imagery. The dataset comprises 3,773 high-quality vehicle-related questions captured from UAV views, covering 6 mathematical subjects and 20 topics. Our analysis highlights significant limitations in the mathematical reasoning capabilities of current vision-language models.
arXiv Detail & Related papers (2025-09-12T08:46:49Z) - UAVScenes: A Multi-Modal Dataset for UAVs [45.752766099526525]
UAVScenes is a large-scale dataset designed to benchmark various tasks across both 2D and 3D modalities. We enhance this dataset by providing manually labeled semantic annotations for both frame-wise images and LiDAR point clouds. These additions enable a wide range of UAV perception tasks, including segmentation, depth estimation, 6-DoF localization, place recognition, and novel view synthesis.
arXiv Detail & Related papers (2025-07-30T06:29:52Z) - LLM Meets the Sky: Heuristic Multi-Agent Reinforcement Learning for Secure Heterogeneous UAV Networks [57.27815890269697]
This work focuses on maximizing the secrecy rate in heterogeneous UAV networks (HetUAVNs) under energy constraints. We introduce a Large Language Model (LLM)-guided multi-agent learning approach. Results show that our method outperforms existing baselines in secrecy and energy efficiency.
arXiv Detail & Related papers (2025-07-23T04:22:57Z) - ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models [68.46716645478661]
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content. Current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for multi-viewpoint spatial localization recognition evaluation.
arXiv Detail & Related papers (2025-05-27T17:59:26Z) - More Clear, More Flexible, More Precise: A Comprehensive Oriented Object Detection benchmark for UAV [58.89234732689013]
CODrone is a comprehensive oriented object detection dataset for UAVs that accurately reflects real-world conditions. It also serves as a new benchmark designed to align with downstream task requirements. We conduct a series of experiments based on 22 classical or SOTA methods to rigorously evaluate CODrone.
arXiv Detail & Related papers (2025-04-28T17:56:02Z) - Exploring the best way for UAV visual localization under Low-altitude Multi-view Observation Condition: a Benchmark [6.693781685584959]
Low-altitude, multi-view UAV AVL presents greater challenges due to extreme viewpoint changes. This benchmark revealed the challenges of low-altitude, multi-view UAV AVL and provided valuable guidance for future research.
arXiv Detail & Related papers (2025-03-12T03:29:27Z) - Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos [126.02606196101259]
Sa2VA is a comprehensive, unified model for dense grounded understanding of both images and videos. It supports a wide range of image and video tasks, including referring segmentation and conversation. Sa2VA can be easily extended into various VLMs, including Qwen-VL and Intern-VL.
arXiv Detail & Related papers (2025-01-07T18:58:54Z) - Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology [38.2096731046639]
Recent efforts in UAV vision-language navigation predominantly adopt ground-based VLN settings.
We propose solutions from three perspectives: platform, benchmark, and methodology.
arXiv Detail & Related papers (2024-10-09T17:29:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed above and is not responsible for any consequences of its use.