MM-UAVBench: How Well Do Multimodal Large Language Models See, Think, and Plan in Low-Altitude UAV Scenarios?
- URL: http://arxiv.org/abs/2512.23219v1
- Date: Mon, 29 Dec 2025 05:49:54 GMT
- Title: MM-UAVBench: How Well Do Multimodal Large Language Models See, Think, and Plan in Low-Altitude UAV Scenarios?
- Authors: Shiqi Dai, Zizhi Ma, Zhicong Luo, Xuesong Yang, Yibin Huang, Wanyue Zhang, Chi Chen, Zonghao Guo, Wang Xu, Yufei Sun, Maosong Sun
- Abstract summary: We present MM-UAVBench, a comprehensive benchmark that systematically evaluates MLLMs across three core capability dimensions (perception, cognition, and planning) in low-altitude UAV scenarios. MM-UAVBench comprises 19 sub-tasks with over 5.7K manually annotated questions, all derived from real-world UAV data collected from public datasets. Our experiments reveal that current models struggle to adapt to the complex visual and cognitive demands of low-altitude scenarios.
- Score: 35.75859316774549
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Multimodal Large Language Models (MLLMs) have exhibited remarkable general intelligence across diverse domains, their potential in low-altitude applications dominated by Unmanned Aerial Vehicles (UAVs) remains largely underexplored. Existing MLLM benchmarks rarely cover the unique challenges of low-altitude scenarios, while UAV-related evaluations mainly focus on specific tasks such as localization or navigation, without a unified evaluation of MLLMs' general intelligence. To bridge this gap, we present MM-UAVBench, a comprehensive benchmark that systematically evaluates MLLMs across three core capability dimensions (perception, cognition, and planning) in low-altitude UAV scenarios. MM-UAVBench comprises 19 sub-tasks with over 5.7K manually annotated questions, all derived from real-world UAV data collected from public datasets. Extensive experiments on 16 open-source and proprietary MLLMs reveal that current models struggle to adapt to the complex visual and cognitive demands of low-altitude scenarios. Our analyses further uncover critical bottlenecks, such as spatial bias and multi-view understanding, that hinder the effective deployment of MLLMs in UAV scenarios. We hope MM-UAVBench will foster future research on robust and reliable MLLMs for real-world UAV intelligence.
Related papers
- Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation [38.19842131198389]
Vision-Language Models (VLMs), leveraging their powerful visual perception and reasoning capabilities, have been widely applied in Unmanned Aerial Vehicle (UAV) tasks. However, the spatial intelligence capabilities of existing VLMs in UAV scenarios remain largely unexplored. We introduce SpatialSky-Bench, a benchmark designed to evaluate the spatial intelligence capabilities of VLMs in UAV navigation.
arXiv Detail & Related papers (2025-11-17T11:39:20Z) - MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs [61.70050081221131]
MVU-Eval is the first comprehensive benchmark for evaluating Multi-Video Understanding for MLLMs. Our MVU-Eval mainly assesses eight core competencies through 1,824 meticulously curated question-answer pairs spanning 4,959 videos. These capabilities are rigorously aligned with real-world applications such as multi-sensor synthesis in autonomous systems and cross-angle sports analytics.
arXiv Detail & Related papers (2025-11-10T16:02:33Z) - Rethinking Facial Expression Recognition in the Era of Multimodal Large Language Models: Benchmark, Datasets, and Beyond [116.65158801881984]
We introduce post-training strategies aimed at enhancing the facial expression reasoning capabilities of MLLMs. We develop a unified and interpretable FER foundation model termed UniFER-7B.
arXiv Detail & Related papers (2025-11-01T03:53:00Z) - HiProbe-VAD: Video Anomaly Detection via Hidden States Probing in Tuning-Free Multimodal LLMs [8.18063726177317]
Video Anomaly Detection (VAD) aims to identify and locate deviations from normal patterns in video sequences. We propose HiProbe-VAD, a novel framework that leverages pre-trained Multimodal Large Language Models (MLLMs) for VAD without requiring fine-tuning.
arXiv Detail & Related papers (2025-07-23T10:41:46Z) - EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents [57.4686961979566]
EmbodiedEval is a comprehensive and interactive evaluation benchmark for MLLMs with embodied tasks. It covers a broad spectrum of existing embodied AI tasks with significantly enhanced diversity. We evaluated the state-of-the-art MLLMs on EmbodiedEval and found that they fall significantly short of human level on embodied tasks.
arXiv Detail & Related papers (2025-01-21T03:22:10Z) - UAVs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude Mobility [33.73170899086857]
Low-altitude mobility, exemplified by unmanned aerial vehicles (UAVs), has introduced transformative advancements across various domains. This paper explores the integration of large language models (LLMs) and UAVs. It categorizes and analyzes key tasks and application scenarios where UAVs and LLMs converge.
arXiv Detail & Related papers (2025-01-04T17:32:12Z) - Benchmarking Large and Small MLLMs [71.78055760441256]
Large multimodal language models (MLLMs) have achieved remarkable advancements in understanding and generating multimodal content. However, their deployment faces significant challenges, including slow inference, high computational cost, and impracticality for on-device applications. Small MLLMs, exemplified by the LLaVA-series models and Phi-3-Vision, offer promising alternatives with faster inference, reduced deployment costs, and the ability to handle domain-specific scenarios.
arXiv Detail & Related papers (2025-01-04T07:44:49Z) - Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning [44.12214030785711]
We review the existing evaluation protocols of multimodal reasoning, and categorize and illustrate the frontiers of Multimodal Large Language Models (MLLMs).
We introduce recent trends in applications of MLLMs on reasoning-intensive tasks and discuss current practices and future directions.
arXiv Detail & Related papers (2024-01-10T15:29:21Z) - Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models [76.99140362751787]
We present NuInstruct, a novel dataset with 91K multi-view video-QA pairs across 17 subtasks.
We also present BEV-InMLLM, an end-to-end method for efficiently deriving instruction-aware Bird's-Eye-View features.
arXiv Detail & Related papers (2024-01-02T01:54:22Z) - MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria [49.500322937449326]
Multimodal large language models (MLLMs) have broadened the scope of AI applications.
Existing automatic evaluation methodologies for MLLMs are largely limited to evaluating queries without considering user experience.
We propose a new evaluation paradigm for MLLMs: evaluating them against per-sample criteria, using a potent MLLM as the judge.
arXiv Detail & Related papers (2023-11-23T12:04:25Z) - MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models [111.51612340032052]
Multimodal Large Language Models (MLLMs) rely on powerful LLMs to perform multimodal tasks. This paper presents the first comprehensive MLLM evaluation benchmark, MME. It measures both perception and cognition abilities on a total of 14 subtasks.
arXiv Detail & Related papers (2023-06-23T09:22:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.