Fugu-MT Paper Translation (Abstract): UAVBench and UAVIT-1M: Benchmarking and Enhancing MLLMs for Low-Altitude UAV Vision-Language Understanding

Paper overview: UAVBench and UAVIT-1M: Benchmarking and Enhancing MLLMs for Low-Altitude UAV Vision-Language Understanding

arxiv url: http://arxiv.org/abs/2603.14336v1
Date: Sun, 15 Mar 2026 12:04:23 GMT
Status: Translation complete
Last updated in system: 2026-03-17 16:19:35.756018
Title: UAVBench and UAVIT-1M: Benchmarking and Enhancing MLLMs for Low-Altitude UAV Vision-Language Understanding
Authors: Yang Zhan, Yuan Yuan
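Abstract summary: UAVBench and UAVIT-1M are designed to evaluate and improve MLLMs' abilities in low-altitude vision-language tasks. UAVBench comprises 43 test units and 966k high-quality data samples across 10 tasks at the image level and region level. UAVIT-1M consists of approximately 1.24 million diverse instructions, covering 789k multi-scene images and about 2,000 types of spatial resolutions with 11 distinct tasks.
Reference score (independently computed attention metric): 4.817647738745087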
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) have made significant strides in natural images and satellite remote sensing images. However, understanding low-altitude drone scenarios remains a challenge. Existing datasets primarily focus on a few specific low-altitude visual tasks, which cannot fully assess the ability of MLLMs in real-world low-altitude UAV applications. Therefore, we introduce UAVBench, a comprehensive benchmark, and UAVIT-1M, a large-scale instruction tuning dataset, designed to evaluate and improve MLLMs' abilities in low-altitude vision-language tasks. UAVBench comprises 43 test units and 966k high-quality data samples across 10 tasks at the image level and region level. UAVIT-1M consists of approximately 1.24 million diverse instructions, covering 789k multi-scene images and about 2,000 types of spatial resolutions with 11 distinct tasks. UAVBench and UAVIT-1M feature pure real-world visual images and rich weather conditions, and involve manual verification to ensure high quality. Our in-depth analysis of 11 state-of-the-art MLLMs using UAVBench reveals that open-source MLLMs cannot generate accurate conversations about low-altitude visual content, lagging behind closed-source MLLMs. Extensive experiments demonstrate that fine-tuning open-source MLLMs on UAVIT-1M significantly addresses this gap. Our contributions pave the way for bridging the gap between current MLLMs and low-altitude UAV real-world application demands. (Project page: https://UAVBench.github.io/)