Fugu-MT Paper Translation (Abstract): Towards Accurate UAV Image Perception: Guiding Vision-Language Models with Stronger Task Prompts

Paper overview: Towards Accurate UAV Image Perception: Guiding Vision-Language Models with Stronger Task Prompts

arxiv url: http://arxiv.org/abs/2512.07302v1
Date: Mon, 08 Dec 2025 08:44:57 GMT
Status: Translation complete
Last updated in system: 2025-12-09 22:03:54.788934
Title: Towards Accurate UAV Image Perception: Guiding Vision-Language Models with Stronger Task Prompts
Title (reference translation): Toward High-Accuracy UAV Image Recognition: Guiding Vision-Language Models with Stronger Task Prompts
Authors: Mingning Guo, Mengwei Wu, Shaoxian Li, Haifeng Li, Chao Tao
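Abstract summary: This paper introduces AerialVP, the first agent framework for task prompt enhancement in UAV image perception. AerialVP proactively extracts multi-dimensional auxiliary information from UAV images to enhance task prompts. AerialSense is a benchmark for UAV image perception that includes Aerial Visual Reasoning, Aerial Visual Question Answering, and Aerial Visual Grounding tasks.
Reference score (independently computed attention metric): 2.3160863001888914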
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing image perception methods based on VLMs generally follow a paradigm wherein models extract and analyze image content based on user-provided textual task prompts. However, such methods face limitations when applied to UAV imagery, which presents challenges like target confusion, scale variations, and complex backgrounds. These challenges arise because VLMs' understanding of image content depends on the semantic alignment between visual and textual tokens. When the task prompt is simplistic and the image content is complex, achieving effective alignment becomes difficult, limiting the model's ability to focus on task-relevant information. To address this issue, we introduce AerialVP, the first agent framework for task prompt enhancement in UAV image perception. AerialVP proactively extracts multi-dimensional auxiliary information from UAV images to enhance task prompts, overcoming the limitations of traditional VLM-based approaches. Specifically, the enhancement process includes three stages: (1) analyzing the task prompt to identify the task type and enhancement needs, (2) selecting appropriate tools from the tool repository, and (3) generating enhanced task prompts based on the analysis and selected tools. To evaluate AerialVP, we introduce AerialSense, a comprehensive benchmark for UAV image perception that includes Aerial Visual Reasoning, Aerial Visual Question Answering, and Aerial Visual Grounding tasks. AerialSense provides a standardized basis for evaluating model generalization and performance across diverse resolutions, lighting conditions, and both urban and natural scenes. Experimental results demonstrate that AerialVP significantly enhances task prompt guidance, leading to stable and substantial performance improvements in both open-source and proprietary VLMs. Our work will be available at https://github.com/lostwolves/AerialVP.
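Abstract (reference translation): Existing VLM-based image perception methods generally follow a paradigm in which image content is extracted and analyzed according to a user-provided textual task prompt. However, such methods face limitations when applied to UAV imagery, which poses challenges such as target confusion, scale variation, and complex backgrounds. These challenges arise because a VLM's understanding of image content depends on the semantic alignment between visual tokens and text tokens. When the task prompt is simple and the image content is complex, effective alignment becomes difficult to achieve, which limits the model's ability to focus on task-relevant information. To address this problem, we introduce AerialVP, the first agent framework for task prompt enhancement in UAV image perception. AerialVP proactively extracts multi-dimensional auxiliary information from UAV images to enhance task prompts, overcoming the limitations of conventional VLM-based approaches. Specifically, the enhancement process consists of three stages: (1) analyzing the task prompt to identify the task type and enhancement needs, (2) selecting appropriate tools from the tool repository, and (3) generating an enhanced task prompt based on the analysis and the selected tools. To evaluate AerialVP, we introduce AerialSense, a comprehensive benchmark for UAV image perception that includes Aerial Visual Reasoning, Aerial Visual Question Answering, and Aerial Visual Grounding tasks. AerialSense provides a standardized basis for evaluating model generalization and performance across diverse resolutions, lighting conditions, and both urban and natural scenes. Experimental results show that AerialVP significantly improves task prompt guidance, yielding stable and substantial performance gains for both open-source and proprietary VLMs. Our work will be made available at https://github.com/lostwolves/AerialVP.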
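
The three-stage enhancement process named in the abstract (analyze the task prompt, select tools, generate an enhanced prompt) can be illustrated with a toy sketch. This is not the AerialVP implementation: the abstract does not describe how each stage is realized, so everything below is an assumption made for illustration. That covers the keyword-based task classifier, the task labels, the tool names (scene_tool, scale_tool, lighting_tool), the image_meta dictionary standing in for a real UAV image, and the output format. A real agent would presumably use a VLM or LLM for the analysis and actual vision tools for the auxiliary information.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical keyword lists; a stand-in for whatever analysis the real agent performs.
TASK_KEYWORDS: Dict[str, List[str]] = {
    "grounding": ["locate", "find", "where", "bounding box"],
    "vqa": ["how many", "what", "which", "is there"],
    "reasoning": ["why", "infer", "explain", "likely"],
}

# Hypothetical mapping from task type to the auxiliary information it needs.
NEEDS_BY_TASK: Dict[str, List[str]] = {
    "grounding": ["scale", "scene"],
    "vqa": ["scene"],
    "reasoning": ["scene", "lighting"],
}


@dataclass
class Analysis:
    task_type: str
    needs: List[str]


def analyze_prompt(prompt: str) -> Analysis:
    """Stage 1: identify the task type and enhancement needs (toy keyword match)."""
    text = prompt.lower()
    for task_type, keywords in TASK_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return Analysis(task_type, NEEDS_BY_TASK[task_type])
    # Fall back to VQA when no keyword matches (an arbitrary choice for this sketch).
    return Analysis("vqa", NEEDS_BY_TASK["vqa"])


# Stub tools: each reads precomputed metadata instead of analyzing pixels.
def scene_tool(image_meta: dict) -> str:
    return f"Scene type: {image_meta.get('scene', 'unknown')}."


def scale_tool(image_meta: dict) -> str:
    return f"Approximate ground resolution: {image_meta.get('gsd_m', 'unknown')} m/pixel."


def lighting_tool(image_meta: dict) -> str:
    return f"Lighting: {image_meta.get('lighting', 'unknown')}."


TOOL_REPOSITORY: Dict[str, Callable[[dict], str]] = {
    "scene": scene_tool,
    "scale": scale_tool,
    "lighting": lighting_tool,
}


def select_tools(analysis: Analysis) -> List[Callable[[dict], str]]:
    """Stage 2: pick the tools that cover the identified needs."""
    return [TOOL_REPOSITORY[need] for need in analysis.needs if need in TOOL_REPOSITORY]


def enhance_prompt(prompt: str, image_meta: dict) -> str:
    """Stage 3: run the selected tools and prepend their output to the prompt."""
    analysis = analyze_prompt(prompt)
    hints = [tool(image_meta) for tool in select_tools(analysis)]
    return f"[Task: {analysis.task_type}] " + " ".join(hints) + f" Question: {prompt}"


if __name__ == "__main__":
    meta = {"scene": "urban", "gsd_m": 0.05, "lighting": "overcast"}
    print(enhance_prompt("Locate the red car near the intersection", meta))
```

The enhanced string would then be passed to a VLM in place of the original task prompt. Keeping the tools in a dictionary keyed by need makes it straightforward to register more of them without touching the three stage functions.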
