Fugu-MT Paper Translation (Abstract): Multimodal Mathematical Reasoning Embedded in Aerial Vehicle Imagery: Benchmarking, Analysis, and Exploration

Paper overview: Multimodal Mathematical Reasoning Embedded in Aerial Vehicle Imagery: Benchmarking, Analysis, and Exploration

arxiv url: http://arxiv.org/abs/2509.10059v1
Date: Fri, 12 Sep 2025 08:46:49 GMT
Status: Translation complete
Last updated in system: 2025-09-15 16:03:08.021066
Title: Multimodal Mathematical Reasoning Embedded in Aerial Vehicle Imagery: Benchmarking, Analysis, and Exploration
Title (reference translation): Multimodal Mathematical Reasoning Embedded in Aerial Vehicle Imagery: Benchmarking, Analysis, and Exploration
Authors: Yue Zhou, Litong Feng, Mengcheng Lan, Xue Yang, Qingyun Li, Yiping Ke, Xue Jiang, Wayne Zhang
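Abstract summary: We introduce AVI-Math, the first benchmark to rigorously evaluate multimodal mathematical reasoning in aerial vehicle imagery. The dataset comprises 3,773 high-quality vehicle-related questions captured from UAV views, covering 6 mathematical subjects and 20 topics. Our analysis highlights significant limitations in the mathematical reasoning capabilities of current vision-language models.
Reference score (independently computed attention score): 39.84712917520324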
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Mathematical reasoning is critical for tasks such as precise distance and area computations, trajectory estimations, and spatial analysis in unmanned aerial vehicle (UAV) based remote sensing, yet current vision-language models (VLMs) have not been adequately tested in this domain. To address this gap, we introduce AVI-Math, the first benchmark to rigorously evaluate multimodal mathematical reasoning in aerial vehicle imagery, moving beyond simple counting tasks to include domain-specific knowledge in areas such as geometry, logic, and algebra. The dataset comprises 3,773 high-quality vehicle-related questions captured from UAV views, covering 6 mathematical subjects and 20 topics. The data, collected at varying altitudes and from multiple UAV angles, reflects real-world UAV scenarios, ensuring the diversity and complexity of the constructed mathematical problems. In this paper, we benchmark 14 prominent VLMs through a comprehensive evaluation and demonstrate that, despite their success on previous multimodal benchmarks, these models struggle with the reasoning tasks in AVI-Math. Our detailed analysis highlights significant limitations in the mathematical reasoning capabilities of current VLMs and suggests avenues for future research. Furthermore, we explore the use of Chain-of-Thought prompting and fine-tuning techniques, which show promise in addressing the reasoning challenges in AVI-Math. Our findings not only expose the limitations of VLMs in mathematical reasoning but also offer valuable insights for advancing UAV-based trustworthy VLMs in real-world applications. The code and datasets will be released at https://github.com/VisionXLab/avi-math
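Abstract (reference translation): In unmanned aerial vehicle (UAV) based remote sensing, mathematical reasoning is critical for tasks such as precise distance and area computations, trajectory estimation, and spatial analysis, yet current vision-language models (VLMs) have not been adequately tested in this domain. To address this gap, we introduce AVI-Math, the first benchmark to rigorously evaluate multimodal mathematical reasoning in aerial vehicle imagery. The dataset comprises 3,773 high-quality vehicle-related questions captured from UAV views, covering 6 mathematical subjects and 20 topics. The data, collected at varying altitudes and from multiple UAV angles, reflects real-world UAV scenarios, ensuring the diversity and complexity of the constructed mathematical problems. In this paper, we show that, despite their success on previous multimodal benchmarks, these VLMs struggle with the reasoning tasks in AVI-Math. Our detailed analysis highlights significant limitations in the mathematical reasoning capabilities of current VLMs and suggests avenues for future research. Furthermore, we explore the use of Chain-of-Thought prompting and fine-tuning techniques to address the reasoning challenges in AVI-Math. Our findings not only expose the limitations of VLMs in mathematical reasoning but also offer valuable insights for advancing trustworthy UAV-based VLMs in real-world applications. The code and datasets will be released at https://github.com/VisionXLab/avi-math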

Related papers