Fugu-MT Paper Translation (Summary): Can Multimodal Large Language Models Truly Understand Small Objects?

Paper summary: Can Multimodal Large Language Models Truly Understand Small Objects?

arxiv url: http://arxiv.org/abs/2604.22884v1
Date: Fri, 24 Apr 2026 08:13:19 GMT
Status: Translation complete
Last updated in system: 2026-04-28 17:12:07.029108
Title: Can Multimodal Large Language Models Truly Understand Small Objects?
Title (reference translation): Can multimodal large language models truly understand small objects?
Authors: Fujun Han, Junan Chen, Xintong Zhu, Jingqi Ye, Xuanjie Mao, Tao Chen, Peng Ye
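Abstract summary: We introduce SOUBench, the first comprehensive benchmark for examining the small object understanding capability of existing MLLMs. We conduct a comprehensive evaluation of 15 state-of-the-art MLLMs and reveal their weaknesses. Furthermore, we develop SOU-Train, a multimodal training dataset with 11,226 VQA pairs, to improve the SOU performance of MLLMs.
Reference score (independently computed attention score): 9.082671977975483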
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) have shown promising potential in diverse understanding tasks, e.g., image and video analysis and math and physics olympiads. However, their capabilities on Small Object Understanding (SOU) tasks remain unexplored. To fill this gap, we introduce SOUBench, the first comprehensive benchmark for exploring the small object understanding capability of existing MLLMs. Specifically, we first design an effective and automatic visual question-answer generation strategy and use it to construct a new SOU-VQA evaluation dataset with 18,204 VQA pairs, six relevant sub-tasks, and three dominant scenarios (i.e., Driving, Aerial, and Underwater). Then, we conduct a comprehensive evaluation of 15 state-of-the-art MLLMs and reveal their weak capabilities in small object understanding. Furthermore, we develop SOU-Train, a multimodal training dataset with 11,226 VQA pairs, to improve the SOU capabilities of MLLMs. Through supervised fine-tuning of the latest MLLM, we demonstrate that SOU-Train can effectively enhance its ability to understand small objects. Comprehensive experimental results demonstrate that the proposed SOUBench, along with the SOU-VQA and SOU-Train datasets, provides a crucial empirical foundation for the community to further develop models with enhanced small object understanding capabilities. Datasets and code: https://github.com/Hanfj-X/SOU
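Abstract (reference translation): Multimodal Large Language Models (MLLMs) have shown promising potential in diverse understanding tasks such as image and video analysis and math and physics olympiads. However, they remain unexplored for Small Object Understanding (SOU) tasks. To fill this gap, we introduce SOUBench, the first comprehensive benchmark for exploring the small object understanding capability of existing MLLMs. Specifically, we first construct a new SOU-VQA evaluation dataset with 18,204 VQA pairs, six relevant sub-tasks, and three dominant scenarios (Driving, Aerial, and Underwater). We then conduct a comprehensive evaluation of 15 state-of-the-art MLLMs and reveal their weaknesses. Furthermore, we develop SOU-Train, a multimodal training dataset with 11,226 VQA pairs, to improve the SOU performance of MLLMs. Through supervised fine-tuning of the latest MLLM, we demonstrate that SOU-Train can effectively enhance its ability to understand small objects. Comprehensive experimental results show that the proposed SOUBench, together with the SOU-VQA and SOU-Train datasets, provides an important empirical foundation for the community to further develop models with enhanced small object understanding capabilities. Datasets and code: https://github.com/Hanfj-X/SOU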

Related papers