Fugu-MT Paper Translation (Abstract): MOAT: Evaluating LMMs for Capability Integration and Instruction Grounding

Paper overview: MOAT: Evaluating LMMs for Capability Integration and Instruction Grounding

arxiv url: http://arxiv.org/abs/2503.09348v1
Date: Wed, 12 Mar 2025 12:49:31 GMT
Status: Translation complete
System update date: 2025-03-13 21:17:52.761703
Title: MOAT: Evaluating LMMs for Capability Integration and Instruction Grounding
Title (reference translation): MOAT: Evaluation of LMMs for Capability Integration and Instruction Grounding
Authors: Zhoutong Ye, Mingze Sun, Huan-ang Gao, Chun Yu, Yuanchun Shi
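Abstract summary: Large multimodal models (LMMs) have shown significant potential as generalists in vision-language (VL) tasks. A large gap still remains between state-of-the-art LMMs and human performance. We propose MOAT, a benchmark of complex real-world VL tasks that challenge LMMs.
Reference score (independently calculated attention score): 27.140576967695413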
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large multimodal models (LMMs) have demonstrated significant potential as generalists in vision-language (VL) tasks. However, there remains a significant gap between state-of-the-art LMMs and human performance when it comes to complex tasks that require a combination of fundamental VL capabilities, as well as tasks involving the grounding of complex instructions. To thoroughly investigate the human-LMM gap and its underlying causes, we propose MOAT, a diverse benchmark with complex real-world VL tasks that are challenging for LMMs. Specifically, the tasks in MOAT require LMMs to engage in generalist problem solving by integrating fundamental VL capabilities such as reading text, counting, understanding spatial relations, and grounding textual and visual instructions. All these abilities fit into a taxonomy we propose that contains 10 fundamental VL capabilities, enabling MOAT to provide a fine-grained view of LMMs' strengths and weaknesses. In addition, MOAT is the first benchmark to explicitly evaluate LMMs' ability to ground complex text and visual instructions, which is essential to many real-world applications. We evaluate over 20 proprietary and open-source LMMs, as well as humans, on MOAT, and find that humans achieve 82.7% accuracy while the best-performing LMM (OpenAI o1) achieves only 38.8%. To guide future model development, we analyze common trends in our results and discuss the underlying causes of observed performance gaps between LMMs and humans, focusing on which VL capability forms the bottleneck in complex tasks, whether test-time scaling improves performance on MOAT, and how tiling harms LMMs' capability to count. Code and data are available at https://cambrian-yzt.github.io/MOAT.
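Abstract (reference translation): Large multimodal models (LMMs) have shown great potential as generalists in vision-language (VL) tasks. However, a large gap remains between state-of-the-art LMMs and human performance on complex tasks that require a combination of fundamental VL capabilities and on tasks that involve grounding complex instructions. To thoroughly investigate the human-LMM gap and its causes, we propose MOAT, a diverse benchmark of complex real-world VL tasks. Specifically, the tasks in MOAT require LMMs to solve problems as generalists by integrating fundamental VL capabilities such as reading text, counting, understanding spatial relations, and grounding textual and visual instructions. All of these abilities fit into a taxonomy of 10 fundamental VL capabilities, which lets MOAT give a detailed picture of LMMs' strengths and weaknesses. Furthermore, MOAT is the first benchmark to explicitly evaluate LMMs' ability to ground complex textual and visual instructions, which is essential for many real-world applications. We evaluate more than 20 proprietary and open-source LMMs, as well as humans, on MOAT, and find that humans achieve 82.7% accuracy while the best-performing LMM (OpenAI o1) achieves only 38.8%. To guide future model development, we analyze common trends in our results and discuss the causes of the observed performance gaps between LMMs and humans, focusing on which VL capability is the bottleneck in complex tasks, whether test-time scaling improves performance on MOAT, and how tiling harms LMMs' ability to count. Code and data are available at https://cambrian-yzt.github.io/MOAT.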

Related papers