Fugu-MT Paper Translation (Abstract): MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

Paper abstract: MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

arxiv url: http://arxiv.org/abs/2410.10139v1
Date: Mon, 14 Oct 2024 04:15:00 GMT
Status: Translation complete
Last updated in system: 2024-10-30 02:54:14.421043
Title: MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
Title (reference translation): MMIE: A Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
Authors: Peng Xia, Siwei Han, Shi Qiu, Yiyang Zhou, Zhaoyang Wang, Wenhao Zheng, Zhaorun Chen, Chenhang Cui, Mingyu Ding, Linjie Li, Lijuan Wang, Huaxiu Yao
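Abstract summary: We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs and offers a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
Reference score (independently computed attention score): 71.36392373876505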
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Interleaved multimodal comprehension and generation, enabling models to produce and interpret both images and text in arbitrary sequences, have become a pivotal area in multimodal learning. Despite significant advancements, the evaluation of this capability remains insufficient. Existing benchmarks suffer from limitations in data scale, scope, and evaluation depth, while current evaluation metrics are often costly or biased, lacking in reliability for practical applications. To address these challenges, we introduce MMIE, a large-scale knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies. Moreover, we propose a reliable automated evaluation metric, leveraging a scoring model fine-tuned with human-annotated data and systematic evaluation criteria, aimed at reducing bias and improving evaluation accuracy. Extensive experiments demonstrate the effectiveness of our benchmark and metrics in providing a comprehensive evaluation of interleaved LVLMs. Specifically, we evaluate eight LVLMs, revealing that even the best models show significant room for improvement, with most achieving only moderate results. We believe MMIE will drive further advancements in the development of interleaved LVLMs. We publicly release our benchmark and code in https://mmie-bench.github.io/.
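Abstract (reference translation): Interleaved multimodal comprehension and generation, in which models generate and interpret images and text in arbitrary sequences, has become an important area of multimodal learning. Despite significant progress, evaluation of this capability remains insufficient. Existing benchmarks are limited in data scale, scope, and evaluation depth, while current evaluation metrics are often costly or biased and lack reliability for practical applications. To address these challenges, we introduce MMIE, a large-scale, knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs and offers a mix of multiple-choice and open-ended question formats to evaluate diverse competencies. We also propose a reliable automated evaluation metric that uses a scoring model fine-tuned with human-annotated data together with systematic evaluation criteria, with the aim of reducing bias and improving evaluation accuracy. Extensive experiments demonstrate the effectiveness of the benchmark and metric in providing a comprehensive evaluation of interleaved LVLMs. Specifically, we evaluate eight LVLMs and show that even the best models leave significant room for improvement, with most achieving only moderate results. We believe MMIE will drive further progress in the development of interleaved LVLMs. The benchmark and code are publicly available at https://mmie-bench.github.io/.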
