Fugu-MT Paper Translation (Abstract): ChEF: A Comprehensive Evaluation Framework for Standardized Assessment of Multimodal Large Language Models

Paper overview: ChEF: A Comprehensive Evaluation Framework for Standardized Assessment of Multimodal Large Language Models

arxiv url: http://arxiv.org/abs/2311.02692v1
Date: Sun, 5 Nov 2023 16:01:40 GMT
Status: Translation complete
System update date: 2023-11-07 16:14:20.031266
Title: ChEF: A Comprehensive Evaluation Framework for Standardized Assessment of Multimodal Large Language Models
Title (reference translation): ChEF: A Comprehensive Evaluation Framework for Standardized Evaluation of Multimodal Large Language Models
Authors: Zhelun Shi, Zhipin Wang, Hongxing Fan, Zhenfei Yin, Lu Sheng, Yu Qiao, Jing Shao
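Abstract summary: Multimodal Large Language Models (MLLMs) show impressive abilities in interacting with visual content across myriad downstream tasks. This paper proposes ChEF, the first comprehensive evaluation framework that holistically profiles each MLLM and compares different MLLMs. All detailed implementations will be publicly released for further analysis, together with an easy-to-use modular toolkit for integrating new recipes and models.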
Reference score (independently computed attention score): 49.48109472893714
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) have shown impressive abilities in interacting with visual content with myriad potential downstream tasks. However, even though a list of benchmarks has been proposed, the capabilities and limitations of MLLMs are still not comprehensively understood, due to a lack of a standardized and holistic evaluation framework. To this end, we present the first Comprehensive Evaluation Framework (ChEF) that can holistically profile each MLLM and fairly compare different MLLMs. First, we structure ChEF as four modular components, i.e., Scenario as scalable multimodal datasets, Instruction as flexible instruction retrieving formulae, Inferencer as reliable question answering strategies, and Metric as indicative task-specific score functions. Based on them, ChEF facilitates versatile evaluations in a standardized framework, and new evaluations can be built by designing new Recipes (systematic selection of these four components). Notably, current MLLM benchmarks can be readily summarized as recipes of ChEF. Second, we introduce 6 new recipes to quantify competent MLLMs' desired capabilities (or called desiderata, i.e., calibration, in-context learning, instruction following, language performance, hallucination, and robustness) as reliable agents that can perform real-world multimodal interactions. Third, we conduct a large-scale evaluation of 9 prominent MLLMs on 9 scenarios and 6 desiderata. Our evaluation summarized over 20 valuable observations concerning the generalizability of MLLMs across various scenarios and the composite capability of MLLMs required for multimodal interactions. We will publicly release all the detailed implementations for further analysis, as well as an easy-to-use modular toolkit for the integration of new recipes and models, so that ChEF can be a growing evaluation framework for the MLLM community.
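Abstract (reference translation): Multimodal Large Language Models (MLLMs) show impressive abilities in interacting with visual content across myriad downstream tasks. However, although a number of benchmarks have been proposed, the capabilities and limitations of MLLMs are still not comprehensively understood, because a standardized and holistic evaluation framework is lacking. To this end, we propose the first Comprehensive Evaluation Framework (ChEF), which holistically profiles each MLLM and fairly compares different MLLMs. First, we structure ChEF as four modular components: Scenario as scalable multimodal datasets, Instruction as flexible instruction retrieving formulae, Inferencer as reliable question answering strategies, and Metric as indicative task-specific score functions. Building on them, ChEF facilitates versatile evaluations in a standardized framework, and new evaluations can be built by designing new Recipes (systematic selections of these four components). Current MLLM benchmarks can be readily summarized as recipes of ChEF. Second, we introduce 6 new recipes that quantify the capabilities desired of competent MLLMs (the desiderata: calibration, in-context learning, instruction following, language performance, hallucination, and robustness) as reliable agents that can perform real-world multimodal interactions. Third, we conduct a large-scale evaluation of 9 prominent MLLMs on 9 scenarios and 6 desiderata. The evaluation summarizes over 20 valuable observations concerning the generalizability of MLLMs across scenarios and the composite capability of MLLMs required for multimodal interactions. We will publicly release all detailed implementations for further analysis, along with an easy-to-use modular toolkit for integrating new recipes and models, so that ChEF can be a growing evaluation framework for the MLLM community.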
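
The four-component structure described in the abstract can be illustrated with a minimal sketch. This is not the ChEF toolkit's API: the names `Recipe`, `evaluate`, `accuracy`, `toy_recipe`, and `toy_model` are hypothetical, images are omitted, and the toy data exists only to show how a recipe could bundle a Scenario, an Instruction, an Inferencer, and a Metric.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# All names below are hypothetical illustrations, not the ChEF toolkit's API.
Sample = Tuple[str, str]          # (question, reference answer); image omitted
Model = Callable[[str], str]      # a model maps a prompt to a text answer


@dataclass
class Recipe:
    """A recipe is one selection of the four components named in the abstract."""
    scenario: Sequence[Sample]                          # Scenario: the dataset
    instruction: Callable[[str], str]                   # Instruction: builds the prompt
    inferencer: Callable[[Model, str], str]             # Inferencer: how the model is queried
    metric: Callable[[List[str], List[str]], float]     # Metric: scores the predictions


def evaluate(model: Model, recipe: Recipe) -> float:
    """Run one model through one recipe and return the metric value."""
    predictions: List[str] = []
    references: List[str] = []
    for question, answer in recipe.scenario:
        prompt = recipe.instruction(question)
        predictions.append(recipe.inferencer(model, prompt))
        references.append(answer)
    return recipe.metric(predictions, references)


def accuracy(predictions: List[str], references: List[str]) -> float:
    """Case-insensitive exact-match accuracy."""
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Toy recipe: two text-only questions, a fixed prompt template,
# direct generation as the inference strategy, and accuracy as the metric.
toy_recipe = Recipe(
    scenario=[("What animal is shown?", "cat"), ("What color is the sky?", "blue")],
    instruction=lambda q: f"Question: {q}\nAnswer:",
    inferencer=lambda model, prompt: model(prompt),
    metric=accuracy,
)


def toy_model(prompt: str) -> str:
    """Stand-in for an MLLM; always answers 'cat'."""
    return "cat"


score = evaluate(toy_model, toy_recipe)
print(score)
```

Under this sketch, a new evaluation would amount to building another `Recipe` with a different dataset, prompt template, inference strategy, or score function while reusing `evaluate` unchanged.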

Related papers