ChEF: A Comprehensive Evaluation Framework for Standardized Assessment
of Multimodal Large Language Models
- URL: http://arxiv.org/abs/2311.02692v1
- Date: Sun, 5 Nov 2023 16:01:40 GMT
- Title: ChEF: A Comprehensive Evaluation Framework for Standardized Assessment
of Multimodal Large Language Models
- Authors: Zhelun Shi, Zhipin Wang, Hongxing Fan, Zhenfei Yin, Lu Sheng, Yu Qiao,
Jing Shao
- Abstract summary: Multimodal Large Language Models (MLLMs) have shown impressive abilities in interacting with visual content with myriad potential downstream tasks.
We present the first Comprehensive Evaluation Framework (ChEF) that can holistically profile each MLLM and fairly compare different MLLMs.
We will publicly release all the detailed implementations for further analysis, as well as an easy-to-use modular toolkit for the integration of new recipes and models.
- Score: 49.48109472893714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) have shown impressive abilities in
interacting with visual content with myriad potential downstream tasks.
However, even though a list of benchmarks has been proposed, the capabilities
and limitations of MLLMs are still not comprehensively understood, due to a
lack of a standardized and holistic evaluation framework. To this end, we
present the first Comprehensive Evaluation Framework (ChEF) that can
holistically profile each MLLM and fairly compare different MLLMs. First, we
structure ChEF as four modular components, i.e., Scenario as scalable
multimodal datasets, Instruction as flexible instruction retrieving formulae,
Inferencer as reliable question answering strategies, and Metric as indicative
task-specific score functions. Based on them, ChEF facilitates versatile
evaluations in a standardized framework, and new evaluations can be built by
designing new Recipes (systematic selection of these four components). Notably,
current MLLM benchmarks can be readily summarized as recipes of ChEF. Second,
we introduce 6 new recipes to quantify competent MLLMs' desired capabilities
(termed desiderata, i.e., calibration, in-context learning, instruction
following, language performance, hallucination, and robustness) as reliable
agents that can perform real-world multimodal interactions. Third, we conduct a
large-scale evaluation of 9 prominent MLLMs on 9 scenarios and 6 desiderata.
Our evaluation yields over 20 valuable observations concerning the
generalizability of MLLMs across various scenarios and the composite capability
of MLLMs required for multimodal interactions. We will publicly release all the
detailed implementations for further analysis, as well as an easy-to-use
modular toolkit for the integration of new recipes and models, so that ChEF can
be a growing evaluation framework for the MLLM community.
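
The abstract describes a Recipe as a selection of four components (Scenario, Instruction, Inferencer, Metric). A minimal sketch of that composition might look like the following. Every class, function, and field name here is a hypothetical illustration and not the ChEF toolkit's actual API; the "model" is a stand-in callable that maps a prompt string to an answer string, and the toy data is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# NOTE: all names below are hypothetical; this is not the ChEF toolkit's API.
Sample = Dict[str, str]            # one dataset item: image path, question, answer
Model = Callable[[str], str]       # stand-in for an MLLM: prompt -> raw answer text


@dataclass
class Recipe:
    """One evaluation = a choice of scenario, instruction, inferencer, metric."""
    scenario: List[Sample]                              # the dataset
    instruction: Callable[[Sample], str]                # builds a prompt per sample
    inferencer: Callable[[Model, str], str]             # how the model is queried
    metric: Callable[[List[str], List[str]], float]     # scores predictions

    def evaluate(self, model: Model) -> float:
        prompts = [self.instruction(s) for s in self.scenario]
        predictions = [self.inferencer(model, p) for p in prompts]
        references = [s["answer"] for s in self.scenario]
        return self.metric(predictions, references)


def simple_instruction(sample: Sample) -> str:
    # Assumed prompt template, for illustration only.
    return f"<image:{sample['image']}> Question: {sample['question']} Answer:"


def direct_inferencer(model: Model, prompt: str) -> str:
    # Simplest strategy: query once and strip surrounding whitespace.
    return model(prompt).strip()


def accuracy(predictions: List[str], references: List[str]) -> float:
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Invented toy data and a toy model that always answers "blue".
toy_scenario = [
    {"image": "img_0.png", "question": "What color is the sky?", "answer": "blue"},
    {"image": "img_1.png", "question": "How many wheels does a bicycle have?", "answer": "two"},
]


def toy_model(prompt: str) -> str:
    return " blue "


recipe = Recipe(toy_scenario, simple_instruction, direct_inferencer, accuracy)
score = recipe.evaluate(toy_model)
print(score)  # 0.5: one of the two toy answers matches
```

Under this reading, swapping any single component (for example, a different metric or inferencer) produces a new recipe without touching the other three, which is the modularity the abstract emphasizes.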
Related papers
- Needle In A Multimodal Haystack [79.81804334634408]
We present the first benchmark specifically designed to evaluate the capability of existing MLLMs to comprehend long multimodal documents.
Our benchmark includes three types of evaluation tasks: multimodal retrieval, counting, and reasoning.
We observe that existing models still have significant room for improvement on these tasks, especially on vision-centric evaluation.
arXiv Detail & Related papers (2024-06-11T13:09:16Z)
- Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate [74.06294042304415]
We propose ScaleEval, an agent-debate-assisted meta-evaluation framework.
We release the code for our framework, which is publicly available on GitHub.
arXiv Detail & Related papers (2024-01-30T07:03:32Z)
- SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, spanning 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
- MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria [44.401826163314716]
We propose a new evaluation paradigm for MLLMs that uses a potent MLLM as the judge.
We benchmark 21 popular MLLMs in a pairwise-comparison fashion, showing diverse performance across models.
The validity of our benchmark manifests itself in reaching 88.02% agreement with human evaluation.
arXiv Detail & Related papers (2023-11-23T12:04:25Z)
- MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks [56.60050181186531]
We introduce MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions.
Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights.
arXiv Detail & Related papers (2023-10-13T11:57:04Z)
- Through the Lens of Core Competency: Survey on Evaluation of Large Language Models [27.271533306818732]
Large language models (LLMs) have excellent performance and wide practical uses.
Existing evaluation tasks struggle to keep up with the wide range of applications in real-world scenarios.
We summarize 4 core competencies of LLM, including reasoning, knowledge, reliability, and safety.
Under this competency architecture, similar tasks are combined to reflect corresponding ability, while new tasks can also be easily added into the system.
arXiv Detail & Related papers (2023-08-15T17:40:34Z)
- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models [73.86954509967416]
A Multimodal Large Language Model (MLLM) relies on a powerful LLM to perform multimodal tasks.
This paper presents the first comprehensive MLLM Evaluation benchmark MME.
It measures both perception and cognition abilities on a total of 14 subtasks.
arXiv Detail & Related papers (2023-06-23T09:22:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.