Related papers: MMBench: Is Your Multi-modal Model an All-around Player?

MMBench: Is Your Multi-modal Model an All-around Player?

URL: http://arxiv.org/abs/2307.06281v4
Date: Mon, 29 Apr 2024 15:21:19 GMT
Title: MMBench: Is Your Multi-modal Model an All-around Player?
Authors: Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, Dahua Lin,
Abstract summary: How to evaluate large vision-language models remains a major obstacle, hindering future model development. Traditional benchmarks provide quantitative performance measurements but suffer from a lack of fine-grained ability assessment and non-robust evaluation metrics. Recent subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, but they are not scalable and display significant bias. MMBench is a systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models.
Score: 114.45702807380415
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large vision-language models have recently achieved remarkable progress, exhibiting great perception and reasoning abilities concerning visual information. However, how to effectively evaluate these large vision-language models remains a major obstacle, hindering future model development. Traditional benchmarks like VQAv2 or COCO Caption provide quantitative performance measurements but suffer from a lack of fine-grained ability assessment and non-robust evaluation metrics. Recent subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, but they are not scalable and display significant bias. In response to these challenges, we propose MMBench, a novel multi-modality benchmark. MMBench methodically develops a comprehensive evaluation pipeline, primarily comprised of two elements. The first element is a meticulously curated dataset that surpasses existing similar benchmarks in terms of the number and variety of evaluation questions and abilities. The second element introduces a novel CircularEval strategy and incorporates the use of ChatGPT. This implementation is designed to convert free-form predictions into pre-defined choices, thereby facilitating a more robust evaluation of the model's predictions. MMBench is a systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models. We hope MMBench will assist the research community in better evaluating their models and encourage future advancements in this domain. Project page: https://opencompass.org.cn/mmbench.

Related papers

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z)
Evaluating and Advancing Multimodal Large Language Models in Ability Lens [30.083110119139793]
We introduce textbfAbilityLens, a unified benchmark designed to evaluate MLLMs across six key perception abilities. We identify the strengths and weaknesses of current models, highlighting stability patterns and revealing a notable performance gap between open-source and closed-source models. We also design a simple ability-specific model merging method that combines the best ability checkpoint from early training stages, effectively mitigating performance decline due to ability conflict.
arXiv Detail & Related papers (2024-11-22T04:41:20Z)
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs) MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM) VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety. Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z)
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models [71.8065384742686]
LMMS-EVAL is a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models. LMMS-EVAL LITE is a pruned evaluation toolkit that emphasizes both coverage and efficiency. Multimodal LIVEBENCH utilizes continuously updating news and online forums to assess models' generalization abilities in the wild.
arXiv Detail & Related papers (2024-07-17T17:51:53Z)
AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models [34.843603169616486]
We introduce AlignMMBench, a benchmark for evaluating the alignment capabilities of large Vision-Language Models (VLMs)<n>This benchmark is meticulously curated from real-world scenarios and internet sources, and includes both single-turn and multi-turn dialogue scenarios.<n>We also develop CritiqueVLM, a rule-calibrated evaluator that exceeds GPT-4's evaluation ability.
arXiv Detail & Related papers (2024-06-13T16:30:14Z)
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities [159.9847317300497]
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes.
arXiv Detail & Related papers (2023-08-04T17:59:47Z)
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension [27.53415400454066]
We introduce a benchmark named SEED-Bench to assess generative models. SEED-Bench consists of 19K multiple choice questions with accurate human annotations. We evaluate the performance of 18 models across all 12 dimensions, covering both the spatial and temporal understanding.
arXiv Detail & Related papers (2023-07-30T04:25:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.