MM-BigBench: Evaluating Multimodal Models on Multimodal Content
Comprehension Tasks
- URL: http://arxiv.org/abs/2310.09036v1
- Date: Fri, 13 Oct 2023 11:57:04 GMT
- Title: MM-BigBench: Evaluating Multimodal Models on Multimodal Content
Comprehension Tasks
- Authors: Xiaocui Yang, Wenfang Wu, Shi Feng, Ming Wang, Daling Wang, Yang Li,
Qi Sun, Yifei Zhang, Xiaoming Fu, Soujanya Poria
- Abstract summary: We introduce MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions.
Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights.
- Score: 56.60050181186531
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The popularity of multimodal large language models (MLLMs) has triggered a
recent surge in research efforts dedicated to evaluating these models.
Nevertheless, existing evaluation studies of MLLMs primarily focus on the
comprehension and reasoning of unimodal (vision) content, neglecting
performance evaluations in the domain of multimodal (vision-language) content
understanding. Beyond multimodal reasoning, tasks related to multimodal content
comprehension necessitate a profound understanding of multimodal contexts,
achieved through multimodal interaction, to obtain a final answer. In this
paper, we introduce a comprehensive assessment framework called MM-BigBench,
which incorporates a diverse range of metrics to offer an extensive evaluation
of the performance of various models and instructions across a wide spectrum of
multimodal content comprehension tasks. Consequently, our work
complements research on the performance of MLLMs in multimodal comprehension
tasks, achieving a more comprehensive and holistic evaluation of MLLMs. To
begin, we employ the Best Performance metric to ascertain each model's
performance upper bound on different datasets. Subsequently, the Mean Relative
Gain metric offers an assessment of the overall performance of various models
and instructions, while the Stability metric measures their sensitivity.
Furthermore, previous research centers on evaluating models independently or
solely assessing instructions, neglecting the adaptability between models and
instructions. We propose the Adaptability metric to quantify the adaptability
between models and instructions. Our paper evaluates a total of 20 language
models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10
instructions for each task, and derives novel insights. Our code will be
released at https://github.com/declare-lab/MM-BigBench.
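To make the four metrics above concrete, the sketch below is a minimal, assumed illustration rather than the official MM-BigBench implementation (the repository linked above is authoritative). It computes plausible versions of Best Performance, Mean Relative Gain, Stability, and Adaptability from a model-by-instruction accuracy matrix for a single dataset; every formula, normalization, and variable name here is an assumption for illustration.
```python
# Illustrative sketch (assumed, not the official MM-BigBench code): plausible
# versions of the four metrics described in the abstract, computed over a
# model-by-instruction accuracy matrix for one dataset.
import numpy as np

# scores[i, j] = accuracy of model i when prompted with instruction j
rng = np.random.default_rng(0)
scores = rng.uniform(0.3, 0.9, size=(20, 10))  # 20 models x 10 instructions

# Best Performance: each model's upper bound over all instructions.
best_performance = scores.max(axis=1)                      # shape: (n_models,)

# Mean Relative Gain (assumed form): a model's average gain relative to the
# mean score of all models under the same instruction, in percent.
per_instruction_mean = scores.mean(axis=0, keepdims=True)  # shape: (1, n_instr)
relative_gain = (scores - per_instruction_mean) / per_instruction_mean * 100
mean_relative_gain_models = relative_gain.mean(axis=1)     # one value per model

# The symmetric view scores instructions instead of models.
per_model_mean = scores.mean(axis=1, keepdims=True)
mean_relative_gain_instr = ((scores - per_model_mean) / per_model_mean * 100).mean(axis=0)

# Stability (assumed form): sensitivity of a model to the choice of instruction,
# here the standard deviation across instructions (lower = more stable).
stability_models = scores.std(axis=1)

# Adaptability (assumed form): for each instruction, the fraction of models for
# which it attains that model's best score; a peaked distribution suggests that
# particular model-instruction pairings matter.
best_instruction = scores.argmax(axis=1)
adaptability = np.bincount(best_instruction, minlength=scores.shape[1]) / scores.shape[0]

print("Best performance per model:", np.round(best_performance, 3))
print("Mean relative gain per model (%):", np.round(mean_relative_gain_models, 2))
print("Stability (std across instructions):", np.round(stability_models, 3))
print("Share of models for which each instruction is best:", np.round(adaptability, 2))
```
Under this reading, a high Mean Relative Gain flags models (or instructions) that outperform the average pairing, a low Stability value flags models whose scores barely move when the instruction changes, and Adaptability captures how strongly performance depends on matching a model with a suitable instruction.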
Related papers
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct [148.39859547619156]
We propose MMEvol, a novel multimodal instruction data evolution framework.
MMEvol iteratively improves data quality through a refined combination of fine-grained perception, cognitive reasoning, and interaction evolution.
Our approach reaches state-of-the-art (SOTA) performance on nine tasks while using significantly less data than existing SOTA models.
arXiv Detail & Related papers (2024-09-09T17:44:00Z)
- IWISDM: Assessing instruction following in multimodal models at scale [1.2320972303448239]
We introduce the instructed-Virtual VISual Decision Making (iWISDM) environment engineered to generate a limitless array of vision-language tasks.
Using iWISDM, we compiled three distinct benchmarks of instruction-following visual tasks across varying complexity levels.
Our findings establish iWISDM as a robust benchmark for assessing the instructional adherence of both existing and emergent multimodal models.
arXiv Detail & Related papers (2024-06-20T14:09:54Z)
- Needle In A Multimodal Haystack [79.81804334634408]
We present the first benchmark specifically designed to evaluate the capability of existing MLLMs to comprehend long multimodal documents.
Our benchmark includes three types of evaluation tasks: multimodal retrieval, counting, and reasoning.
We observe that existing models still have significant room for improvement on these tasks, especially on vision-centric evaluation.
arXiv Detail & Related papers (2024-06-11T13:09:16Z)
- Model Composition for Multimodal Large Language Models [71.5729418523411]
We propose a new paradigm that composes existing MLLMs to create a new model retaining the modal understanding capabilities of each original model.
Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters.
arXiv Detail & Related papers (2024-02-20T06:38:10Z)
- On the Performance of Multimodal Language Models [4.677125897916577]
This study conducts a comparative analysis of different multimodal instruction tuning approaches.
We reveal key insights for guiding architectural choices when incorporating multimodal capabilities into large language models.
arXiv Detail & Related papers (2023-10-04T23:33:36Z)
- MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities [159.9847317300497]
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks.
Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes.
arXiv Detail & Related papers (2023-08-04T17:59:47Z)