MM-BigBench: Evaluating Multimodal Models on Multimodal Content
Comprehension Tasks
- URL: http://arxiv.org/abs/2310.09036v1
- Date: Fri, 13 Oct 2023 11:57:04 GMT
- Title: MM-BigBench: Evaluating Multimodal Models on Multimodal Content
Comprehension Tasks
- Authors: Xiaocui Yang, Wenfang Wu, Shi Feng, Ming Wang, Daling Wang, Yang Li,
Qi Sun, Yifei Zhang, Xiaoming Fu, Soujanya Poria
- Abstract summary: We introduce MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions.
Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights.
- Score: 56.60050181186531
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The popularity of multimodal large language models (MLLMs) has triggered a
recent surge in research efforts dedicated to evaluating these models.
Nevertheless, existing evaluation studies of MLLMs primarily focus on the
comprehension and reasoning of unimodal (vision) content, neglecting
performance evaluations in the domain of multimodal (vision-language) content
understanding. Beyond multimodal reasoning, tasks related to multimodal content
comprehension necessitate a profound understanding of multimodal contexts,
achieved through the multimodal interaction to obtain a final answer. In this
paper, we introduce a comprehensive assessment framework called MM-BigBench,
which incorporates a diverse range of metrics to offer an extensive evaluation
of the performance of various models and instructions across a wide spectrum of
diverse multimodal content comprehension tasks. Consequently, our work
complements research on the performance of MLLMs in multimodal comprehension
tasks, achieving a more comprehensive and holistic evaluation of MLLMs. To
begin, we employ the Best Performance metric to ascertain each model's
performance upper bound on different datasets. Subsequently, the Mean Relative
Gain metric offers an assessment of the overall performance of various models
and instructions, while the Stability metric measures their sensitivity.
Furthermore, previous research centers on evaluating models independently or
solely assessing instructions, neglecting the adaptability between models and
instructions. We propose the Adaptability metric to quantify the adaptability
between models and instructions. Our paper evaluates a total of 20 language
models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10
instructions for each task, and derives novel insights. Our code will be
released at https://github.com/declare-lab/MM-BigBench.
Related papers
- MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs [47.94710556156627]
MIA-Bench is a benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions.
Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models' compliance with layered instructions.
arXiv Detail & Related papers (2024-07-01T17:53:35Z) - IWISDM: Assessing instruction following in multimodal models at scale [1.2320972303448239]
We introduce the instructed-Virtual VISual Decision Making (iWISDM) environment engineered to generate a limitless array of vision-language tasks.
Using iWISDM, we compiled three distinct benchmarks of instruction following visual tasks across varying complexity levels.
Our findings establish iWISDM as a robust benchmark for assessing the instructional adherence of both existing and emergent multimodal models.
arXiv Detail & Related papers (2024-06-20T14:09:54Z) - Needle In A Multimodal Haystack [79.81804334634408]
We present the first benchmark specifically designed to evaluate the capability of existing MLLMs to comprehend long multimodal documents.
Our benchmark includes three types of evaluation tasks: multimodal retrieval, counting, and reasoning.
We observe that existing models still have significant room for improvement on these tasks, especially on vision-centric evaluation.
arXiv Detail & Related papers (2024-06-11T13:09:16Z) - Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts [54.529880848937104]
We develop a unified MLLM with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities.
Specifically, it features modality-specific encoders with connectors for a unified multimodal representation.
We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets.
arXiv Detail & Related papers (2024-05-18T12:16:01Z) - Model Composition for Multimodal Large Language Models [73.70317850267149]
We propose a new paradigm through the model composition of existing MLLMs to create a new model that retains the modal understanding capabilities of each original model.
Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters.
arXiv Detail & Related papers (2024-02-20T06:38:10Z) - ChEF: A Comprehensive Evaluation Framework for Standardized Assessment
of Multimodal Large Language Models [49.48109472893714]
Multimodal Large Language Models (MLLMs) have shown impressive abilities in interacting with visual content with myriad potential downstream tasks.
We present the first Comprehensive Evaluation Framework (ChEF) that can holistically profile each MLLM and fairly compare different MLLMs.
We will publicly release all the detailed implementations for further analysis, as well as an easy-to-use modular toolkit for the integration of new recipes and models.
arXiv Detail & Related papers (2023-11-05T16:01:40Z) - On the Performance of Multimodal Language Models [4.677125897916577]
This study conducts a comparative analysis of different multimodal instruction tuning approaches.
We reveal key insights for guiding architectural choices when incorporating multimodal capabilities into large language models.
arXiv Detail & Related papers (2023-10-04T23:33:36Z) - MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities [159.9847317300497]
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks.
Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes.
arXiv Detail & Related papers (2023-08-04T17:59:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.