MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
- URL: http://arxiv.org/abs/2409.02813v2
- Date: Tue, 10 Sep 2024 12:55:31 GMT
- Title: MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
- Authors: Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, Graham Neubig
- Abstract summary: This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning benchmark.
MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities.
- Score: 77.93283927871758
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting where questions are embedded within images. This setting challenges AI to truly "see" and "read" simultaneously, testing a fundamental human cognitive skill of seamlessly integrating visual and textual information. Results show that model performance is substantially lower on MMMU-Pro than on MMMU, with drops ranging from 16.8% to 26.9% across models. We explore the impact of OCR prompts and Chain of Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT generally improves performance. MMMU-Pro provides a more rigorous evaluation tool, closely mimicking real-world scenarios and offering valuable directions for future research in multimodal AI.
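As a rough illustration of the construction pipeline, the sketch below shows how the first step (filtering out questions that text-only models can already answer) might be implemented. The `ask_text_only`-style callables, the option-lettering prompt, and the majority threshold are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch of MMMU-Pro's first construction step: drop questions
# that text-only models answer correctly without seeing the image.
# The model interface, prompt format, and threshold below are illustrative.
from typing import Callable, Dict, List

Question = Dict  # e.g. {"id": str, "question": str, "options": List[str], "answer": str}

def is_text_solvable(q: Question,
                     text_only_models: List[Callable[[str], str]],
                     pass_ratio: float = 0.5) -> bool:
    """Return True if at least `pass_ratio` of the text-only models
    pick the correct option without access to the image."""
    prompt = q["question"] + "\n" + "\n".join(
        f"({chr(65 + i)}) {opt}" for i, opt in enumerate(q["options"])
    )
    n_correct = sum(1 for ask in text_only_models
                    if ask(prompt).strip() == q["answer"])
    return n_correct / len(text_only_models) >= pass_ratio

def filter_for_vision_dependence(questions: List[Question],
                                 text_only_models: List[Callable[[str], str]]) -> List[Question]:
    """Keep only questions that the text-only models fail on."""
    return [q for q in questions if not is_text_solvable(q, text_only_models)]
```

In the benchmark itself, the filtered questions are then augmented with additional candidate options and rendered into images for the vision-only setting, which is what forces models to read the question from the image rather than from accompanying text.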
Related papers
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation [38.076276626337766]
MMEvalPro is a benchmark designed to avoid Type-I errors through a trilogy evaluation pipeline and more rigorous metrics.
MMEvalPro comprises 2,138 question triplets, totaling 6,414 distinct questions.
Compared with the existing benchmarks, our experiments with the latest LLMs and LMMs demonstrate that MMEvalPro is more challenging.
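As a small, hedged illustration of how a triplet-style metric like the one described above could be scored, the sketch below assumes each triplet bundles an original question with two probing questions and gives credit only when all three are answered correctly; the actual MMEvalPro pipeline and metric definitions may differ.

```python
# Hypothetical scoring sketch for a triplet benchmark in the spirit of
# MMEvalPro: a triplet counts as solved only if every question in it is
# answered correctly. The grouping and metric name are assumptions.
from typing import Dict, List

def triplet_accuracy(triplets: List[List[Dict]],
                     predictions: Dict[str, str]) -> float:
    """Fraction of triplets whose questions are all answered correctly."""
    if not triplets:
        return 0.0
    solved = sum(
        all(predictions.get(q["id"]) == q["answer"] for q in triplet)
        for triplet in triplets
    )
    return solved / len(triplets)
```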
arXiv Detail & Related papers (2024-06-29T15:28:45Z)
- Examining Modality Incongruity in Multimodal Federated Learning for Medical Vision and Language-based Disease Detection [7.515840210206994]
The impact of missing modalities across different clients, also called modality incongruity, has been largely overlooked.
This paper, for the first time, analyses the impact of modality incongruity and reveals its connection with data heterogeneity across participating clients.
arXiv Detail & Related papers (2024-02-07T22:16:53Z)
- VRPTEST: Evaluating Visual Referring Prompting in Large Multimodal Models [19.32035955420203]
We conduct the first comprehensive analysis of Large Multimodal Models (LMMs) using a variety of visual referring prompting strategies.
We develop an automated assessment framework to evaluate the accuracy of LMMs without the need for human intervention or manual labeling.
We find that the current proprietary models generally outperform the open-source ones, showing an average accuracy improvement of 22.70%.
arXiv Detail & Related papers (2023-12-07T06:53:55Z)
- MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities [159.9847317300497]
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks.
Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes.
arXiv Detail & Related papers (2023-08-04T17:59:47Z)
- MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
We propose MMBench, a benchmark for assessing the multi-modal capabilities of vision-language models.
MMBench is meticulously curated with well-designed quality control schemes.
MMBench incorporates multiple-choice questions in both English and Chinese versions.
arXiv Detail & Related papers (2023-07-12T16:23:09Z)
- VERITE: A Robust Benchmark for Multimodal Misinformation Detection Accounting for Unimodal Bias [17.107961913114778]
Multimodal misinformation is a growing problem on social media platforms.
In this study, we investigate and identify the presence of unimodal bias in widely-used MMD benchmarks.
We introduce a new method -- termed Crossmodal HArd Synthetic MisAlignment (CHASMA) -- for generating realistic synthetic training data.
arXiv Detail & Related papers (2023-04-27T12:28:29Z)
- Multimodal Chain-of-Thought Reasoning in Language Models [94.70184390935661]
We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework.
Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach.
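Below is a minimal sketch of a two-stage multimodal CoT pipeline in the spirit of the framework described above: the first stage produces a rationale from the question plus visual context, and the second infers the answer conditioned on that rationale. The callables and prompt strings are placeholders, not the paper's actual architecture.

```python
# Hypothetical two-stage multimodal CoT sketch: rationale generation followed
# by answer inference. `rationale_model` and `answer_model` stand in for
# fused vision-language models; the real training setup differs.
from typing import Callable

def multimodal_cot(question: str,
                   visual_context: str,
                   rationale_model: Callable[[str], str],
                   answer_model: Callable[[str], str]) -> str:
    # Stage 1: generate a rationale from text plus visual context.
    rationale = rationale_model(
        f"Question: {question}\nVisual context: {visual_context}\nRationale:"
    )
    # Stage 2: infer the answer conditioned on the generated rationale.
    return answer_model(
        f"Question: {question}\nVisual context: {visual_context}\n"
        f"Rationale: {rationale}\nAnswer:"
    )
```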
arXiv Detail & Related papers (2023-02-02T07:51:19Z)
- DUMA: Reading Comprehension with Transposition Thinking [107.89721765056281]
Multi-choice Machine Reading Comprehension (MRC) requires a model to select the correct answer from a set of answer options given a passage and a question.
The new DUal Multi-head Co-Attention (DUMA) model is inspired by humans' transposition thinking process for solving the multi-choice MRC problem.
arXiv Detail & Related papers (2020-01-26T07:35:02Z)