MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models
- URL: http://arxiv.org/abs/2504.03641v2
- Date: Mon, 07 Apr 2025 16:12:54 GMT
- Title: MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models
- Authors: Wulin Xie, Yi-Fan Zhang, Chaoyou Fu, Yang Shi, Bingyan Nie, Hongkai Chen, Zhang Zhang, Liang Wang, Tieniu Tan,
- Abstract summary: Existing MLLM benchmarks face significant challenges in evaluating Unified MLLMs (U-MLLMs). We present a comprehensive evaluation framework designed to systematically assess U-MLLMs.
- Score: 46.86468865942724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing MLLM benchmarks face significant challenges in evaluating Unified MLLMs (U-MLLMs) due to: 1) a lack of standardized benchmarks for traditional tasks, leading to inconsistent comparisons; and 2) the absence of benchmarks for mixed-modality generation, which fails to assess multimodal reasoning capabilities. We present a comprehensive evaluation framework designed to systematically assess U-MLLMs. Our benchmark includes: 1. Standardized Traditional Task Evaluation. We sample from 12 datasets, covering 10 tasks with 30 subtasks, ensuring consistent and fair comparisons across studies. 2. Unified Task Assessment. We introduce five novel tasks testing multimodal reasoning, including image editing, commonsense QA with image generation, and geometric reasoning. 3. Comprehensive Model Benchmarking. We evaluate 12 leading U-MLLMs, such as Janus-Pro, EMU3, VILA-U, and Gemini2-flash, alongside specialized understanding models (e.g., Claude-3.5-Sonnet) and generation models (e.g., DALL-E-3). Our findings reveal substantial performance gaps in existing U-MLLMs, highlighting the need for more robust models capable of handling mixed-modality tasks effectively. The code and evaluation data can be found at https://mme-unify.github.io/.
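The abstract describes scoring models over standardized tasks and subtasks; the sketch below shows one plausible way to aggregate per-subtask accuracies into per-task and overall scores. It is a minimal illustration only: the results file name, JSON layout, and field names are assumptions, not the actual MME-Unify data format or evaluation code.

```python
# Hypothetical sketch: average per-subtask accuracies within each task,
# then across tasks, for one model. The JSON layout below is an
# illustrative assumption, not the MME-Unify release format.
import json
from collections import defaultdict
from statistics import mean

def aggregate_scores(results_path: str) -> dict:
    """Return per-task averages and an overall average for one model."""
    with open(results_path) as f:
        # Assumed layout: [{"task": "...", "subtask": "...", "accuracy": 0.71}, ...]
        records = json.load(f)

    per_task = defaultdict(list)
    for r in records:
        per_task[r["task"]].append(r["accuracy"])

    task_scores = {task: mean(accs) for task, accs in per_task.items()}
    return {"per_task": task_scores, "overall": mean(task_scores.values())}

if __name__ == "__main__":
    # Hypothetical results file for one evaluated model.
    print(aggregate_scores("janus_pro_results.json"))
```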
Related papers
- Judge Anything: MLLM as a Judge Across Any Modality [43.51517213949702]
This paper introduces two benchmarks, TaskAnything and JudgeAnything, to evaluate the overall performance and judging capabilities of MLLMs across any-to-any modality tasks.
TaskAnything evaluates MMU and MMG capabilities across 15 any-to-any modality categories, employing 1,500 queries curated from well-established benchmarks.
JudgeAnything evaluates the judging capabilities of 5 advanced MLLMs (e.g., GPT-4o and Gemini-2.0-Flash) from the perspectives of Pair Comparison and Score Evaluation.
Our work highlights the need for fairer evaluation protocols and stronger alignment with human preferences.
arXiv Detail & Related papers (2025-03-21T18:59:20Z)
- VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information.
We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning.
We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
- MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia.
In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models.
This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z)
- MM-R$^3$: On (In-)Consistency of Multi-modal Large Language Models (MLLMs) [26.475993408532304]
We study the ability of an MLLM to produce semantically similar or identical responses to semantically similar queries.
We propose the MM-R$^3$ benchmark, which analyses the performance of SoTA MLLMs in terms of consistency and accuracy.
Our analysis reveals that consistency does not always align with accuracy, indicating that models with higher accuracy are not necessarily more consistent, and vice versa.
arXiv Detail & Related papers (2024-10-07T06:36:55Z)
- HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly [34.205934899868346]
We introduce HELMET, a comprehensive benchmark encompassing seven diverse, application-centric categories.
We find that synthetic tasks like NIAH do not reliably predict downstream performance.
While most LCLMs achieve perfect NIAH scores, open-source models significantly lag behind closed ones when tasks require full-context reasoning.
arXiv Detail & Related papers (2024-10-03T17:20:11Z)
- InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models [50.03163753638256]
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence.
Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning.
We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark.
arXiv Detail & Related papers (2023-11-20T07:06:31Z)
- MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities [153.37868034779385]
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks.
Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes.
arXiv Detail & Related papers (2023-08-04T17:59:47Z)
- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models [73.86954509967416]
A Multimodal Large Language Model (MLLM) relies on a powerful LLM to perform multimodal tasks.
This paper presents MME, the first comprehensive MLLM evaluation benchmark.
It measures both perception and cognition abilities on a total of 14 subtasks.
arXiv Detail & Related papers (2023-06-23T09:22:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.