MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation
- URL: http://arxiv.org/abs/2505.17613v1
- Date: Fri, 23 May 2025 08:21:28 GMT
- Title: MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation
- Authors: Jihan Yao, Yushi Hu, Yujie Yi, Bin Han, Shangbin Feng, Guang Yang, Bingbing Wen, Ranjay Krishna, Lucy Lu Wang, Yulia Tsvetkov, Noah A. Smith, Banghua Zhu
- Abstract summary: MMMG is a comprehensive benchmark for multimodal generation across 4 modality combinations. It is highly aligned with human evaluation, achieving an average agreement of 94.3%. GPT Image achieves 78.3% accuracy for image generation, but falls short on multimodal reasoning and interleaved generation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatically evaluating multimodal generation presents a significant challenge, as automated metrics often struggle to align reliably with human evaluation, especially for complex tasks that involve multiple modalities. To address this, we present MMMG, a comprehensive and human-aligned benchmark for multimodal generation across 4 modality combinations (image, audio, interleaved text and image, interleaved text and audio), with a focus on tasks that present significant challenges for generation models, while still enabling reliable automatic evaluation through a combination of models and programs. MMMG encompasses 49 tasks (including 29 newly developed ones), each with a carefully designed evaluation pipeline, and 937 instructions to systematically assess reasoning, controllability, and other key capabilities of multimodal generation models. Extensive validation demonstrates that MMMG is highly aligned with human evaluation, achieving an average agreement of 94.3%. Benchmarking results on 24 multimodal generation models reveal that even though the state-of-the-art model, GPT Image, achieves 78.3% accuracy for image generation, it falls short on multimodal reasoning and interleaved generation. Furthermore, results suggest considerable headroom for improvement in audio generation, highlighting an important direction for future research.
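The 94.3% headline figure is an agreement rate between MMMG's model-and-program pipelines and human raters, averaged across tasks. Below is a minimal sketch of how such a rate could be computed; the function names and toy data are illustrative assumptions, not MMMG's released code or results:

```python
# Minimal sketch of computing a benchmark-level human-agreement rate like
# MMMG's 94.3%. Illustrative only: function names and toy data are not
# MMMG's actual code or numbers.
from statistics import mean

def task_agreement(auto: list[bool], human: list[bool]) -> float:
    """Fraction of instructions where the automatic verdict matches the human one."""
    assert len(auto) == len(human)
    return sum(a == h for a, h in zip(auto, human)) / len(auto)

def benchmark_agreement(tasks: dict[str, tuple[list[bool], list[bool]]]) -> float:
    """Average the per-task agreement rates across all tasks."""
    return mean(task_agreement(auto, human) for auto, human in tasks.values())

# Hypothetical verdicts for two tasks: (automatic, human).
tasks = {
    "image_counting": ([True, True, False, True], [True, True, True, True]),
    "audio_pitch": ([False, True, True], [False, True, True]),
}
print(f"average agreement: {benchmark_agreement(tasks):.1%}")  # 87.5%
```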
Related papers
- A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation [14.590341095970883]
We introduce InterSyn, a large-scale multimodal dataset constructed using our Self-Evaluation with Iterative Refinement (SEIR) method. InterSyn features multi-turn, instruction-driven dialogues with tightly interleaved image-text responses. To address the lack of reliable evaluation tools capable of assessing interleaved multimodal outputs, we introduce SynJudge. A rough sketch of a SEIR-style loop follows this entry.
arXiv Detail & Related papers (2025-06-11T06:21:20Z)
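The entry above names SEIR only at a high level; the following is a hedged sketch of a generate, self-evaluate, refine cycle in that spirit. Every function here is a hypothetical stub (so the example runs), not InterSyn's actual interface:

```python
# Hedged sketch of a Self-Evaluation with Iterative Refinement (SEIR) style
# loop: draft, self-score, revise until the self-judge is satisfied. All
# three model calls are hypothetical stubs, not InterSyn's actual interface.
def generate(instruction: str) -> str:
    return f"draft answer to: {instruction}"

def self_evaluate(instruction: str, response: str) -> tuple[float, str]:
    # A real system would prompt the model to score its own output;
    # this stub simply rewards longer, revised drafts.
    return min(1.0, len(response) / 80), "tighten image-text alignment"

def refine(instruction: str, response: str, critique: str) -> str:
    return response + f" [revised: {critique}]"

def seir(instruction: str, max_rounds: int = 3, threshold: float = 0.9) -> str:
    response = generate(instruction)
    for _ in range(max_rounds):
        score, critique = self_evaluate(instruction, response)
        if score >= threshold:  # stop once the self-judge is satisfied
            return response
        response = refine(instruction, response, critique)
    return response

print(seir("describe the image, then continue the dialogue"))
```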
- MAMM-Refine: A Recipe for Improving Faithfulness in Generation with Multi-Agent Collaboration [63.31211701741323]
We extend multi-agent multi-model reasoning to generation, specifically to improving faithfulness through refinement. We design intrinsic evaluations for each subtask, with our findings indicating that both multi-agent (multiple instances) and multi-model (diverse LLM types) approaches benefit error detection and critiquing. We consolidate these insights into a final "recipe" called Multi-Agent Multi-Model Refinement (MAMM-Refine), where multi-agent and multi-model collaboration significantly boosts performance. A sketch of a multi-judge refinement round follows this entry.
arXiv Detail & Related papers (2025-03-19T14:46:53Z)
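As a rough illustration of the multi-agent, multi-model idea above, the sketch below has several judge models vote on faithfulness before a refiner rewrites the output. All names are hypothetical stand-ins, not MAMM-Refine's released code:

```python
# Hedged sketch of one multi-agent, multi-model refinement round in the
# spirit of MAMM-Refine: several judges (ideally diverse LLMs) vote on
# faithfulness; a refiner rewrites the output if the vote fails. All
# callables are hypothetical stubs so the example runs.
from collections import Counter

def make_judge(verdict: bool):
    # A real judge would be a distinct LLM checking the summary against the source.
    return lambda source, summary: verdict

judges = [make_judge(True), make_judge(False), make_judge(False)]

def refine_round(source: str, summary: str) -> str:
    votes = Counter(judge(source, summary) for judge in judges)
    if votes[True] > votes[False]:  # majority judged it faithful: keep as is
        return summary
    return summary + " [rewritten to drop the unsupported claim]"

print(refine_round("source document", "summary with a hallucinated figure"))
```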
- VisualPRM: An Effective Process Reward Model for Multimodal Reasoning [76.35753243272521]
We introduce VisualPRM, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs). Our model achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels. A sketch of step-wise PRM scoring follows this entry.
arXiv Detail & Related papers (2025-03-13T12:03:37Z)
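The step-wise correctness labels in VisualProcessBench suggest the usual PRM scoring pattern: judge each step, then aggregate. A minimal sketch under that assumption (the scorer below is a stub, not VisualPRM itself):

```python
# Hedged sketch of process-reward-model (PRM) scoring: judge each reasoning
# step, then aggregate so one bad step sinks the whole trace. The per-step
# scorer is a hypothetical stub, not VisualPRM's actual model.
def step_correct_prob(question: str, steps: list[str], i: int) -> float:
    return [0.95, 0.90, 0.40][i]  # toy per-step correctness probabilities

def score_solution(question: str, steps: list[str]) -> float:
    """Take the minimum step score: the chain is only as strong as its weakest step."""
    return min(step_correct_prob(question, steps, i) for i in range(len(steps)))

steps = ["read the chart axes", "extract the 2023 value", "divide by the total"]
print(score_solution("What fraction of sales came in 2023?", steps))  # 0.4
```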
- M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment [65.3860007085689]
M3-AGIQA is a comprehensive framework for AGI quality assessment. It includes a structured multi-round evaluation mechanism, where intermediate image descriptions are generated. Experiments conducted on multiple benchmark datasets demonstrate that M3-AGIQA achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-02-21T03:05:45Z)
- OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation [59.53678957969471]
Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding and generation tasks. However, generating interleaved image-text content remains a challenge. OpenING is a benchmark comprising 5,400 high-quality human-annotated instances across 56 real-world tasks. IntJudge is a judge model for evaluating open-ended multimodal generation methods.
arXiv Detail & Related papers (2024-11-27T16:39:04Z)
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- REBUS: A Robust Evaluation Benchmark of Understanding Symbols [1.90463290938268]
GPT-4o significantly outperforms all other models, and proprietary models as a group outperform all other evaluated models.
Even the best model achieves a final accuracy of only 42%, dropping to just 7% on hard puzzles.
Our benchmark can therefore be used to identify major shortcomings in the knowledge and reasoning of multimodal large language models.
arXiv Detail & Related papers (2024-01-11T00:30:28Z)
- Assessor360: Multi-sequence Network for Blind Omnidirectional Image Quality Assessment [50.82681686110528]
Blind Omnidirectional Image Quality Assessment (BOIQA) aims to objectively assess the human perceptual quality of omnidirectional images (ODIs).
Quality assessment of ODIs is severely hampered because existing BOIQA pipelines do not model the observer's browsing process.
We propose a novel multi-sequence network for BOIQA called Assessor360, which is derived from the realistic multi-assessor ODI quality assessment procedure.
arXiv Detail & Related papers (2023-05-18T13:55:28Z)