MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning
- URL: http://arxiv.org/abs/2506.10963v2
- Date: Fri, 13 Jun 2025 04:39:54 GMT
- Title: MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning
- Authors: Yuxuan Luo, Yuhui Yuan, Junwen Chen, Haonan Cai, Ziyi Yue, Yuwei Yang, Fatima Zohra Daha, Ji Li, Zhouhui Lian
- Abstract summary: We introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG). MMMG offers 4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. We introduce MMMG-Score to evaluate generated knowledge images. This metric combines factual fidelity, measured by graph-edit distance between KGs, with visual clarity assessment.
- Score: 20.382087716921003
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this paper, we introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG) to probe the reasoning capability of image generation models. Knowledge images have been central to human civilization and to the mechanisms of human learning -- a fact underscored by dual-coding theory and the picture-superiority effect. Generating such images is challenging, demanding multimodal reasoning that fuses world knowledge with pixel-level grounding into clear explanatory visuals. To enable comprehensive evaluation, MMMG offers 4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. To eliminate confounding complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a target image's core entities and their dependencies. We further introduce MMMG-Score to evaluate generated knowledge images. This metric combines factual fidelity, measured by graph-edit distance between KGs, with visual clarity assessment. Comprehensive evaluations of 16 state-of-the-art text-to-image generation models expose serious reasoning deficits -- low entity fidelity, weak relations, and clutter -- with GPT-4o achieving an MMMG-Score of only 50.20, underscoring the benchmark's difficulty. To spur further progress, we release FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines a reasoning LLM with diffusion models and is trained on 16,000 curated knowledge image-prompt pairs.
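To make the scoring concrete, here is a minimal, hypothetical Python sketch of how a metric in the spirit of MMMG-Score could combine a graph-edit-distance-based fidelity term with a visual-clarity term. The function names, the normalization of the edit distance, and the multiplicative combination are illustrative assumptions, not the paper's exact definition.

```python
# Hedged sketch of an MMMG-Score-style metric: factual fidelity from a
# normalized graph-edit distance between knowledge graphs (KGs), combined with
# a visual-clarity score in [0, 1]. The normalization and the product-style
# combination are assumptions for illustration, not the paper's formula.
import networkx as nx

def kg_from_triples(triples):
    """Build a directed KG from (head, relation, tail) triples."""
    g = nx.DiGraph()
    for head, relation, tail in triples:
        g.add_node(head, name=head)
        g.add_node(tail, name=tail)
        g.add_edge(head, tail, relation=relation)
    return g

def fidelity(reference_kg, generated_kg):
    """Map graph-edit distance to a [0, 1] fidelity score (1 = identical KGs)."""
    ged = nx.graph_edit_distance(
        reference_kg, generated_kg,
        node_match=lambda a, b: a["name"] == b["name"],
        edge_match=lambda a, b: a["relation"] == b["relation"],
    )
    # Normalize by a worst case: delete one KG entirely and insert the other.
    worst = (reference_kg.number_of_nodes() + reference_kg.number_of_edges()
             + generated_kg.number_of_nodes() + generated_kg.number_of_edges())
    return 1.0 - ged / worst if worst else 1.0

def mmmg_style_score(reference_kg, generated_kg, clarity):
    """Combine KG fidelity with a clarity score (e.g. from a VLM judge)."""
    return 100.0 * fidelity(reference_kg, generated_kg) * clarity

# Example: a tiny water-cycle reference KG vs. a KG extracted from a generated image.
ref = kg_from_triples([("evaporation", "leads_to", "condensation"),
                       ("condensation", "leads_to", "precipitation")])
gen = kg_from_triples([("evaporation", "leads_to", "precipitation")])
print(round(mmmg_style_score(ref, gen, clarity=0.8), 2))
```

In the paper, the clarity term comes from a separate visual-clarity assessment; here it is simply passed in as a number so the fidelity-versus-clarity trade-off is easy to see.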
Related papers
- UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation [14.95468978198402]
OpenAI introduced the powerful GPT-4o-Image model, which showcases advanced capabilities in comprehensive image perception and manipulation. Inspired by this, we propose UniWorld-V1, a unified generative framework built upon semantic features extracted from powerful large language models.
arXiv Detail & Related papers (2025-06-03T17:59:33Z)
- SridBench: Benchmark of Scientific Research Illustration Drawing of Image Generation Model [21.81341169834812]
SridBench is the first benchmark for scientific figure generation. It comprises 1,120 instances from leading scientific papers across 13 natural and computer science disciplines. Results reveal that even top-tier models like GPT-4o-image lag behind human performance.
arXiv Detail & Related papers (2025-05-28T08:51:01Z)
- MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models [42.91502354577658]
MMIG-Bench is a comprehensive Multi-Modal Image Generation Benchmark. It pairs 4,850 richly annotated text prompts with 1,750 multi-view reference images across 380 subjects. Using MMIG-Bench, we benchmark 17 state-of-the-art models, including Gemini 2.5 Pro, FLUX, DreamBooth, and IP-Adapter.
arXiv Detail & Related papers (2025-05-26T02:07:24Z)
- Harmonizing Visual Representations for Unified Multimodal Understanding and Generation [53.01486796503091]
We present Harmon, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder. Harmon achieves state-of-the-art image generation results on the GenEval, MJHQ30K and WISE benchmarks.
arXiv Detail & Related papers (2025-03-27T20:50:38Z)
- Unforgettable Lessons from Forgettable Images: Intra-Class Memorability Matters in Computer Vision [8.210681499876216]
We introduce intra-class memorability, where certain images within the same class are more memorable than others. We propose the Intra-Class Memorability score (ICMscore), a novel metric that incorporates the temporal intervals between repeated image presentations into its calculation. We curate the Intra-Class Memorability dataset (ICMD), comprising over 5,000 images across ten object classes with their ICMscores derived from 2,000 participants' responses.
arXiv Detail & Related papers (2024-12-30T07:09:28Z)
- MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models [115.16022378880376]
We introduce a multimodal retrieval-augmented generation benchmark, MRAG-Bench. MRAG-Bench consists of 16,130 images and 1,353 human-annotated multiple-choice questions. Results show that all large vision-language models (LVLMs) exhibit greater improvements when augmented with images than with textual knowledge.
arXiv Detail & Related papers (2024-10-10T17:55:02Z)
- Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark [53.61633384281524]
PolyMATH is a benchmark aimed at evaluating the general cognitive reasoning abilities of MLLMs.
The best scores achieved on PolyMATH are 41%, 36%, and 27%, obtained by Claude-3.5 Sonnet, GPT-4o, and Gemini-1.5 Pro, respectively.
A further fine-grained error analysis reveals that these models struggle to understand spatial relations and perform drawn-out, high-level reasoning.
arXiv Detail & Related papers (2024-10-06T20:35:41Z)
- MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models [76.1999277491816]
Multimodal Multi-image Understanding (MMIU) is a comprehensive evaluation suite designed to assess Large Vision-Language Models (LVLMs).
MMIU encompasses 7 types of multi-image relationships, 52 tasks, 77K images, and 11K meticulously curated multiple-choice questions.
Our evaluation of 24 popular LVLMs, including both open-source and proprietary models, reveals significant challenges in multi-image comprehension.
arXiv Detail & Related papers (2024-08-05T17:56:41Z)
- Categorical Knowledge Fused Recognition: Fusing Hierarchical Knowledge with Image Classification through Aligning and Deep Metric Learning [18.534970504136254]
We propose a novel deep metric learning-based method to fuse prior knowledge about image categories with mainstream backbone image classification models. The proposed method is effective in enhancing the reasoning aspect of image recognition in terms of weakly-supervised object localization performance.
arXiv Detail & Related papers (2024-07-30T07:24:33Z)
- Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark [63.296342841358815]
Large Multimodal Models (LMMs) have made significant strides in visual question-answering for single images. The ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering. We introduce MIRAGE, an open-source, lightweight visual-RAG framework that processes up to 10k images on a single 40G A100 GPU.
arXiv Detail & Related papers (2024-07-18T17:59:30Z)
- TIGeR: Unifying Text-to-Image Generation and Retrieval with Large Multimodal Models [96.72318842152148]
We propose a unified framework for text-to-image generation and retrieval with a single Large Multimodal Model (LMM). Specifically, we first explore the intrinsic discriminative abilities of LMMs and introduce an efficient generative retrieval method for text-to-image retrieval in a training-free manner. We then propose an autonomous decision mechanism to choose the better match between the generated and retrieved images as the response to the text prompt.
arXiv Detail & Related papers (2024-06-09T15:00:28Z)
- Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models [57.95366341738857]
In-depth analyses show that instruction-tuned LVLMs exhibit a modality gap, showing a discrepancy when given textual and visual inputs that correspond to the same concept. We propose a multiple attribute-centric evaluation benchmark, Finer, to evaluate LVLMs' fine-grained visual comprehension ability and provide significantly improved explainability.
arXiv Detail & Related papers (2024-02-26T05:43:51Z)
- MuMIC -- Multimodal Embedding for Multi-label Image Classification with Tempered Sigmoid [1.1452732046200158]
Multimodal learning approaches have recently achieved outstanding results in image representation and single-label image classification.
We propose the Multimodal Multi-label Image Classification (MuMIC) framework, which utilizes a hardness-aware, tempered-sigmoid-based Binary Cross-Entropy loss function.
MuMIC is capable of providing high classification performance, handling real-world noisy data, supporting zero-shot predictions, and producing domain-specific image embeddings.
arXiv Detail & Related papers (2022-11-02T17:29:35Z)
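The MuMIC entry above mentions a hardness-aware, tempered-sigmoid Binary Cross-Entropy loss for multi-label image classification. The PyTorch sketch below is a hedged approximation: the temperature-scaled sigmoid and the focal-style hardness weighting are assumptions chosen for illustration and may differ from MuMIC's actual formulation.

```python
# Hypothetical sketch of a hardness-aware, tempered-sigmoid BCE loss for
# multi-label classification. Parameter choices are illustrative only.
import torch

def tempered_sigmoid(logits, temperature=2.0):
    """Temperature-scaled sigmoid: 1 / (1 + exp(-logits / temperature))."""
    return torch.sigmoid(logits / temperature)

def hardness_aware_bce(logits, targets, temperature=2.0, gamma=2.0, eps=1e-7):
    """Multi-label BCE on tempered-sigmoid probabilities, up-weighting hard labels.

    logits, targets: tensors of shape (batch, num_labels); targets are 0/1 floats.
    gamma > 0 controls how strongly misclassified (hard) labels are emphasized.
    """
    probs = tempered_sigmoid(logits, temperature).clamp(eps, 1.0 - eps)
    bce = -(targets * torch.log(probs) + (1.0 - targets) * torch.log(1.0 - probs))
    # p_t is the probability assigned to the correct label value;
    # (1 - p_t) ** gamma is large for hard (badly predicted) labels.
    p_t = targets * probs + (1.0 - targets) * (1.0 - probs)
    return ((1.0 - p_t) ** gamma * bce).mean()

# Toy usage: a batch of 2 images with 4 candidate labels each.
logits = torch.randn(2, 4, requires_grad=True)
targets = torch.tensor([[1.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
loss = hardness_aware_bce(logits, targets)
loss.backward()
```

Lowering the temperature sharpens the predicted probabilities, while gamma controls how strongly badly predicted labels dominate the loss.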