MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning
- URL: http://arxiv.org/abs/2506.10963v2
- Date: Fri, 13 Jun 2025 04:39:54 GMT
- Title: MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning
- Authors: Yuxuan Luo, Yuhui Yuan, Junwen Chen, Haonan Cai, Ziyi Yue, Yuwei Yang, Fatima Zohra Daha, Ji Li, Zhouhui Lian
- Abstract summary: We introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG). MMMG offers 4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. We introduce MMMG-Score to evaluate generated knowledge images. This metric combines factual fidelity, measured by graph-edit distance between KGs, with visual clarity assessment.
- Score: 20.382087716921003
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this paper, we introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG) to probe the reasoning capability of image generation models. Knowledge images have been central to human civilization and to the mechanisms of human learning -- a fact underscored by dual-coding theory and the picture-superiority effect. Generating such images is challenging, demanding multimodal reasoning that fuses world knowledge with pixel-level grounding into clear explanatory visuals. To enable comprehensive evaluation, MMMG offers 4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. To eliminate confounding complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a target image's core entities and their dependencies. We further introduce MMMG-Score to evaluate generated knowledge images. This metric combines factual fidelity, measured by graph-edit distance between KGs, with visual clarity assessment. Comprehensive evaluations of 16 state-of-the-art text-to-image generation models expose serious reasoning deficits -- low entity fidelity, weak relations, and clutter -- with GPT-4o achieving an MMMG-Score of only 50.20, underscoring the benchmark's difficulty. To spur further progress, we release FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines a reasoning LLM with diffusion models and is trained on 16,000 curated knowledge image-prompt pairs.
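To make the scoring concrete, here is a minimal, hypothetical Python sketch of how a metric in the spirit of MMMG-Score could combine a graph-edit-distance-based fidelity term with a visual-clarity term. The function names, the normalization of the edit distance, and the multiplicative combination are illustrative assumptions, not the paper's exact definition.

```python
# Hedged sketch of an MMMG-Score-style metric: factual fidelity from a
# normalized graph-edit distance between knowledge graphs (KGs), combined with
# a visual-clarity score in [0, 1]. The normalization and the product-style
# combination are assumptions for illustration, not the paper's formula.
import networkx as nx

def kg_from_triples(triples):
    """Build a directed KG from (head, relation, tail) triples."""
    g = nx.DiGraph()
    for head, relation, tail in triples:
        g.add_node(head, name=head)
        g.add_node(tail, name=tail)
        g.add_edge(head, tail, relation=relation)
    return g

def fidelity(reference_kg, generated_kg):
    """Map graph-edit distance to a [0, 1] fidelity score (1 = identical KGs)."""
    ged = nx.graph_edit_distance(
        reference_kg, generated_kg,
        node_match=lambda a, b: a["name"] == b["name"],
        edge_match=lambda a, b: a["relation"] == b["relation"],
    )
    # Normalize by a worst case: delete one KG entirely and insert the other.
    worst = (reference_kg.number_of_nodes() + reference_kg.number_of_edges()
             + generated_kg.number_of_nodes() + generated_kg.number_of_edges())
    return 1.0 - ged / worst if worst else 1.0

def mmmg_style_score(reference_kg, generated_kg, clarity):
    """Combine KG fidelity with a clarity score (e.g. from a VLM judge)."""
    return 100.0 * fidelity(reference_kg, generated_kg) * clarity

# Example: a tiny water-cycle reference KG vs. a KG extracted from a generated image.
ref = kg_from_triples([("evaporation", "leads_to", "condensation"),
                       ("condensation", "leads_to", "precipitation")])
gen = kg_from_triples([("evaporation", "leads_to", "precipitation")])
print(round(mmmg_style_score(ref, gen, clarity=0.8), 2))
```

In the paper, the clarity term comes from a separate visual-clarity assessment; here it is simply passed in as a number so the fidelity-versus-clarity trade-off is easy to see.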
Related papers
- UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation [14.95468978198402]
OpenAI introduced the powerful GPT-4o-Image model, which showcases advanced capabilities in comprehensive image perception and manipulation. Inspired by this, we propose UniWorld-V1, a unified generative framework built upon semantic features extracted from powerful large language models.
arXiv Detail & Related papers (2025-06-03T17:59:33Z)
- SridBench: Benchmark of Scientific Research Illustration Drawing of Image Generation Model [21.81341169834812]
SridBench is the first benchmark for scientific figure generation. It comprises 1,120 instances from leading scientific papers across 13 natural and computer science disciplines. Results reveal that even top-tier models like GPT-4o-image lag behind human performance.
arXiv Detail & Related papers (2025-05-28T08:51:01Z)
- MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models [42.91502354577658]
MMIG-Bench is a comprehensive Multi-Modal Image Generation Benchmark. It pairs 4,850 richly annotated text prompts with 1,750 multi-view reference images across 380 subjects. Using MMIG-Bench, we benchmark 17 state-of-the-art models, including Gemini 2.5 Pro, FLUX, DreamBooth, and IP-Adapter.
arXiv Detail & Related papers (2025-05-26T02:07:24Z)
- Harmonizing Visual Representations for Unified Multimodal Understanding and Generation [53.01486796503091]
We present Harmon, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder. Harmon achieves state-of-the-art image generation results on the GenEval, MJHQ30K and WISE benchmarks.
arXiv Detail & Related papers (2025-03-27T20:50:38Z)
- Unforgettable Lessons from Forgettable Images: Intra-Class Memorability Matters in Computer Vision [8.210681499876216]
We introduce intra-class memorability, where certain images within the same class are more memorable than others. We propose the Intra-Class Memorability score (ICMscore), a novel metric that incorporates the temporal intervals between repeated image presentations into its calculation. We curate the Intra-Class Memorability dataset (ICMD), comprising over 5,000 images across ten object classes with their ICMscores derived from 2,000 participants' responses.
arXiv Detail & Related papers (2024-12-30T07:09:28Z)
- MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models [115.16022378880376]
We introduce a multimodal retrieval-augmented generation benchmark, MRAG-Bench. MRAG-Bench consists of 16,130 images and 1,353 human-annotated multiple-choice questions. Results show that all large vision-language models (LVLMs) exhibit greater improvements when augmented with images than with textual knowledge.
arXiv Detail & Related papers (2024-10-10T17:55:02Z)
- Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark [53.61633384281524]
PolyMATH is a benchmark aimed at evaluating the general cognitive reasoning abilities of MLLMs.
The best scores achieved on PolyMATH are 41%, 36%, and 27%, obtained by Claude-3.5 Sonnet, GPT-4o, and Gemini-1.5 Pro, respectively.
A further fine-grained error analysis reveals that these models struggle to understand spatial relations and perform drawn-out, high-level reasoning.
arXiv Detail & Related papers (2024-10-06T20:35:41Z)
- MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models [76.1999277491816]
Multimodal Multi-image Understanding (MMIU) is a comprehensive evaluation suite designed to assess Large Vision-Language Models (LVLMs).
MMIU encompasses 7 types of multi-image relationships, 52 tasks, 77K images, and 11K meticulously curated multiple-choice questions.
Our evaluation of 24 popular LVLMs, including both open-source and proprietary models, reveals significant challenges in multi-image comprehension.
arXiv Detail & Related papers (2024-08-05T17:56:41Z)
- Categorical Knowledge Fused Recognition: Fusing Hierarchical Knowledge with Image Classification through Aligning and Deep Metric Learning [18.534970504136254]
We propose a novel deep metric learning-based method to fuse prior knowledge about image categories with mainstream backbone image classification models. The proposed method is effective in enhancing the reasoning aspect of image recognition in terms of weakly-supervised object localization performance.
arXiv Detail & Related papers (2024-07-30T07:24:33Z)
- Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark [63.296342841358815]
Large Multimodal Models (LMMs) have made significant strides in visual question-answering for single images. The ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering. We introduce MIRAGE, an open-source, lightweight visual-RAG framework that processes up to 10k images on a single 40G A100 GPU.
arXiv Detail & Related papers (2024-07-18T17:59:30Z)
- TIGeR: Unifying Text-to-Image Generation and Retrieval with Large Multimodal Models [96.72318842152148]
We propose a unified framework for text-to-image generation and retrieval with a single Large Multimodal Model (LMM). Specifically, we first explore the intrinsic discriminative abilities of LMMs and introduce an efficient generative retrieval method for text-to-image retrieval in a training-free manner. We then propose an autonomous decision mechanism to choose the better match between the generated and retrieved images as the response to the text prompt.
arXiv Detail & Related papers (2024-06-09T15:00:28Z)
- Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models [57.95366341738857]
In-depth analyses show that instruction-tuned LVLMs exhibit a modality gap, showing a discrepancy when given textual and visual inputs that correspond to the same concept. We propose a multiple attribute-centric evaluation benchmark, Finer, to evaluate LVLMs' fine-grained visual comprehension ability and provide significantly improved explainability.
arXiv Detail & Related papers (2024-02-26T05:43:51Z)
- MuMIC -- Multimodal Embedding for Multi-label Image Classification with Tempered Sigmoid [1.1452732046200158]
Multimodal learning approaches have recently achieved outstanding results in image representation and single-label image classification.
We propose the Multimodal Multi-label Image Classification (MuMIC) framework, which utilizes a hardness-aware, tempered-sigmoid-based Binary Cross-Entropy loss function.
MuMIC is capable of providing high classification performance, handling real-world noisy data, supporting zero-shot predictions, and producing domain-specific image embeddings.
arXiv Detail & Related papers (2022-11-02T17:29:35Z)
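The MuMIC entry above mentions a hardness-aware, tempered-sigmoid Binary Cross-Entropy loss for multi-label image classification. The PyTorch sketch below is a hedged approximation: the temperature-scaled sigmoid and the focal-style hardness weighting are assumptions chosen for illustration and may differ from MuMIC's actual formulation.

```python
# Hypothetical sketch of a hardness-aware, tempered-sigmoid BCE loss for
# multi-label classification. Parameter choices are illustrative only.
import torch

def tempered_sigmoid(logits, temperature=2.0):
    """Temperature-scaled sigmoid: 1 / (1 + exp(-logits / temperature))."""
    return torch.sigmoid(logits / temperature)

def hardness_aware_bce(logits, targets, temperature=2.0, gamma=2.0, eps=1e-7):
    """Multi-label BCE on tempered-sigmoid probabilities, up-weighting hard labels.

    logits, targets: tensors of shape (batch, num_labels); targets are 0/1 floats.
    gamma > 0 controls how strongly misclassified (hard) labels are emphasized.
    """
    probs = tempered_sigmoid(logits, temperature).clamp(eps, 1.0 - eps)
    bce = -(targets * torch.log(probs) + (1.0 - targets) * torch.log(1.0 - probs))
    # p_t is the probability assigned to the correct label value;
    # (1 - p_t) ** gamma is large for hard (badly predicted) labels.
    p_t = targets * probs + (1.0 - targets) * (1.0 - probs)
    return ((1.0 - p_t) ** gamma * bce).mean()

# Toy usage: a batch of 2 images with 4 candidate labels each.
logits = torch.randn(2, 4, requires_grad=True)
targets = torch.tensor([[1.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
loss = hardness_aware_bce(logits, targets)
loss.backward()
```

Lowering the temperature sharpens the predicted probabilities, while gamma controls how strongly badly predicted labels dominate the loss.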