Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark
- URL: http://arxiv.org/abs/2510.13759v1
- Date: Wed, 15 Oct 2025 17:10:35 GMT
- Title: Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark
- Authors: Kai Zou, Ziqi Huang, Yuhao Dong, Shulin Tian, Dian Zheng, Hongbo Liu, Jingwen He, Bin Liu, Yu Qiao, Ziwei Liu
- Abstract summary: Unified multimodal models aim to jointly enable visual understanding and generation, yet current benchmarks rarely examine their true integration. We present Uni-MMMU, a comprehensive benchmark that unfolds the bidirectional synergy between generation and understanding across eight reasoning-centric domains.
- Score: 69.8473923357969
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unified multimodal models aim to jointly enable visual understanding and generation, yet current benchmarks rarely examine their true integration. Existing evaluations either treat the two abilities in isolation or overlook tasks that inherently couple them. To address this gap, we present Uni-MMMU, a comprehensive and discipline-aware benchmark that systematically unfolds the bidirectional synergy between generation and understanding across eight reasoning-centric domains, including science, coding, mathematics, and puzzles. Each task is bidirectionally coupled, requiring models either to (i) leverage conceptual understanding to guide precise visual synthesis, or (ii) utilize generation as a cognitive scaffold for analytical reasoning. Uni-MMMU incorporates verifiable intermediate reasoning steps, unique ground truths, and a reproducible scoring protocol for both textual and visual outputs. Through extensive evaluation of state-of-the-art unified, generation-only, and understanding-only models, we reveal substantial performance disparities and cross-modal dependencies, offering new insights into when and how these abilities reinforce one another, and establishing a reliable foundation for advancing unified models.
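The abstract mentions a reproducible scoring protocol over both textual and visual outputs but does not spell it out. As a rough, hypothetical sketch only (none of the names below come from the Uni-MMMU release), the snippet shows how an evaluation harness might combine an exact-match check against a unique textual ground truth with a pluggable verifier for the generated image:

```python
# Hypothetical scoring sketch; NOT the official Uni-MMMU protocol.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Example:
    question: str
    gt_answer: str                       # unique textual ground truth
    gt_image_path: Optional[str] = None  # reference image, if the task requires generation


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers still match."""
    return " ".join(text.lower().split())


def score_example(
    example: Example,
    pred_answer: str,
    pred_image_path: Optional[str],
    image_verifier: Callable[[str, str], bool],
) -> dict:
    """Score one benchmark item: exact match for text, a task-specific check for images."""
    text_correct = normalize(pred_answer) == normalize(example.gt_answer)
    image_correct = None
    if example.gt_image_path is not None and pred_image_path is not None:
        # The verifier is deliberately left abstract: for maze or puzzle tasks it might
        # compare cell states, for science tasks it might call a learned judge (assumption).
        image_correct = image_verifier(pred_image_path, example.gt_image_path)
    return {"text_correct": text_correct, "image_correct": image_correct}
```

Leaving the image check as a callable reflects that visual outputs in such reasoning tasks typically need task-specific verification rather than a single generic metric; the actual protocol is the one defined in the paper itself.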
Related papers
- UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark [72.37370242707432]
This paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset. UniM contains 31K high-quality instances across 30 domains and 7 representative modalities. We also introduce the UniM Evaluation Suite, which assesses models along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence.
arXiv Detail & Related papers (2026-03-05T11:45:16Z)
- UniG2U-Bench: Do Unified Models Advance Multimodal Understanding? [50.92401586025528]
Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. We introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks.
arXiv Detail & Related papers (2026-03-03T18:36:16Z)
- Synergizing Understanding and Generation with Interleaved Analyzing-Drafting Thinking [154.2388970262703]
Unified Vision-Language Models (UVLMs) aim to advance multimodal learning by supporting both understanding and generation within a single framework. We introduce the interleaved Analyzing-Drafting problem-solving loop (AD-Loop), a new thinking paradigm that alternates between analytic and drafting operations. By interleaving textual thoughts with visual thoughts, AD-Loop enables models to iteratively refine both comprehension and outputs, fostering genuine synergy.
arXiv Detail & Related papers (2026-02-24T23:26:09Z)
- Quantifying the Gap between Understanding and Generation within Unified Multimodal Models [66.07644743841007]
GapEval is a benchmark designed to quantify the gap between understanding and generation capabilities. Experiments reveal a persistent gap between the two directions across a wide range of UMMs. Our findings indicate that knowledge within UMMs often remains disjoint.
arXiv Detail & Related papers (2026-02-02T14:19:37Z)
- UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision [34.575729271291436]
We formalize Conduction Aphasia, a phenomenon where models accurately interpret multimodal inputs but struggle to translate that understanding into faithful and controllable synthesis. We propose UniCorn, a simple yet elegant self-improvement framework that eliminates the need for external data or teacher supervision. To validate the restoration of multimodal coherence, we introduce UniCycle, a cycle-consistency benchmark based on a Text-to-Image-to-Text reconstruction loop.
arXiv Detail & Related papers (2026-01-06T17:15:50Z)
- Bridging the Gap Between Multimodal Foundation Models and World Models [10.001347956177879]
We investigate what it takes to bridge the gap between multimodal foundation models and world models. Our approaches incorporate scene graphs, multimodal conditioning, and alignment strategies to guide the generation process. We extend these techniques to controllable 4D generation, enabling interactive, editable, and morphable object synthesis over time and space.
arXiv Detail & Related papers (2025-10-04T08:14:20Z)
- RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark [71.3555284685426]
We introduce RealUnify, a benchmark designed to evaluate bidirectional capability synergy. RealUnify comprises 1,000 meticulously human-annotated instances spanning 10 categories and 32 subtasks. We find that current unified models still struggle to achieve effective synergy, indicating that architectural unification alone is insufficient.
arXiv Detail & Related papers (2025-09-29T15:07:28Z)
- UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception [54.53657134205492]
UniAlignment is a unified multimodal generation framework within a single diffusion transformer. It incorporates both intrinsic-modal semantic alignment and cross-modal semantic alignment, thereby enhancing the model's cross-modal consistency and instruction-following robustness. We present SemGen-Bench, a new benchmark specifically designed to evaluate multimodal semantic consistency under complex textual instructions.
arXiv Detail & Related papers (2025-09-28T09:11:30Z)
- Complementarity-driven Representation Learning for Multi-modal Knowledge Graph Completion [0.0]
We propose a novel framework named Mixture of Complementary Modality Experts (MoCME). MoCME consists of a Complementarity-guided Modality Knowledge Fusion (CMKF) module and an Entropy-guided Negative Sampling (EGNS) mechanism. Our MoCME achieves state-of-the-art performance, surpassing existing approaches.
arXiv Detail & Related papers (2025-07-28T08:35:11Z)
- SUDER: Self-Improving Unified Large Multimodal Models for Understanding and Generation with Dual Self-Rewards [55.99492656542475]
We propose SUDER (Self-improving Unified LMMs with Dual Self-Rewards), a framework reinforcing the understanding and generation capabilities of LMMs.
arXiv Detail & Related papers (2025-06-09T17:38:45Z)