Quantifying the Gap between Understanding and Generation within Unified Multimodal Models
- URL: http://arxiv.org/abs/2602.02140v1
- Date: Mon, 02 Feb 2026 14:19:37 GMT
- Title: Quantifying the Gap between Understanding and Generation within Unified Multimodal Models
- Authors: Chenlong Wang, Yuhang Chen, Zhihan Hu, Dongping Chen, Wenhu Chen, Sarah Wiegreffe, Tianyi Zhou
- Abstract summary: GapEval is a benchmark designed to quantify the gap between understanding and generation capabilities.
Experiments reveal a persistent gap between the two directions across a wide range of UMMs.
Our findings indicate that knowledge within UMMs often remains disjoint.
- Score: 66.07644743841007
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in unified multimodal models (UMMs) have demonstrated remarkable progress in both understanding and generation tasks. However, whether these two capabilities are genuinely aligned and integrated within a single model remains unclear. To investigate this question, we introduce GapEval, a bidirectional benchmark designed to quantify the gap between understanding and generation capabilities and to quantitatively measure the cognitive coherence of the two "unified" directions. Each question can be answered in both modalities (image and text), enabling a symmetric evaluation of a model's bidirectional inference capability and cross-modal consistency. Experiments reveal a persistent gap between the two directions across a wide range of UMMs with different architectures, suggesting that current models achieve only surface-level unification rather than deep cognitive convergence of the two capabilities. To explore the underlying mechanism, we conduct an empirical study from the perspective of knowledge manipulation to illustrate these limitations. Our findings indicate that knowledge within UMMs often remains disjoint: capability emergence and knowledge acquisition across modalities are unsynchronized, opening avenues for further exploration.
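As a concrete illustration of the benchmark's framing, a bidirectional gap can be reduced to two simple paired-question statistics: the accuracy difference between the two directions, and the rate at which the model answers a question consistently in both. The sketch below is a hypothetical reading of such a protocol, not the official GapEval implementation; all names and the 0/1 scoring are assumptions.

```python
# Illustrative sketch (not the official GapEval code): a bidirectional
# understanding-generation gap and a cross-modal consistency score,
# computed from paired per-question 0/1 results. Names are hypothetical.

def capability_gap(und_scores, gen_scores):
    """Absolute accuracy gap between the understanding and generation directions."""
    und_acc = sum(und_scores) / len(und_scores)
    gen_acc = sum(gen_scores) / len(gen_scores)
    return abs(und_acc - gen_acc)

def cross_modal_consistency(und_scores, gen_scores):
    """Fraction of paired questions answered consistently in both directions
    (both correct or both incorrect), a rough proxy for cognitive coherence."""
    agree = sum(u == g for u, g in zip(und_scores, gen_scores))
    return agree / len(und_scores)

# Example: 1 = correct, 0 = incorrect, one entry per paired question.
und = [1, 1, 0, 1, 0]   # image -> text (understanding) direction
gen = [1, 0, 0, 0, 0]   # text -> image (generation) direction
print(capability_gap(und, gen))           # 0.4
print(cross_modal_consistency(und, gen))  # 0.6
```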
Related papers
- UniG2U-Bench: Do Unified Models Advance Multimodal Understanding? [50.92401586025528]
Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear.
We introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks.
arXiv Detail & Related papers (2026-03-03T18:36:16Z)
- Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark [69.8473923357969]
Unified multimodal models aim to jointly enable visual understanding and generation, yet current benchmarks rarely examine their true integration.
We present Uni-MMMU, a comprehensive benchmark that unfolds the bidirectional synergy between generation and understanding across eight reasoning-centric domains.
arXiv Detail & Related papers (2025-10-15T17:10:35Z)
- Beyond Benchmarks: Understanding Mixture-of-Experts Models through Internal Mechanisms [55.1784306456972]
Mixture-of-Experts (MoE) architectures have emerged as a promising direction, offering efficiency and scalability by activating only a subset of parameters during inference.
We use an internal metric to investigate the mechanisms of the MoE architecture by explicitly incorporating routing mechanisms and analyzing expert-level behaviors.
We uncover several findings: (1) neuron utilization decreases as models evolve, reflecting stronger generalization; (2) training exhibits a dynamic trajectory, where benchmark performance alone provides limited signal; (3) task completion emerges from collaborative contributions of multiple experts, with shared experts driving concentration; and (4) activation patterns at the neuron level provide a fine-grained proxy for data diversity.
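To make "utilization" concrete, here is a minimal sketch of one way an expert-level utilization metric could be computed from router outputs. The summary does not specify the paper's actual internal metric, so the top-k routing, tensor shapes, and names below are all assumptions.

```python
# Hedged sketch of measuring expert utilization in an MoE layer; the
# paper's internal metric may be defined differently.
import torch

def expert_utilization(router_logits: torch.Tensor, top_k: int = 2) -> float:
    """Fraction of experts that receive at least one token in a batch.

    router_logits: (num_tokens, num_experts) pre-softmax routing scores.
    """
    num_experts = router_logits.size(-1)
    topk_idx = router_logits.topk(top_k, dim=-1).indices  # (num_tokens, k)
    used = torch.zeros(num_experts, dtype=torch.bool)
    used[topk_idx.flatten()] = True                       # mark routed experts
    return used.float().mean().item()

logits = torch.randn(512, 64)  # hypothetical: 512 tokens, 64 experts
print(expert_utilization(logits, top_k=2))
```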
arXiv Detail & Related papers (2025-09-28T15:13:38Z)
- Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection [0.0]
Multi-modal learning has emerged as a crucial research direction.
Existing approaches often suffer from insufficient cross-modal interactions and rigid fusion strategies.
We propose Co-AttenDWG, which combines co-attention with dimension-wise gating and expert fusion.
We show that Co-AttenDWG achieves state-of-the-art performance and superior cross-modal alignment.
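The dimension-wise gating component, at least as named, suggests a per-feature soft mixture of two modality streams. Below is a minimal sketch of that idea; the full Co-AttenDWG model also includes co-attention and expert fusion, which are omitted here, and all dimensions and names are assumptions.

```python
# Minimal sketch of dimension-wise gated fusion between two modality
# features; not the authors' implementation.
import torch
import torch.nn as nn

class DimensionWiseGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # One sigmoid gate per feature dimension, conditioned on both inputs.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, text_feat: torch.Tensor, img_feat: torch.Tensor):
        g = self.gate(torch.cat([text_feat, img_feat], dim=-1))  # (B, dim)
        return g * text_feat + (1.0 - g) * img_feat              # per-dim mix

fuse = DimensionWiseGate(dim=256)
fused = fuse(torch.randn(4, 256), torch.randn(4, 256))
print(fused.shape)  # torch.Size([4, 256])
```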
arXiv Detail & Related papers (2025-05-25T07:26:00Z)
- A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems [93.8285345915925]
Reasoning is a fundamental cognitive process that enables logical inference, problem-solving, and decision-making.
With the rapid advancement of large language models (LLMs), reasoning has emerged as a key capability that distinguishes advanced AI systems.
We categorize existing methods along two dimensions: (1) Regimes, which define the stage at which reasoning is achieved; and (2) Architectures, which determine the components involved in the reasoning process.
arXiv Detail & Related papers (2025-04-12T01:27:49Z)
- Independence Constrained Disentangled Representation Learning from Epistemological Perspective [13.51102815877287]
Disentangled Representation Learning aims to improve the explainability of deep learning methods by training a data encoder that identifies semantically meaningful latent variables in the data generation process.
There is no consensus regarding the objective of disentangled representation learning.
We propose a novel method for disentangled representation learning that integrates a mutual information constraint with an independence constraint.
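As a rough illustration of what an independence constraint can look like in practice, the sketch below penalizes off-diagonal entries of the latent covariance matrix, a common decorrelation proxy for statistical independence. The paper's actual constraint and its mutual-information term may be formulated quite differently; this is an assumption-laden stand-in.

```python
# Illustrative decorrelation-style independence penalty on latent codes;
# not the paper's exact constraint or MI estimator.
import torch

def independence_penalty(z: torch.Tensor) -> torch.Tensor:
    """Sum of squared off-diagonal entries of the latent covariance.

    z: (batch, latent_dim) latent codes from the encoder.
    """
    zc = z - z.mean(dim=0, keepdim=True)
    cov = (zc.T @ zc) / (z.size(0) - 1)               # (latent_dim, latent_dim)
    off_diag = cov - torch.diag(torch.diagonal(cov))  # zero out the diagonal
    return (off_diag ** 2).sum()

z = torch.randn(128, 10)
print(independence_penalty(z))  # small for uncorrelated codes, larger if correlated
```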
arXiv Detail & Related papers (2024-09-04T13:00:59Z)
- Class-wise Activation Unravelling the Engima of Deep Double Descent [0.0]
Double descent is a counter-intuitive phenomenon in machine learning.
In this study, we revisited the phenomenon of double descent and discussed the conditions of its occurrence.
arXiv Detail & Related papers (2024-05-13T12:07:48Z)
- Causal Intersectionality and Dual Form of Gradient Descent for Multimodal Analysis: a Case Study on Hateful Memes [0.9120312014267044]
We investigate how a model's mechanisms reveal its causal effect on evidence-based decision-making.
This work furthers the dialogue on Causality and XAI.
arXiv Detail & Related papers (2023-08-19T13:14:15Z)
- Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
arXiv Detail & Related papers (2023-05-23T01:24:15Z)
- Unsupervised Discovery, Control, and Disentanglement of Semantic Attributes with Applications to Anomaly Detection [15.817227809141116]
We focus on unsupervised generative representations that discover latent factors controlling image semantic attributes.
For (a), we propose a network architecture that exploits the combination of multiscale generative models with mutual information (MI) maximization.
For (b), we derive an analytical result (Lemma 1) that brings clarity to two related but distinct concepts.
arXiv Detail & Related papers (2020-02-25T20:50:47Z)