Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation
- URL: http://arxiv.org/abs/2601.21406v2
- Date: Fri, 30 Jan 2026 09:22:34 GMT
- Title: Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation
- Authors: Zihan Su, Hongyang Wei, Kangrui Cen, Yong Wang, Guanhua Chen, Chun Yuan, Xiangxiang Chu
- Abstract summary: Unified Multimodal Models (UMMs) integrate both visual understanding and generation within a single framework. We propose UniMRG (Unified Multi-Representation Generation), a simple yet effective, architecture-agnostic post-training method. Our method notably enhances fine-grained perception, reduces hallucinations, and improves spatial understanding, while simultaneously boosting generation capabilities.
- Score: 53.18286807225952
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unified Multimodal Models (UMMs) integrate both visual understanding and generation within a single framework. Their ultimate aspiration is to create a cycle where understanding and generation mutually reinforce each other. While recent post-training methods have successfully leveraged understanding to enhance generation, the reverse direction of utilizing generation to improve understanding remains largely unexplored. In this work, we propose UniMRG (Unified Multi-Representation Generation), a simple yet effective architecture-agnostic post-training method. UniMRG enhances the understanding capabilities of UMMs by incorporating auxiliary generation tasks. Specifically, we train UMMs to generate multiple intrinsic representations of input images, namely pixel (reconstruction), depth (geometry), and segmentation (structure), alongside standard visual understanding objectives. By synthesizing these diverse representations, UMMs capture complementary information regarding appearance, spatial relations, and structural layout. Consequently, UMMs develop a deeper and more comprehensive understanding of visual inputs. Extensive experiments across diverse UMM architectures demonstrate that our method notably enhances fine-grained perception, reduces hallucinations, and improves spatial understanding, while simultaneously boosting generation capabilities.
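For concreteness, the sketch below shows one plausible way the described multi-representation post-training objective could be combined: the standard visual-understanding loss is summed with auxiliary generation losses for pixel reconstruction (appearance), depth (geometry), and segmentation (structure). The function signature, loss choices (L1 for pixel and depth, cross-entropy for segmentation), and weighting factors are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a multi-representation post-training objective.
# All names, loss choices, and weights are illustrative assumptions,
# not taken from the UniMRG paper.
import torch.nn.functional as F

def multi_representation_loss(
    answer_logits, answer_ids,   # understanding: token logits [B, T, V] vs. target ids [B, T]
    pixel_pred, image,           # pixel reconstruction (appearance), [B, 3, H, W]
    depth_pred, depth_target,    # depth prediction (geometry), [B, 1, H, W]
    seg_logits, seg_target,      # segmentation logits [B, C, H, W] vs. class map [B, H, W]
    w_pixel=0.1, w_depth=0.1, w_seg=0.1,
):
    # Standard visual-understanding objective (next-token prediction on the answer).
    loss_und = F.cross_entropy(answer_logits.flatten(0, 1), answer_ids.flatten())
    # Auxiliary generation objectives over intrinsic representations of the input image.
    loss_pixel = F.l1_loss(pixel_pred, image)
    loss_depth = F.l1_loss(depth_pred, depth_target)
    loss_seg = F.cross_entropy(seg_logits, seg_target)
    return loss_und + w_pixel * loss_pixel + w_depth * loss_depth + w_seg * loss_seg
```

How the depth and segmentation targets are obtained is not stated in the abstract; the sketch simply assumes they are available for each training image.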
Related papers
- UniG2U-Bench: Do Unified Models Advance Multimodal Understanding? [50.92401586025528]
Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. We introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks.
arXiv Detail & Related papers (2026-03-03T18:36:16Z)
- TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models [96.41974190202642]
Unified multimodal models (UMMs) aim to jointly perform multimodal understanding and generation within a single framework. We present TUNA, a native UMM that builds a unified continuous visual representation by cascading a VAE encoder with a representation encoder.
arXiv Detail & Related papers (2025-12-01T18:59:51Z)
- Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark [69.8473923357969]
Unified multimodal models aim to jointly enable visual understanding and generation, yet current benchmarks rarely examine their true integration. We present Uni-MMMU, a comprehensive benchmark that unfolds the bidirectional synergy between generation and understanding across eight reasoning-centric domains.
arXiv Detail & Related papers (2025-10-15T17:10:35Z)
- Bridging the Gap Between Multimodal Foundation Models and World Models [10.001347956177879]
We investigate what it takes to bridge the gap between multimodal foundation models and world models. Our approaches incorporate scene graphs, multimodal conditioning, and alignment strategies to guide the generation process. We extend these techniques to controllable 4D generation, enabling interactive, editable, and morphable object synthesis over time and space.
arXiv Detail & Related papers (2025-10-04T08:14:20Z)
- UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception [54.53657134205492]
UniAlignment is a unified multimodal generation framework within a single diffusion transformer. It incorporates both intrinsic-modal semantic alignment and cross-modal semantic alignment, thereby enhancing the model's cross-modal consistency and instruction-following robustness. We present SemGen-Bench, a new benchmark specifically designed to evaluate multimodal semantic consistency under complex textual instructions.
arXiv Detail & Related papers (2025-09-28T09:11:30Z)
- A Unified Multi-Agent Framework for Universal Multimodal Understanding and Generation [8.021435739965982]
We propose a modular framework that unifies multimodal understanding and generation via two decoupled phases: Cognition and Deliberation. In Cognition, three role-conditioned multimodal LLM agents - Perceiver, Planner, and Reflector - engage in collaborative dialogue to perform structured understanding and planning. The Deliberation phase incorporates a Growth-Aware Search mechanism that orchestrates LLM-based reasoning and diffusion-based generation in a mutually reinforcing manner.
arXiv Detail & Related papers (2025-08-14T09:52:51Z)
- ARMOR: Empowering Multimodal Understanding Model with Interleaved Multimodal Generation Capability [14.703591553247948]
ARMOR is a resource-efficient, purely autoregressive framework for multimodal large language models. It achieves both understanding and generation by fine-tuning existing MLLMs. We show that ARMOR upgrades existing MLLMs to UniMs with promising image generation capabilities, using limited training resources.
arXiv Detail & Related papers (2025-03-09T10:15:39Z)
- VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model [38.61292051733335]
We present VARGPT, a novel multimodal large language model (MLLM) that unifies visual understanding and generation within a single autoregressive framework. VARGPT employs a next-token prediction paradigm for visual understanding and a next-scale prediction paradigm for visual autoregressive generation. Notably, VARGPT naturally supports autoregressive visual generation and instruction-to-image synthesis, showcasing its versatility in both visual understanding and generation tasks.
arXiv Detail & Related papers (2025-01-21T17:50:43Z)
- Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning [125.79428219851289]
Inst-IT is a solution to enhance LMMs in Instance understanding via explicit visual prompt Instruction Tuning. Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm.
arXiv Detail & Related papers (2024-12-04T18:58:10Z)
- MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model [49.931663904599205]
MaVEn is an innovative framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning.
We show that MaVEn significantly enhances MLLMs' understanding in complex multi-image scenarios, while also improving performance in single-image contexts.
arXiv Detail & Related papers (2024-08-22T11:57:16Z)