Understanding-in-Generation: Reinforcing Generative Capability of Unified Model via Infusing Understanding into Generation
- URL: http://arxiv.org/abs/2509.18639v3
- Date: Thu, 25 Sep 2025 08:19:34 GMT
- Title: Understanding-in-Generation: Reinforcing Generative Capability of Unified Model via Infusing Understanding into Generation
- Authors: Yuanhuiyi Lyu, Chi Kit Wong, Chenfei Liao, Lutao Jiang, Xu Zheng, Zexin Lu, Linfeng Zhang, Xuming Hu,
- Abstract summary: We propose a novel reasoning framework for unified models, Understanding-in-Generation (UiG). The core insight of UiG is to integrate generative guidance from the model's strong understanding capabilities during the reasoning process. Our UiG framework demonstrates a significant performance improvement in text-to-image generation over existing text-to-image reasoning methods.
- Score: 43.98469957837991
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works have made notable advancements in enhancing unified models for text-to-image generation through Chain-of-Thought (CoT) reasoning. However, these reasoning methods separate the processes of understanding and generation, which limits their ability to guide the reasoning of unified models in addressing the deficiencies of their generative capabilities. To this end, we propose a novel reasoning framework for unified models, Understanding-in-Generation (UiG), which harnesses the robust understanding capabilities of unified models to reinforce their performance in image generation. The core insight of UiG is to integrate generative guidance from the model's strong understanding capabilities during the reasoning process, thereby mitigating the limitations of its generative abilities. To achieve this, we introduce "Image Editing" as a bridge to infuse understanding into the generation process. Initially, we verify the generated image and incorporate the understanding of unified models into the editing instructions. Subsequently, we enhance the generated image step by step, gradually infusing the understanding into the generation process. Our UiG framework demonstrates a significant performance improvement in text-to-image generation over existing text-to-image reasoning methods, e.g., a 3.92% gain on the long prompt setting of the TIIF benchmark. Project code: https://github.com/QC-LY/UiG
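The verify-then-edit loop described in the abstract (generate, verify against the prompt using the model's understanding, then refine via image editing) can be sketched as follows. This is a minimal control-flow sketch only: the `generate`, `verify`, and `edit` functions are hypothetical stand-ins for a unified model's interfaces, not the authors' actual API, and the "image" here is a plain dictionary used purely to illustrate the iteration.

```python
# Minimal sketch of the UiG loop: generate an image, use the model's
# understanding to verify it against the prompt, and iteratively refine
# via image editing. All model calls below are stub placeholders.

def generate(prompt):
    # Placeholder for a unified model's text-to-image call.
    return {"prompt": prompt, "edits": []}

def verify(image, prompt):
    # Placeholder for the understanding pass: return an editing
    # instruction describing a remaining mismatch, or None if faithful.
    remaining = [p for p in prompt.split(" and ") if p not in image["edits"]]
    return f"add {remaining[0]}" if remaining else None

def edit(image, instruction):
    # Placeholder for image editing, the bridge that infuses the
    # model's understanding back into generation.
    image["edits"].append(instruction.removeprefix("add "))
    return image

def uig(prompt, max_steps=5):
    image = generate(prompt)
    for _ in range(max_steps):
        instruction = verify(image, prompt)
        if instruction is None:  # image judged faithful to the prompt
            break
        image = edit(image, instruction)
    return image

result = uig("a red cube and a blue sphere")
print(result["edits"])  # each element records one understanding-guided edit
```

The key design point the abstract emphasizes is that verification and refinement use the same unified model, so the editing instructions carry its understanding into generation rather than relying on a separate critic.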
Related papers
- Forge-and-Quench: Enhancing Image Generation for Higher Fidelity in Unified Multimodal Models [23.529904770014735]
This paper introduces a novel perspective: leveraging understanding to enhance the fidelity and detail richness of generated images. We propose Forge-and-Quench, a new unified framework that puts this principle into practice. Experiments show that Forge-and-Quench significantly improves image fidelity and detail across multiple models.
arXiv Detail & Related papers (2026-01-08T08:18:44Z)
- Interleaving Reasoning for Better Text-to-Image Generation [83.69082794730664]
We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals. Experiments show SoTA performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN.
arXiv Detail & Related papers (2025-09-08T17:56:23Z)
- X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again [45.74833463136701]
We develop a semantic image tokenizer, a unified autoregressive model for both language and images, and an offline diffusion decoder for image generation, termed X-Omni. X-Omni achieves state-of-the-art performance in image generation tasks using a 7B language model, producing images with high aesthetic quality while exhibiting strong capabilities in following instructions and rendering long texts.
arXiv Detail & Related papers (2025-07-29T17:59:04Z)
- UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens [54.40871421476035]
We present UniCTokens, a framework that integrates personalized information into a unified vision language model (VLM) for understanding and generation. UniCTokens trains a set of unified concept tokens to leverage complementary semantics, boosting two personalized tasks. Our research demonstrates that enhanced understanding improves generation, and the generation process can yield valuable insights into understanding.
arXiv Detail & Related papers (2025-05-20T17:56:01Z)
- Boosting Generative Image Modeling via Joint Image-Feature Synthesis [15.133906625258797]
We introduce a novel generative image modeling framework that seamlessly bridges the gap by leveraging a diffusion model to jointly model low-level image latents. Our latent-semantic diffusion approach learns to generate coherent image-feature pairs from pure noise. By eliminating the need for complex distillation objectives, our unified design simplifies training and unlocks a powerful new inference strategy: Representation Guidance.
arXiv Detail & Related papers (2025-04-22T17:41:42Z)
- Unified Autoregressive Visual Generation and Understanding with Continuous Tokens [52.21981295470491]
We present UniFluid, a unified autoregressive framework for joint visual generation and understanding. Our unified autoregressive architecture processes multimodal image and text inputs, generating discrete tokens for text and continuous tokens for images. We find that, although there is an inherent trade-off between the image generation and understanding tasks, a carefully tuned training recipe enables them to improve each other.
arXiv Detail & Related papers (2025-03-17T17:58:30Z)
- GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing [66.33454784945293]
Generation Chain-of-Thought (GoT) is a novel paradigm that enables generation and editing through an explicit language reasoning process. GoT transforms conventional text-to-image generation and editing into a reasoning-guided framework.
arXiv Detail & Related papers (2025-03-13T17:59:59Z)
- RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model [93.8067369210696]
Text-to-image generation (TTI) refers to models that process text input and generate high-fidelity images from text descriptions.
Diffusion models are one prominent type of generative model used for image generation, operating through the systematic introduction and removal of noise over repeated steps.
In the era of large models, scaling up model size and integration with large language models have further improved the performance of TTI models.
arXiv Detail & Related papers (2023-09-02T03:27:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.