Forge-and-Quench: Enhancing Image Generation for Higher Fidelity in Unified Multimodal Models
- URL: http://arxiv.org/abs/2601.04706v1
- Date: Thu, 08 Jan 2026 08:18:44 GMT
- Title: Forge-and-Quench: Enhancing Image Generation for Higher Fidelity in Unified Multimodal Models
- Authors: Yanbing Zeng, Jia Wang, Hanghang Ma, Junqiang Wu, Jie Zhu, Xiaoming Wei, Jie Hu
- Abstract summary: This paper introduces a novel perspective: leveraging understanding to enhance the fidelity and detail richness of generated images. We propose Forge-and-Quench, a new unified framework that puts this principle into practice. Experiments show that Forge-and-Quench significantly improves image fidelity and detail across multiple models.
- Score: 23.529904770014735
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Integrating image generation and understanding into a single framework has become a pivotal goal in the multimodal domain. However, how understanding can effectively assist generation has not been fully explored. Unlike previous works that focus on leveraging reasoning abilities and world knowledge from understanding models, this paper introduces a novel perspective: leveraging understanding to enhance the fidelity and detail richness of generated images. To this end, we propose Forge-and-Quench, a new unified framework that puts this principle into practice. In the generation process of our framework, an MLLM first reasons over the entire conversational context, including text instructions, to produce an enhanced text instruction. This refined instruction is then mapped to a virtual visual representation, termed the Bridge Feature, via a novel Bridge Adapter. This feature acts as a crucial link, forging insights from the understanding model to quench and refine the generation process. It is subsequently injected into the T2I backbone as a visual guidance signal, alongside the enhanced text instruction that replaces the original input. To validate this paradigm, we conduct comprehensive studies on the design of the Bridge Feature and Bridge Adapter. Our framework demonstrates exceptional extensibility and flexibility, enabling efficient migration across different MLLM and T2I models with significant savings in training overhead, all without compromising the MLLM's inherent multimodal understanding capabilities. Experiments show that Forge-and-Quench significantly improves image fidelity and detail across multiple models, while also maintaining instruction-following accuracy and enhancing world knowledge application. Models and codes are available at https://github.com/YanbingZeng/Forge-and-Quench.
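As a rough illustration of the generation pipeline described in the abstract, the sketch below shows how an enhanced instruction produced by the MLLM could be mapped to a Bridge Feature and injected into a T2I backbone as visual guidance. The dimensions, the query-token adapter design, and the `refine_instruction` / `sample` interfaces are assumptions made for illustration only; they are not the authors' implementation.

```python
# Minimal sketch of the Forge-and-Quench generation flow (illustrative only).
import torch
import torch.nn as nn


class BridgeAdapter(nn.Module):
    """Maps hidden states of the MLLM's enhanced text instruction to a virtual
    visual representation (the "Bridge Feature"). A small set of learned query
    tokens attending over the instruction is assumed here for illustration."""

    def __init__(self, text_dim: int = 4096, visual_dim: int = 1024, num_queries: int = 64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, visual_dim) * 0.02)
        self.text_proj = nn.Linear(text_dim, visual_dim)
        self.attn = nn.MultiheadAttention(visual_dim, num_heads=8, batch_first=True)

    def forward(self, instruction_hidden: torch.Tensor) -> torch.Tensor:
        # instruction_hidden: (batch, seq_len, text_dim) from the MLLM
        ctx = self.text_proj(instruction_hidden)                    # (B, T, visual_dim)
        q = self.queries.unsqueeze(0).expand(ctx.size(0), -1, -1)  # (B, Nq, visual_dim)
        bridge_feature, _ = self.attn(q, ctx, ctx)                  # (B, Nq, visual_dim)
        return bridge_feature


def forge_and_quench_generate(mllm, bridge_adapter, t2i_backbone, conversation):
    # 1) The MLLM reasons over the whole conversational context and produces an
    #    enhanced text instruction together with its hidden states
    #    (refine_instruction is a hypothetical interface).
    enhanced_text, instruction_hidden = mllm.refine_instruction(conversation)
    # 2) The Bridge Adapter "forges" that understanding into the Bridge Feature.
    bridge_feature = bridge_adapter(instruction_hidden)
    # 3) The T2I backbone is conditioned on the enhanced instruction (replacing
    #    the original prompt) and on the Bridge Feature as a visual guidance
    #    signal (sample is likewise a hypothetical interface).
    return t2i_backbone.sample(prompt=enhanced_text, visual_guidance=bridge_feature)
```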
Related papers
- Query-Kontext: An Unified Multimodal Model for Image Generation and Editing [53.765351127477224]
Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I). We introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a "multimodal kontext" composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs. Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
arXiv Detail & Related papers (2025-09-30T17:59:46Z) - Understanding-in-Generation: Reinforcing Generative Capability of Unified Model via Infusing Understanding into Generation [43.98469957837991]
We propose a novel reasoning framework for unified models, Understanding-in-Generation (UiG). The core insight of UiG is to integrate generative guidance from the model's strong understanding capabilities during the reasoning process. Our UiG framework demonstrates a significant performance improvement in text-to-image generation over existing text-to-image reasoning methods.
arXiv Detail & Related papers (2025-09-23T04:52:39Z) - Unified Multimodal Model as Auto-Encoder [69.38946823657592]
We introduce a paradigm that regards understanding as the encoder (I2T), compressing images into text, and generation as the decoder (T2I), reconstructing images from that text; a minimal conceptual sketch of this encoder-decoder view is given after this list. Our empirical results suggest that understanding can largely enhance generation (verified on GenEval), while generation, in turn, notably strengthens fine-grained visual perception.
arXiv Detail & Related papers (2025-09-11T17:57:59Z) - Janus-Pro-R1: Advancing Collaborative Visual Comprehension and Generation via Reinforcement Learning [92.57052246970254]
We propose to enable the collaborative co-evolution of visual comprehension and generation. We introduce a two-stage training approach: supervised fine-tuning first equips the MLLM with the foundational ability to generate genuine CoT. We unlock the Aha moment in visual generation, advancing MLLMs from text-to-image tasks to unified image generation.
arXiv Detail & Related papers (2025-06-02T09:39:28Z) - Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation [54.588082888166504]
We present Mogao, a unified framework that enables interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance. Experiments show that Mogao not only achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs.
arXiv Detail & Related papers (2025-05-08T17:58:57Z) - ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement [68.05833403672274]
Existing unified models have struggled to handle three fundamental capabilities within a single model: understanding, generation, and editing. ILLUME+ introduces a unified dual visual tokenizer, DualViTok, which preserves fine-grained textures and text-aligned semantics. We also employ a diffusion model as the image detokenizer for enhanced generation quality and efficient super-resolution.
arXiv Detail & Related papers (2025-04-02T17:45:00Z) - Harmonizing Visual Representations for Unified Multimodal Understanding and Generation [53.01486796503091]
We present Harmon, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder. Harmon achieves state-of-the-art image generation results on the GenEval, MJHQ-30K and WISE benchmarks.
arXiv Detail & Related papers (2025-03-27T20:50:38Z) - VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model [38.61292051733335]
We present VARGPT, a novel multimodal large language model (MLLM) that unifies visual understanding and generation within a single autoregressive framework. VARGPT employs a next-token prediction paradigm for visual understanding and a next-scale prediction paradigm for visual autoregressive generation. Notably, VARGPT naturally supports capabilities in autoregressive visual generation and instruction-to-image synthesis, showcasing its versatility in both visual understanding and generation tasks.
arXiv Detail & Related papers (2025-01-21T17:50:43Z)
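As referenced in the "Unified Multimodal Model as Auto-Encoder" entry above, the following is a minimal conceptual sketch of that encoder-decoder view: understanding acts as the I2T encoder and generation as the T2I decoder, coupled by a reconstruction signal. The function names are hypothetical placeholders, not the paper's actual API.

```python
# Conceptual sketch only: treats a unified multimodal model (umm) as an
# auto-encoder, where understanding compresses an image into text and
# generation reconstructs the image from that text. All names are placeholders.
def autoencoding_step(umm, image, reconstruction_loss):
    caption = umm.image_to_text(image)             # "encoder": I2T understanding
    reconstruction = umm.text_to_image(caption)    # "decoder": T2I generation
    return reconstruction_loss(reconstruction, image)
```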